UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
[Architecture diagram: stages built from LarK and SmaK Blocks, with the Dilated Reparam Block, DW conv, BN, ReLU, SE Block, avg-pool, and 2x/4x downsampling layers.]

For example, when the objective is to extract higher-level local spatial patterns from lower-level ones, a 3×3 conv layer might be a more suitable option than a large-kernel conv layer. The reason is that the latter demands more computations and may result in patterns that are no longer restricted to smaller local regions, which could be undesirable in specific scenarios.
RepLKNet first proposed that simply scaling up the kernel size of existing ConvNets resulted in improvements, especially on downstream tasks [11]. It proposed several guidelines for using large kernels, which were focused on the microstructural design (e.g., using a shortcut alongside the large kernel) and application (large-kernel ConvNets should be evaluated on downstream tasks). In terms of the architecture, RepLKNet merely followed Swin Transformer for simplicity. In the past two years, large-kernel ConvNets have been intensively studied. Some works succeeded in further enlarging the kernel sizes [36], generalizing the idea to 3D scenarios [6] and to many downstream tasks, e.g., image dehazing [42] and super-resolution [75]. However, we note that the architectural design of ConvNets with very large kernels remains under-explored. For example, SLaK [36] followed the architecture developed by ConvNeXt, which is a powerful architecture of medium-sized (7×7) kernels.

3. Architectural Design of UniRepLKNet

3.1. Dilated Reparam Block

It is reported that a large-kernel conv should be used together with a parallel small-kernel one because the latter helps capture small-scale patterns during training [11]. Their outputs are added up after two respective Batch Normalization (BN) [28] layers. After training, with the Structural Re-parameterization [10] methodology, we merge the BN layers into the conv layers so that the small-kernel conv can be equivalently merged into the large-kernel one for inference. In this work, we note that besides small-scale patterns, enhancing the large kernel's capability to capture sparse patterns (i.e., a pixel on a feature map may be more related to some distant pixels than to its neighbors) may yield features of higher quality. The need to capture such patterns exactly matches the mechanism of dilated convolution: from a sliding-window perspective, a dilated conv layer with a dilation rate of r scans the input channel to capture spatial patterns where each pixel of interest is r − 1 pixels away from its neighbor. Therefore, we use dilated conv layers parallel to the large kernel and add up their outputs.

To eliminate the inference costs of the extra dilated conv layers, we propose to equivalently transform the whole block into a single non-dilated conv layer for inference. Since ignoring pixels of the input is equivalent to inserting extra zero entries into the conv kernel, a dilated conv layer with a small kernel can be equivalently converted into a non-dilated (i.e., r = 1) layer with a sparse larger kernel. Let k be the kernel size of the dilated layer; by inserting zero entries, the kernel size of the corresponding non-dilated layer becomes (k − 1)r + 1, which is referred to as the equivalent kernel size for brevity. We further note that such a transformation from the former kernel W ∈ R^{k×k} to the latter W′ ∈ R^{((k−1)r+1)×((k−1)r+1)} can be elegantly realized by a transpose convolution with a stride of r and an identity kernel I ∈ R^{1×1}, which is the scalar 1 viewed as a kernel tensor.¹ With PyTorch-style pseudo code, that is

W′ = conv_transpose2d(W, I, stride = r) .   (1)

The equivalency can be easily verified: given an arbitrary W ∈ R^{k×k} and an arbitrary input channel, a convolution with W and a dilation rate of r always yields results identical to a non-dilated convolution with W′.²

Based on such equivalent transformations, we propose a Dilated Reparam Block, which uses a non-dilated small-kernel layer and multiple dilated small-kernel layers to enhance a non-dilated large-kernel conv layer. Its hyper-parameters include the size of the large kernel K, the sizes of the parallel conv layers k, and the dilation rates r. The shown case (Fig. 2) with four parallel layers is denoted by K=9, r=(1,2,3,4), k=(5,3,3,3). For a larger K, we may use more dilated layers with larger kernel sizes or dilation rates. The kernel sizes and dilation rates of the parallel branches are flexible, and the only constraint is (k − 1)r + 1 ≤ K. For example, with K=13 (the default setting in our experiments), we use five layers with k=(5,7,3,3,3) and r=(1,2,3,4,5), so the equivalent kernel sizes will be (5,13,7,9,11), respectively. To convert a Dilated Reparam Block into a large-kernel conv layer for inference, we first merge every BN into the preceding conv layer, convert every layer with dilation r > 1 with Eq. (1), and add up all the resultant kernels with appropriate zero-paddings. For example, the layer in Fig. 2 with k=3, r=3 is converted into a sparse 7×7 kernel and added to the 9×9 kernel with one-pixel zero padding on each side.

¹ We showcase a single-channel conv, and it is easy to generalize the transformation to multi-channel cases. See the Appendix for details.
² In common cases where the shape of the output equals that of the input, i.e., the padding of the former is (k−1)/2, note that the padding of the latter should be (k−1)r/2 since the size of the equivalent sparse kernel is (k − 1)r + 1.
Figure 2. Dilated Reparam Block uses dilated small-kernel conv layers to enhance a non-dilated large-kernel layer. Such dilated layers are equivalent to a non-dilated conv layer with a larger sparse kernel, as shown from the parameter perspective, so that the whole block can be equivalently transformed into a single large-kernel conv. This example shows K=9; we may use more dilated layers for a larger K.
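To make Eq. (1) and the merging procedure concrete, below is a minimal single-channel PyTorch sketch. It is our own illustration rather than the released implementation (BN fusion is omitted): each dilated branch's kernel is converted to its equivalent sparse kernel with conv_transpose2d, zero-padded to K×K, and summed into one non-dilated large kernel, followed by a numerical check for the K=9 case of Fig. 2.

```python
import torch
import torch.nn.functional as F

def to_dense_kernel(w, r):
    # Eq. (1): a (k, k) kernel used with dilation r is equivalent to a
    # ((k-1)*r+1, (k-1)*r+1) non-dilated kernel obtained by zero insertion.
    if r == 1:
        return w
    identity = torch.ones(1, 1, 1, 1)
    return F.conv_transpose2d(w.view(1, 1, *w.shape), identity, stride=r)[0, 0]

def merge_branches(kernels, dilations, K):
    # Zero-pad every equivalent kernel to K x K and add them up.
    merged = torch.zeros(K, K)
    for w, r in zip(kernels, dilations):
        dense = to_dense_kernel(w, r)
        pad = (K - dense.shape[0]) // 2
        merged = merged + F.pad(dense, [pad, pad, pad, pad])
    return merged

# The Fig. 2 example: a 9x9 kernel plus parallel branches with
# k=(5,3,3,3) and r=(1,2,3,4), all merged into a single 9x9 kernel.
kernels = [torch.randn(9, 9), torch.randn(5, 5), torch.randn(3, 3),
           torch.randn(3, 3), torch.randn(3, 3)]
dilations = [1, 1, 2, 3, 4]
big = merge_branches(kernels, dilations, K=9)

x = torch.randn(1, 1, 32, 32)
y_branches = sum(F.conv2d(x, w.view(1, 1, *w.shape), dilation=r,
                          padding=r * (w.shape[0] - 1) // 2)
                 for w, r in zip(kernels, dilations))
y_merged = F.conv2d(x, big.view(1, 1, 9, 9), padding=4)
print(torch.allclose(y_branches, y_merged, atol=1e-4))  # True
```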
3.2. Architectural Guidelines for Large Kernels

Vanilla architecture. We first construct the vanilla architecture to experiment on. As a common practice, the main body of the model is split into four stages connected by downsampling blocks. Specifically, the first downsampling block uses two stride-2 3×3 conv layers to transform the raw input into C-channel feature maps, where C is an architectural hyper-parameter, and the other three downsampling blocks each use one stride-2 3×3 conv layer performing 2× channel expansion, so that the numbers of channels of the four stages are C, 2C, 4C, and 8C, respectively. A stage comprises blocks whose vanilla design resembles ConvNeXt, i.e., a depthwise (DW) conv layer and a Feed-Forward Network (FFN) with a GRN unit [70], but we use BN instead of LayerNorm [1] after the conv layer, as BN can be equivalently merged into the conv layer to eliminate its inference costs; the BN after the FFN can also be equivalently merged into the preceding layer (i.e., the second linear layer in the FFN). The numbers of such blocks in the four stages are denoted by N (N1, N2, N3, N4), respectively. Following ConvNeXt-T, the vanilla architecture uses C=96 and N=(3,3,9,3). By default, the last three stages use a 13×13 Dilated Reparam Block as the DW layer, which means K=13, k=(5,7,3,3,3) and r=(1,2,3,4,5); the first stage uses a DW 3×3 conv as the DW layer.

Figure 3. Options of the extra structures to increase the depth.

Table 1. Models with different efficient extra structures to increase the depth. We report the ImageNet accuracy (Acc), ADE20K mIoU, and actual throughput (Img/s).
Extra structure | Params | FLOPs | Img/s | Acc | mIoU
None | 31.3M | 4.92G | 1954 | 81.2 | 45.1
(A) Bottleneck | 32.9M | 5.18G | 1716 | 81.5 | 46.3
(B) Two 1×1 | 32.9M | 5.17G | 1745 | 81.3 | 46.2
(C) Two DW 3×3 | 31.4M | 4.96G | 1659 | 81.3 | 45.4
(D) SE Block | 32.9M | 4.92G | 1863 | 81.6 | 46.5

Experimental settings and metrics. It has been emphasized in the literature [11] that large-kernel ConvNets should be evaluated on downstream tasks, as their full potential may not be accurately reflected by the ImageNet accuracy alone. Therefore, in addition to the ImageNet-1K accuracy after 100-epoch training, we transfer the trained model with UPerNet [74] to ADE20K to examine its performance on semantic segmentation and report the single-scale mIoU after a 160k-iteration standard finetuning process [7]. Besides the parameters and FLOPs, we test the actual throughput on an A100 GPU with a batch size of 128 and an input resolution of 224×224, measured in images per second (img/s). See the Appendix for detailed configurations.
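As a reference for the throughput metric, a minimal sketch of how images-per-second can be measured on a single GPU with a fixed batch size and resolution (our illustration; the paper's exact measurement script may differ):

```python
import time
import torch

@torch.no_grad()
def throughput_img_per_s(model, batch_size=128, resolution=224,
                         warmup=20, iters=50):
    # Forward-only throughput with a fixed batch size and input resolution.
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device="cuda")
    for _ in range(warmup):            # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()           # wait for all queued GPU work
    return batch_size * iters / (time.time() - start)
```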
Architectural Guideline 1 on Block Design: use efficient structures that perform both inter-channel communications and spatial aggregations to increase the depth. We first seek to insert some structures to universally boost the model's representational capacity, which requires nonlinearity and efficient trainable transformations. We naturally try a bottleneck composed of a 1×1 conv that reduces the channels to 1/4, a DW 3×3 conv, and another 1×1 conv to expand the channels back (Fig. 3). We use BN and ReLU after the conv layers as a common practice. Table 1 shows that the performance improves with acceptable costs (+1.2 mIoU with a 12% slowdown). The performance degrades as we remove the DW 3×3 conv so that only the two 1×1 conv layers remain, or replace the bottleneck structure with two DW 3×3 layers, suggesting that such structures require both spatial aggregation transformations and channel mixing. Motivated by this, and considering that SE Block [24] elegantly realizes both transformations in a more efficient way (i.e., global average pooling and nonlinear mapping of the pooled vectors), we try it, also with a 1/4 channel reduction, and observe better performance and higher throughput. We therefore use the SE Block as a substructure of our block design in the following explorations.
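For concreteness, option (D) above, i.e., global average pooling followed by a nonlinear mapping of the pooled vector with 1/4 channel reduction whose output re-scales the feature map, can be sketched as follows (our illustration; the actual implementation may differ in details):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation substructure with 1/4 channel reduction (option D).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.down = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.up = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)   # global average pooling
        s = self.relu(self.down(s))            # nonlinear mapping of pooled vector
        s = torch.sigmoid(self.up(s))          # per-channel gates
        return x * s                           # re-scale the feature map
```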
Architectural Guideline 2 on Re-parameterization: use dilated small kernels to re-parameterize a large kernel. For a fair comparison with the Dilated Reparam Block, we try two variants with the same numbers of parallel branches composed of non-dilated layers with A) the same kernel sizes or B) the same equivalent kernel sizes. For our default setting of K=13, r=(1,2,3,4,5), k=(5,7,3,3,3), the kernel sizes of the five branches will be k=(5,7,3,3,3) or (5,13,7,9,11) for the two variants, respectively. All the models end up with the same inference structure, but the training structures differ. Table 2 shows lower performance of the variants, suggesting that the large kernel benefits from the parallel dilated conv layers' ability to capture sparse patterns, rather than merely from the extra small kernels (variant A) or the combination of different receptive fields (variant B). We use the Dilated Reparam Block in the following explorations.³

³ While describing the architecture in this paper, using a K×K (K≥9) conv means a K×K Dilated Reparam Block, unless otherwise noted.
Table 2. Models with different forms of Structural Re-parameterization on the 13×13 conv layers.
Re-param | k | r | Acc | mIoU
None | N/A | N/A | 81.4 | 45.8
Dilated Reparam | 5,7,3,3,3 | 1,2,3,4,5 | 81.6 | 46.5
Same kernel size | 5,7,3,3,3 | 1,1,1,1,1 | 81.6 | 46.2
Same eq kernel size | 5,13,7,9,11 | 1,1,1,1,1 | 81.6 | 46.2

Table 3. Models with different kernel sizes in the four stages, denoted by S1 - S4. Numbers in parentheses are obtained with the UPerNet only taking the outputs of S4.
S1 S2 S3 S4 | Params | FLOPs | Img/s | Acc | mIoU
3 13 13 13 | 32.9M | 4.92G | 1863 | 81.6 | 46.5 (42.4)
3 11 11 11 | 32.6M | 4.86G | 1876 | 81.6 | 45.5 (41.9)
3 3 13 13 | 32.8M | 4.85G | 2006 | 81.7 | 46.1
3 13 3 13 | 32.4M | 4.81G | 2015 | 81.6 | 45.9
3 13 13 3 | 32.5M | 4.90G | 1884 | 81.4 | 45.8
3 15 15 15 | 33.3M | 4.99G | 1851 | 81.7 | 45.9 (42.7)
13 13 13 13 | 33.0M | 5.06G | 1547 | 81.6 | 44.9 (42.4)

Table 4. Different numbers of LarK and SmaK Blocks in Stage 3.
N3 | LarK | SmaK | Params | FLOPs | Img/s | Acc | mIoU
9 | 9 | 0 | 32.9M | 4.92G | 1863 | 81.6 | 46.5
27 | 27 | 0 | 56.7M | 9.31G | 1145 | 82.3 | 49.0
27 | 14 | 13, 3×3 | 55.9M | 9.15G | 1229 | 82.3 | 48.8
27 | 9 | 18, 3×3 | 55.6M | 9.10G | 1264 | 82.3 | 48.8
27 | 9 | 18, w/o 3×3 | 55.5M | 9.08G | 1289 | 82.2 | 47.8

Architectural Guideline 3 on Kernel Size: decide the kernel size according to the downstream task and usually use large kernels in middle- and high-level layers. As introduced above, the baseline model uses 3×3 conv in the first stage and 13×13 in the last three stages. Table 3 shows that replacing the large kernels in the last three stages with 3×3 or changing K from 13 to 11 degrades the models, especially in the ADE20K mIoU, which highlights the significance of large kernels. Interestingly, using 13×13 in Stage 1 or enlarging K from 13 to 15 makes almost no difference in the ImageNet accuracy but reduces the ADE20K mIoU.

Remark. We argue that this phenomenon does not mean larger kernels result in lower feature quality. It is due to the structural priors of UPerNet, which takes the features extracted by the low-level layers of the backbone and assumes they should only encode local information, so that combining them with the high-level features extracted from the last layers of the backbone results in better segmentation. With larger kernels in the lower stages, the low-level features are no longer confined to small local areas, so the UPerNet benefits less from combining them with the high-level features. We verify this explanation by making the UPerNet only use the high-level features (i.e., the outputs of Stage 4) to evaluate the quality of the eventual features alone. Under this setting, K=15 delivers the best mIoU (42.7), the model with large kernels in Stage 1 performs as well as the baseline (42.4), and K=11 performs the worst (41.9). Such observations confirm that large kernels, even when they are used inappropriately, do not damage the feature quality of the ConvNet but merely make the low-level features less favorable for certain downstream models that require local low-level features, suggesting we should decide the kernel size according to the specific downstream tasks and framework. In our specific use cases (i.e., representative image recognition tasks with common downstream frameworks), we employ 13×13 kernels in the middle- and high-level stages by default.

Architectural Guideline 4 on the Scaling Rule: while scaling up the depth, the added blocks should use small kernels. The scaling rule of existing large-kernel ConvNets follows the traditional ConvNets, i.e., stacking more large kernels to build up a deeper model, but we argue that a large-kernel ConvNet may not benefit from more large kernels. In this group of experiments (Table 4), we scale up N3 from 9 to 27, following ConvNeXt-S [40]. Considering that nine 13×13 blocks may have already built up a sufficient receptive field, we examine whether the added blocks should also use large kernels. Specifically, we refer to the block with a Dilated Reparam Block as the Large Kernel Block (LarK Block) and name a block that uses a DW 3×3 conv a Small Kernel Block (SmaK Block), so that the shallow model has 3 SmaK Blocks in Stage 1 and 3, 9, 3 LarK Blocks in the last three stages. While scaling up the depth of Stage 3, we tried the following options. A) All of the 27 blocks are LarK Blocks. B) We interleave SmaK Blocks with LarK Blocks so that Stage 3 has 14 LarK Blocks and 13 SmaK Blocks. C) We place two SmaK Blocks after each LarK Block so that the resultant model has the same 9 LarK Blocks as before but 18 extra SmaK Blocks. D) We remove the DW 3×3 layers in the SmaK Blocks. Table 4 shows that scaling up the depth brings significant improvements, which is expected, and that 9 LarK Blocks are sufficient. Though 27 LarK Blocks perform slightly better in the ADE20K mIoU, the inference speed is observably slowed down. Besides, the model without 3×3 conv in the SmaK Blocks shows significantly lower mIoU with only minor improvements in the throughput, suggesting such small kernels in SmaK Blocks are useful while scaling up the depth of a large-kernel ConvNet, as they increase the abstraction hierarchy of spatial patterns, though they may not effectively enlarge the ERF [11, 43]. This observation supports our motivation to decouple the effects of conv layers in enlarging the ERF and extracting more complicated spatial patterns, as discussed in Sec. 1.
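As a small illustration of these options (our own notation, not code from the paper), the Stage-3 block layouts compared in Table 4 can be generated as follows; option C yields the "9 + 18" layout used by the larger models in Table 5.

```python
def stage3_layout(option, num_lark=9):
    # Block-type sequences for the Stage-3 options compared in Table 4.
    if option == "A":                      # 27 LarK Blocks
        return ["LarK"] * 27
    if option == "B":                      # interleaved: 14 LarK + 13 SmaK
        return ["LarK" if i % 2 == 0 else "SmaK" for i in range(27)]
    if option in ("C", "D"):               # 9 LarK + 18 SmaK ("9 + 18")
        layout = []
        for _ in range(num_lark):
            layout += ["LarK", "SmaK", "SmaK"]
        return layout                      # option D drops the DW 3x3 inside SmaK
    raise ValueError(option)

print(stage3_layout("C").count("LarK"), stage3_layout("C").count("SmaK"))  # 9 18
```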
3.3. Architectural Specifications

Following our proposed guidelines, we instantiate a series of models (Table 5). For a fair comparison with ConvNeXt V2 [70], UniRepLKNet-A/F/P/N follows its configurations. We scale up the depth to build UniRepLKNet-T/S and scale up the width to construct UniRepLKNet-S/B/L/XL.
Table 5. Architectural hyper-parameters of UniRepLKNet instances, including the number of blocks in the four stages N1, N2, N3, N4 and the channels C of the first stage. Stage 1 uses SmaK Blocks, and Stages 2 and 4 use LarK Blocks only. For Stage 3, e.g., "9 + 18" means 9 LarK Blocks and 18 SmaK Blocks.
Model | N1 | N2 | N3 | N4 | C | Params
UniRepLKNet-A | 2 | 2 | 6 + 0 | 2 | 40 | 4.4M
UniRepLKNet-F | 2 | 2 | 6 + 0 | 2 | 48 | 6.2M
UniRepLKNet-P | 2 | 2 | 6 + 0 | 2 | 64 | 10.7M
UniRepLKNet-N | 2 | 2 | 8 + 0 | 2 | 80 | 18.3M
UniRepLKNet-T | 3 | 3 | 9 + 9 | 3 | 80 | 31.0M
UniRepLKNet-S | 3 | 3 | 9 + 18 | 3 | 96 | 55.6M
UniRepLKNet-B | 3 | 3 | 9 + 18 | 3 | 128 | 97.9M
UniRepLKNet-L | 3 | 3 | 9 + 18 | 3 | 192 | 218.3M
UniRepLKNet-XL | 3 | 3 | 9 + 18 | 3 | 256 | 386.4M
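For quick reference, the Table 5 hyper-parameters can be written as a plain lookup (a transcription of the table, not the official configuration format); the Stage-3 entry is split into (LarK, SmaK) counts.

```python
# ((N1, N2, (LarK, SmaK), N4), C) per UniRepLKNet variant, from Table 5.
UNIREPLKNET_CONFIGS = {
    "A":  ((2, 2, (6, 0),  2), 40),
    "F":  ((2, 2, (6, 0),  2), 48),
    "P":  ((2, 2, (6, 0),  2), 64),
    "N":  ((2, 2, (8, 0),  2), 80),
    "T":  ((3, 3, (9, 9),  3), 80),
    "S":  ((3, 3, (9, 18), 3), 96),
    "B":  ((3, 3, (9, 18), 3), 128),
    "L":  ((3, 3, (9, 18), 3), 192),
    "XL": ((3, 3, (9, 18), 3), 256),
}
```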
3.4. Generalizing UniRepLKNet beyond Image

To utilize the universal perception ability of UniRepLKNet, we preprocess the data of different modalities into B × C′ × H × W embedding maps, where B is the batch size and C′ is determined by the modality, and configure the input channel of the first layer of UniRepLKNet to C′. For simplicity, the other parts of the models are the same as the UniRepLKNet initially designed for images, without any modality-specific customization. By doing so, we directly apply a ConvNet typically used for image tasks to deal with data of other modalities. In other words, the UniRepLKNet for image tasks can be seen as a general UniRepLKNet with C′=3 and no such preprocessing. We introduce how to transform the data into such embedding maps as follows.

Time-series. Let L and D be the length and dimensions of a time-series sequence x_T ∈ R^{B×L×D}. We adopt the embedding layer in Corrformer [71] to split it into n nodes and then project it into a latent space R^{Bn×L×D′} (D′ and n are configurable hyper-parameters of the embedding layer). Then we simply reshape it into a single-channel embedding map:

x_T ∈ R^{B×L×D} → R^{Bn×L×(D/n)} → R^{Bn×L×D′} → R^{Bn×1×H×W} s.t. HW = LD′.   (2)

Audio. Let T and F be the numbers of time frames and frequency bins; we use x_A ∈ R^{B×T×F} to represent audio data. A sample is seen as a 1 × T × F embedding map that resembles a single-channel image, so C′=1, H=T, W=F:

x_A ∈ R^{B×T×F} → R^{B×1×T×F}.   (3)

Point cloud. Assume a sample comprises P points, each represented by its X/Y/Z coordinates; we use a series of conv layers to generate three-view projections [78]. We configure the resolution of the generated projections to be 224, so that H=W=224 and C′=3:

x_P ∈ R^{B×P×3} → R^{B×3×224×224}.   (4)

Video. We represent a video as NF frames, each frame being a 3 × h × w image. We reshape it by merging the frame dimension into the height and width dimensions, so that we obtain a representation that can be viewed as a single image created by laying out (i.e., concatenating) the NF frames. For example, in our experiments, we have NF=16 and h=w=224, so that H=W=896. Generally,

x_V ∈ R^{B×NF×3×h×w} → R^{B×3×H×W} s.t. HW/(hw) = NF.   (5)
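The audio and video mappings in Eqs. (3) and (5) are pure reshapes, so they can be sketched directly (the time-series and point-cloud branches additionally need the Corrformer embedding layer and the learned projection convs and are omitted). The 4×4 frame grid below is one way to lay out NF=16 frames; the paper only requires HW/(hw) = NF.

```python
import torch

def audio_to_map(x_a):
    # Eq. (3): a (B, T, F) spectrogram becomes a (B, 1, T, F) single-channel map.
    return x_a.unsqueeze(1)

def video_to_map(x_v, grid=4):
    # Eq. (5): (B, NF, 3, h, w) -> (B, 3, H, W) by tiling the NF = grid**2
    # frames into a grid x grid layout, so that H*W / (h*w) == NF.
    b, nf, c, h, w = x_v.shape
    assert nf == grid * grid
    x = x_v.reshape(b, grid, grid, c, h, w)
    x = x.permute(0, 3, 1, 4, 2, 5)               # (B, 3, grid, h, grid, w)
    return x.reshape(b, c, grid * h, grid * w)

frames = torch.randn(2, 16, 3, 224, 224)
print(video_to_map(frames).shape)                 # torch.Size([2, 3, 896, 896])
```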
4. UniRepLKNet for Image Recognition

ImageNet classification. Following ConvNeXt [40], we use the widely adopted 300-epoch recipe to train UniRepLKNet-A/F/P/N/T/S on ImageNet-1K; we pretrain UniRepLKNet-S/B/L/XL on ImageNet-22K using the 90-epoch recipe and fine-tune with ImageNet-1K for 30 epochs (see the Appendix for details). As our goal is to develop models that run with high actual speed, we evaluate the actual throughput on the same A100 GPU using a batch size of 128. Table 6 shows the top-1 accuracy on the ImageNet-1K validation set, where the results are sorted by the throughput. We split the results into seven segments for better readability. 1) UniRepLKNet-A/F outperforms ConvNeXt-V2-A/F by 0.8/0.6 in accuracy and runs 19%/17% faster, respectively. 2) UniRepLKNet-P/N outperforms FastViT-T12/S12 and ConvNeXt V2-P/N by clear margins. 3) UniRepLKNet-T outperforms multiple small-level competitors. 4) UniRepLKNet-S outperforms a series of small-level and even base-level models in both speed and accuracy and runs almost as fast as InternImage-T. 5) With ImageNet-22K pretraining, UniRepLKNet-S even approaches the accuracy of RepLKNet-31L and runs 3× as fast as the latter. UniRepLKNet-B outperforms CoAtNet-2 and DeiT III-B by clear margins. UniRepLKNet-L outperforms InternImage-L in both accuracy and throughput. 6) On the XL level, UniRepLKNet-XL outperforms its competitors in both accuracy and throughput, running more than 2× as fast as CoAtNet-3 and 3× as fast as DeiT III-L.

COCO object detection and instance segmentation. We transfer the pretrained UniRepLKNets as the backbones of Cascade Mask R-CNN [3, 20] and adopt the standard 3× (36-epoch) training configuration with MMDetection [5]. Table 7 shows that UniRepLKNet outperforms Swin, ConvNeXt, RepLKNet, and SLaK, which are representatives of ViTs, modern medium-kernel ConvNets, and existing large-kernel ConvNets, respectively, and shows comparable performance to InternImage [67], which is a recent powerful architecture with deformable convolution.

ADE20K semantic segmentation. We use the pretrained UniRepLKNets as the backbones of UPerNet [74] on ADE20K [80] and adopt the standard 160k-iteration training recipe with MMSegmentation [7]. Table 8 reports the mIoU on the validation set. Impressively, UniRepLKNet outperforms InternImage and the other models.
Table 6. ImageNet classification. Throughput is tested with an A100 GPU and a batch size of 128. "T"/"C" denote Transformer/ConvNet. "‡" indicates ImageNet-22K [9] pretraining.
Method | Type | Input size | Params (M) | FLOPs (G) | Throughput (img/s) | Acc (%)
UniRepLKNet-A | C | 224² | 4.4 | 0.6 | 5942 | 77.0
UniRepLKNet-F | C | 224² | 6.2 | 0.9 | 5173 | 78.6
ConvNeXt V2-A [70] | C | 224² | 3.7 | 0.5 | 5054 | 76.2
FastViT-T8 [63] | T | 256² | 3.6 | 0.7 | 5025 | 75.6
ConvNeXt V2-F [70] | C | 224² | 5.2 | 0.8 | 4329 | 78.0
UniRepLKNet-P | C | 224² | 10.7 | 1.6 | 3949 | 80.2
FastViT-T12 [63] | T | 256² | 6.8 | 1.4 | 3407 | 79.1
ConvNeXt V2-P [70] | C | 224² | 9.1 | 1.4 | 3339 | 79.7
FastViT-S12 [63] | T | 256² | 8.8 | 1.8 | 3162 | 79.8
UniRepLKNet-N | C | 224² | 18.3 | 2.8 | 2807 | 81.6
ConvNeXt V2-N [70] | C | 224² | 15.6 | 2.4 | 2405 | 81.2
UniRepLKNet-T | C | 224² | 31 | 4.9 | 1804 | 83.2
FastViT-SA24 [63] | T | 256² | 21 | 3.8 | 1670 | 82.6
PVTv2-B2 [66] | T | 224² | 25 | 4.0 | 1620 | 82.0
CoAtNet-0 [8] | T | 224² | 25 | 4.2 | 1613 | 81.6
DeiT III-S [62] | T | 224² | 22 | 4.6 | 1485 | 81.4
SwinV2-T/8 [39] | T | 256² | 28 | 6 | 1406 | 81.8
SLaK-T [36] | C | 224² | 30 | 5.0 | 1312 | 82.5
InternImage-T [67] | C | 224² | 30 | 5 | 1292 | 83.5
UniRepLKNet-S | C | 224² | 56 | 9.1 | 1265 | 83.9
ConvNeXt-S [40] | C | 224² | 50 | 8.7 | 1182 | 83.1
HorNet-T [51] | C | 224² | 23 | 3.9 | 1162 | 83.0
FastViT-SA36 [63] | T | 256² | 30 | 5.6 | 1151 | 83.6
CoAtNet-1 [8] | T | 224² | 42 | 8.4 | 969 | 83.3
SLaK-S [36] | C | 224² | 55 | 9.8 | 967 | 83.8
FastViT-MA36 [63] | T | 256² | 43 | 7.9 | 914 | 83.9
SwinV2-S/8 [39] | T | 256² | 50 | 12 | 871 | 83.7
RepLKNet-31B [11] | C | 224² | 79 | 15.3 | 859 | 83.5
PVTv2-B5 [66] | T | 224² | 82 | 11.8 | 802 | 83.8
UniRepLKNet-S‡ | C | 384² | 56 | 26.7 | 435 | 86.4
ConvNeXt-S‡ [40] | C | 384² | 50 | 25.5 | 415 | 85.8
UniRepLKNet-B‡ | C | 384² | 98 | 47.2 | 314 | 87.4
ConvNeXt-B‡ [40] | C | 384² | 89 | 45.1 | 304 | 86.8
UniRepLKNet-L‡ | C | 384² | 218 | 105.4 | 190 | 87.9
ConvNeXt-L‡ [40] | C | 384² | 198 | 101 | 185 | 87.5
CoAtNet-2‡ [8] | T | 384² | 75 | 49.8 | 163 | 87.1
RepLKNet-31L‡ [11] | C | 384² | 172 | 96.0 | 158 | 86.6
InternImage-L‡ [67] | C | 384² | 223 | 108 | 143 | 87.7
DeiT III-B‡ [62] | T | 384² | 87 | 55.5 | 138 | 86.7
UniRepLKNet-XL‡ | C | 384² | 386 | 187 | 131 | 88.0
ConvNeXt-XL‡ [40] | C | 384² | 350 | 179 | 129 | 87.8
HorNet-L‡ [51] | C | 384² | 202 | 102 | 127 | 87.7
InternImage-XL‡ [67] | C | 384² | 335 | 163 | 114 | 88.0
CoAtNet-3‡ [8] | T | 384² | 168 | 107 | 103 | 87.6
SwinV2-L/24‡ [39] | T | 384² | 197 | 115 | 88 | 87.6
CoAtNet-4‡ [8] | T | 384² | 275 | 190 | 58 | 87.9
DeiT III-L‡ [62] | T | 384² | 305 | 191 | 42 | 87.7

Table 7. Object detection on the COCO validation set. FLOPs are measured with 1280×800 inputs. "‡" indicates ImageNet-22K pretraining.
Method | Params (M) | FLOPs (G) | AP^box | AP^mask
UniRepLKNet-T | 89 | 749 | 51.7 | 44.9
Swin-T [37] | 86 | 745 | 50.4 | 43.7
ConvNeXt-T [40] | 86 | 741 | 50.4 | 43.7
SLaK-T [36] | - | - | 51.3 | 44.3
UniRepLKNet-S | 113 | 835 | 53.0 | 45.9
Swin-S [37] | 107 | 838 | 51.9 | 45.0
ConvNeXt-S [40] | 108 | 827 | 51.9 | 45.0
UniRepLKNet-S‡ | 113 | 835 | 54.3 | 47.1
UniRepLKNet-B‡ | 155 | 978 | 54.8 | 47.4
Swin-B‡ [37] | 145 | 982 | 53.0 | 45.8
ConvNeXt-B‡ [40] | 146 | 964 | 54.0 | 46.9
RepLKNet-31B‡ [11] | 137 | 965 | 52.2 | 45.2
UniRepLKNet-L‡ | 276 | 1385 | 55.8 | 48.4
Swin-L‡ [37] | 253 | 1382 | 53.9 | 46.7
ConvNeXt-L‡ [40] | 255 | 1354 | 54.8 | 47.6
RepLKNet-31L‡ [11] | 229 | 1321 | 53.9 | 46.5
InternImage-L‡ [67] | 277 | 1399 | 56.1 | 48.5
UniRepLKNet-XL‡ | 443 | 1952 | 56.4 | 49.0
InternImage-XL‡ [67] | 387 | 1782 | 56.2 | 48.8
ConvNeXt-XL‡ [40] | 407 | 1898 | 55.2 | 47.7

Table 8. Semantic segmentation on the ADE20K validation set. The FLOPs are measured with 512×2048 or 640×2560 inputs according to the crop size. "SS" and "MS" mean single- and multi-scale testing, respectively. "‡" indicates ImageNet-22K [9] pretraining.
Method | Crop size | Params (M) | FLOPs (G) | mIoU (SS) | mIoU (MS)
UniRepLKNet-T | 512² | 61 | 946 | 48.6 | 49.1
Swin-T [37] | 512² | 60 | 945 | 44.5 | 45.8
ConvNeXt-T [40] | 512² | 60 | 939 | 46.0 | 46.7
SLaK-T [36] | 512² | 65 | 936 | 47.6 | -
InternImage-T [67] | 512² | 59 | 944 | 47.9 | 48.1
UniRepLKNet-S | 512² | 86 | 1036 | 50.5 | 51.0
Swin-S [37] | 512² | 81 | 1038 | 47.6 | 49.5
ConvNeXt-S [40] | 512² | 82 | 1027 | 48.7 | 49.6
SLaK-S [36] | 512² | 91 | 1028 | 49.4 | -
InternImage-S [67] | 512² | 80 | 1017 | 50.1 | 50.9
UniRepLKNet-S‡ | 640² | 86 | 1618 | 51.9 | 52.7
UniRepLKNet-B‡ | 640² | 130 | 1850 | 53.5 | 53.9
Swin-B‡ [37] | 640² | 121 | 1841 | 50.0 | 51.7
ConvNeXt-B‡ [40] | 640² | 122 | 1828 | 52.6 | 53.1
RepLKNet-31B‡ [11] | 640² | 112 | 1829 | 51.5 | 52.3
UniRepLKNet-L‡ | 640² | 254 | 2507 | 54.5 | 55.0
Swin-L‡ [37] | 640² | 234 | 2468 | 52.1 | 53.5
RepLKNet-31L‡ [11] | 640² | 207 | 2404 | 52.4 | 52.7
ConvNeXt-L‡ [40] | 640² | 235 | 2458 | 53.2 | 53.7
InternImage-L‡ [67] | 640² | 256 | 2526 | 53.9 | 54.1
UniRepLKNet-XL‡ | 640² | 425 | 3420 | 55.2 | 55.6
ConvNeXt-XL‡ [40] | 640² | 391 | 3335 | 53.6 | 54.0
InternImage-XL‡ [67] | 640² | 368 | 3142 | 55.0 | 55.3

5. Universal Perception on other Modalities

Time-series. Following Corrformer [71], we conduct experiments on the Global Temperature and Wind Speed Forecasting challenge⁴ using the dataset collected from the National Centers for Environmental Information (NCEI). This huge-scale dataset contains hourly averaged wind speed and temperature data from 3,850 stations with different geographical scales and densities, spanning from 2019 to 2021. For a fair comparison with Corrformer, which was the previous state-of-the-art method, we use its embedding layer (as introduced in Sec. 3.4) and decoder and only replace its encoder transformer with UniRepLKNet-S. We also compare UniRepLKNet-S against a wide range of methods including statistical and numerical approaches. We make the following observations from Table 9. 1) In terms of the forecasting precision, UniRepLKNet delivers a new state-of-the-art performance, achieving the lowest errors of 7.602, 1.832, 3.865, and 1.301 for MSE and MAE in forecasting the global temperature and wind speed, respectively.

⁴ https://codeocean.com/capsule/0341365/tree/v1
Table 9. Time-series forecasting performance on the Global Temperature and Wind Speed Forecasting challenge. UniRepLKNet delivers a new state-of-the-art performance in Mean Squared Error (MSE) and Mean Absolute Error (MAE). GFS (https://www.ncei.noaa.gov/) stands for the Global Forecasting System.
Method | Type | Params | Temperature MSE ↓ | Temperature MAE ↓ | Wind speed MSE ↓ | Wind speed MAE ↓
Statistics-based
Holt–Winters [27] | - | - | 13.241 | 2.262 | 5.912 | 1.664
Prophet [59] | - | - | 11.626 | 2.946 | 9.691 | 2.382
GDBT [NeurIPS'17] [30] | - | - | 9.706 | 2.214 | 4.101 | 1.417
Numerical Simulation
GFS (reanalysis) | - | - | 14.933 | 2.287 | 9.993 | 2.340
ERA5 (reanalysis) [21] | - | - | 13.448 | 1.908 | 4.999 | 1.587
DeepAR [52] | - | - | 32.249 | 4.262 | 5.248 | 1.602
N-BEATS [44] | - | - | 9.203 | 2.117 | 4.124 | 1.390
Deep Learning Specialist
StemGNN [NeurIPS'20] [4] | GNN | 180M | 13.926 | 2.746 | 4.066 | 1.389
Pyraformer [ICLR'21] [35] | Transformer | 158M | 23.326 | 3.669 | 4.614 | 1.514
Corrformer [Nat. Mach. Intell.'23] [71] | Transformer | 155M | 7.709 | 1.888 | 3.889 | 1.304
Generalist
UniRepLKNet-S | ConvNet | 132M | 7.602 | 1.832 | 3.865 | 1.301

Table 11. Video recognition accuracy on Kinetics-400.
Method | Pretrain | Type | Acc (%) | Params
Specialist
SlowFast-101 [13] | IN-1K | ConvNet+RNN | 79.8 | 62.8M
MViTv2-B [33] | IN-1K | Transformer | 81.2 | 51.2M
TimeSFormer [2] | K400 | Transformer | 80.7 | 122M
Generalist
Meta-Transformer [78] | LAION-2B | Transformer | 47.3 | 86.9M
ImageBind [14] | CLIP Data | Transformer | 50.0 | 632M
UniRepLKNet-S | - | ConvNet | 54.8 | 55.5M

Table 12. Point cloud analysis on ModelNet-40.
Method | Type | mAcc (%) | OA (%)
PointNet [46] | MLP | 86.0 | 89.2
PointNet++ [47] | MLP | - | 91.9
PointConv [72] | ConvNet | - | 92.5
KPConv [60] | ConvNet | - | 92.9
DGCNN [68] | ConvNet | 90.2 | 92.9
UniRepLKNet-S | ConvNet | 90.3 | 93.2
6. Conclusion

We proposed four architectural guidelines for large-kernel ConvNets and an architecture named UniRepLKNet, which not only shows a leading performance in image recognition but also achieves remarkable results even on modalities such as audio and time-series data, outperforming multiple specialist models on those modalities. Such results not only signify a "comeback" for ConvNets in their original domain but also showcase large-kernel ConvNets' potential to "conquer" new territories, highlighting their adaptability and broad utility across different modalities and tasks.

References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021.
[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral temporal graph neural network for multivariate time-series forecasting. Advances in Neural Information Processing Systems, 33:17766–17778, 2020.
[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: OpenMMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[6] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. LargeKernel3D: Scaling up kernels in 3D sparse CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13488–13498, 2023.
[7] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[8] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[10] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.
[11] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11963–11975, 2022.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021.
[13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
[14] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
[15] Yuan Gong, Yu-An Chung, and James Glass. AST: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778, 2021.
[16] Yuan Gong, Yu-An Chung, and James Glass. PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3292–3306, 2021.
[17] Yuan Gong, Cheng-I Lai, Yu-An Chung, and James Glass. SSAST: Self-supervised audio spectrogram transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10699–10709, 2022.
[18] Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv preprint arXiv:2106.04263, 2021.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[21] H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi, J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers, et al. The ERA5 global reanalysis. Q. J. Roy. Meteor. Soc., 146:1999–2049, 2020.
[22] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[23] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3464–3473, 2019.
[24] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[25] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE Computer Society, 2017.
[26] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
[27] Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2014. Online at http://otexts.org/fpp, 2017.
[28] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[29] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[30] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.
[31] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[33] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[35] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.
[36] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Mocanu, and Zhangyang Wang. More ConvNets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022.
[37] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. arXiv preprint arXiv:2111.09883, 2021.
[38] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[39] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[40] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545, 2022.
[41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[42] Pinjun Luo, Guoqiang Xiao, Xinbo Gao, and Song Wu. LKD-Net: Large kernel convolution network for single image dehazing. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 1601–1606. IEEE, 2023.
[43] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems 29, pages 4898–4906, 2016.
[44] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2019.
[45] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2017.
[46] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[47] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[48] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
[49] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[50] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. arXiv preprint arXiv:2107.00645, 2021.
[51] Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. HorNet: Efficient high-order spatial interactions with recursive gated convolutions. Advances in Neural Information Processing Systems, 35:10353–10366, 2022.
[52] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[54] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16519–16529, 2021.
[55] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[56] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[57] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[58] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[59] Sean J Taylor and Benjamin Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.
[60] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. KPConv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
[61] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[62] Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, pages 516–533. Springer, 2022.
[63] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. FastViT: A fast hybrid vision transformer using structural reparameterization. arXiv preprint arXiv:2303.14189, 2023.
[64] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12894–12904, 2021.
[65] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[66] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
[67] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
[68] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. TOG, 2019.
[69] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
[70] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
[71] Haixu Wu, Hang Zhou, Mingsheng Long, and Jianmin Wang. Interpretable weather forecasting for worldwide stations with a unified deep model. Nature Machine Intelligence, pages 1–10, 2023.
[72] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In CVPR, 2019.
[73] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
[74] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
[75] Chengxing Xie, Xiaoming Zhang, Linze Li, Haiteng Meng, Tianlin Zhang, Tianrui Li, and Xiaole Zhao. Large kernel distillation network for efficient single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1283–1292, 2023.
[76] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. MetaFormer is actually what you need for vision. arXiv preprint arXiv:2111.11418, 2021.
[77] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[78] Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, and Xiangyu Yue. Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023.
[79] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
[80] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video,
Point Cloud, Time-Series and Image Recognition
Supplementary Material
Appendix A: General Transformation from Dilated Convolution to Non-dilated Large-Kernel Convolution

Since ignoring pixels of the input is equivalent to inserting extra zero entries into the conv kernel, a dilated conv layer with a small kernel can be equivalently converted into a non-dilated layer with a sparse larger kernel. Let k be the kernel size and r be the dilation rate of the dilated layer; by inserting zero entries, the kernel size of the corresponding non-dilated layer becomes (k − 1)r + 1, which is referred to as the equivalent kernel size for brevity.

As discussed in the paper, to eliminate the inference costs of the extra dilated conv layers in the Dilated Reparam Block, we propose to equivalently transform the whole block into a single non-dilated conv layer for inference. As discussed before, let k and r be the kernel size and dilation rate, respectively; the transformation from a dilated conv layer's kernel W ∈ R^{k×k} to a non-dilated layer's kernel W′ ∈ R^{((k−1)r+1)×((k−1)r+1)} can be elegantly realized by a transpose convolution with a stride of r and an identity kernel I ∈ R^{1×1}, which is the scalar 1 viewed as a kernel tensor. That is

W′ = conv_transpose2d(W, I, stride = r) .   (6)

In general cases with multi-channel conv layers, let the input channels, output channels, and number of groups be c_in, c_out, and g, respectively; we denote the kernel by a 4D tensor whose shape is c_out × (c_in/g) × k × k.
1) For a multi-channel depthwise (DW) layer, the transformation is easily generalized from 2D to 4D: the identity kernel I is viewed as a 4D tensor I ∈ R^{1×1×1×1}, and we still follow Eq. (6) to derive the equivalent kernel by transpose convolution.
2) For non-DW cases (i.e., g < c_in), the transformation can be seen as splitting the kernel into slices (each of which can be seen as a DW kernel), converting the slices respectively, and concatenating the resultant non-dilated slices up.
We present the code in PyTorch (Fig. 4) and a test case demonstrating the equivalency (Fig. 5).

ImageNet image classification. The training configurations for the ImageNet-1K-only results shown in Section 4 are presented in Table 14. These configurations are similar to common practices. For the experiments in Section 3, we use the same configurations, except that the training epochs are set to 100 and the drop path rate is set to 0.1. For the models pretrained on ImageNet-22K and then fine-tuned on ImageNet-1K, the configurations are shown in Table 15. Note that we follow the configurations adopted by ConvNeXt for a fair comparison with ConvNeXt-S/B, and the configurations used by InternImage for a fair comparison with InternImage-L/XL (the results with ImageNet-22K-pretrained InternImage-S/B were not reported).

COCO object detection. For fair comparisons, we follow common practices [37, 40] to initialize the backbone with pretrained weights and train the models using a 3× (36 epochs) schedule by default. The shorter side is resized to 480−800 pixels, while the longer side does not exceed 1,333 pixels. All the models are trained with a batch size of 16 and the AdamW [41] optimizer with an initial learning rate of 1 × 10⁻⁴. We will publicly release the training configuration file used in the MMDetection framework.

ADE20K semantic segmentation. We evaluate UniRepLKNet models on the ADE20K dataset [79] and initialize them with the pretrained classification weights. The learning rate is initialized with 1 × 10⁻⁴ and decayed with the polynomial decay schedule with a power of 1.0. Following previous methods [37, 40], the crop size is set to 512 for the ImageNet-1K-pretrained models and 640 for the ImageNet-22K-pretrained models. All segmentation models are trained with a batch size of 16 for 160k iterations. We will publicly release the training configuration file used in the MMSegmentation framework.
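The polynomial decay mentioned above follows the standard rule lr = base_lr · (1 − iter/max_iter)^power; a minimal sketch (ours, not the released configuration file):

```python
def poly_lr(base_lr, cur_iter, max_iter=160_000, power=1.0, min_lr=0.0):
    # Polynomial ("poly") learning-rate decay with power 1.0 over 160k iterations.
    factor = (1 - cur_iter / max_iter) ** power
    return (base_lr - min_lr) * factor + min_lr

print(poly_lr(1e-4, 0))        # 1e-4 at the start
print(poly_lr(1e-4, 80_000))   # 5e-5 halfway through training
```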
Figure 4. PyTorch code to convert a dilated conv layer's small kernel to a non-dilated layer's larger sparse kernel.
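The full listing of Fig. 4 is not reproduced above; the following is a sketch consistent with the transformation described in this appendix (the function name and the slice-and-reshape strategy are ours), together with a small equivalence check in the spirit of Fig. 5.

```python
import torch
import torch.nn.functional as F

def convert_dilated_to_dense_kernel(kernel, dilation):
    # kernel: (c_out, c_in // g, k, k) of a conv used with the given dilation.
    # Returns the equivalent non-dilated kernel of size (k - 1) * dilation + 1.
    if dilation == 1:
        return kernel
    c_out, c_in_per_group, k, _ = kernel.shape
    identity = torch.ones(1, 1, 1, 1, dtype=kernel.dtype, device=kernel.device)
    # Treat every (out, in) slice as a single-channel map; a stride-r transpose
    # convolution with the 1x1 identity kernel performs exactly the zero insertion.
    slices = kernel.reshape(c_out * c_in_per_group, 1, k, k)
    dense = F.conv_transpose2d(slices, identity, stride=dilation)
    big_k = (k - 1) * dilation + 1
    return dense.reshape(c_out, c_in_per_group, big_k, big_k)

# Equivalence check: dilated conv vs. non-dilated conv with the converted kernel.
x = torch.randn(2, 4, 32, 32)
w = torch.randn(8, 4, 3, 3)
r = 3
y_dilated = F.conv2d(x, w, dilation=r, padding=r * (3 - 1) // 2)
y_dense = F.conv2d(x, convert_dilated_to_dense_kernel(w, r),
                   padding=((3 - 1) * r) // 2)
print(torch.allclose(y_dilated, y_dense, atol=1e-4))  # True
```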
Table 14. Detailed training configurations of ImageNet-1K-only models. Apart from the configurations shown in the table, we use
random left-right flipping, random resized crop, color jitter of 0.4, Auto-augment, and no repeated augmentation for every model.
Table 15. Detailed training configurations of models pretrained with ImageNet-22K (IN-22K pt) and then finetuned on ImageNet-
1K (IN-1K ft). Apart from the configurations shown in the table, we use random left-right flipping, random resized crop, color jitter of
0.4, Auto-augment, and no repeated augmentation for every model.