UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
[Architecture diagram: stages built from LarK and SmaK Blocks, with the Dilated Reparam Block, DW conv, BN, ReLU, SE Block, avg-pool, and 2x/4x downsampling layers.]

For example, when the objective is to extract higher-level local spatial patterns from lower-level ones, a 3×3 conv layer might be a more suitable option than a large-kernel conv layer. The reason is that the latter demands more computations and may result in patterns that are no longer restricted to smaller local regions, which could be undesirable in specific scenarios.
RepLKNet first proposed that simply scaling up the kernel size of existing ConvNets resulted in improvements, especially on downstream tasks [11]. It proposed several guidelines for using large kernels, which were focused on the microstructural design (e.g., using a shortcut alongside the large kernel) and application (large-kernel ConvNets should be evaluated on downstream tasks). In terms of the architecture, RepLKNet merely followed Swin Transformer for simplicity. In the past two years, large-kernel ConvNets have been intensively studied. Some works succeeded in further enlarging the kernel sizes [36], generalizing the idea to 3D scenarios [6] and to many downstream tasks, e.g., image dehazing [42] and super-resolution [75]. However, we note that the architectural design of ConvNets with very large kernels remains under-explored. For example, SLaK [36] followed the architecture developed by ConvNeXt, which is a powerful architecture of medium-sized (7×7) kernels.

3. Architectural Design of UniRepLKNet

3.1. Dilated Reparam Block

It is reported that a large-kernel conv should be used together with a parallel small-kernel one because the latter helps capture small-scale patterns during training [11]. Their outputs are added up after two respective Batch Normalization (BN) [28] layers. After training, with the Structural Re-parameterization [10] methodology, we merge the BN layers into the conv layers so that the small-kernel conv can be equivalently merged into the large-kernel one for inference. In this work, we note that besides small-scale patterns, enhancing the large kernel's capability to capture sparse patterns (i.e., a pixel on a feature map may be more related to some distant pixels than to its neighbors) may yield features of higher quality. The need to capture such patterns exactly matches the mechanism of dilated convolution: from a sliding-window perspective, a dilated conv layer with a dilation rate of r scans the input channel to capture spatial patterns where each pixel of interest is r − 1 pixels away from its neighbor. Therefore, we use dilated conv layers parallel to the large kernel and add up their outputs.

To eliminate the inference costs of the extra dilated conv layers, we propose to equivalently transform the whole block into a single non-dilated conv layer for inference. Since ignoring pixels of the input is equivalent to inserting extra zero entries into the conv kernel, a dilated conv layer with a small kernel can be equivalently converted into a non-dilated (i.e., r = 1) layer with a sparse larger kernel. Let k be the kernel size of the dilated layer; by inserting zero entries, the kernel size of the corresponding non-dilated layer becomes (k − 1)r + 1, which is referred to as the equivalent kernel size for brevity. We further note that such a transformation from the former kernel W ∈ R^{k×k} to the latter W′ ∈ R^{((k−1)r+1)×((k−1)r+1)} can be elegantly realized by a transpose convolution with a stride of r and an identity kernel I ∈ R^{1×1}, which is the scalar 1 viewed as a kernel tensor.¹ With PyTorch-style pseudo code, that is

W′ = conv_transpose2d(W, I, stride = r) .   (1)

The equivalency can be easily verified: given an arbitrary W ∈ R^{k×k} and an arbitrary input channel, a convolution with W and a dilation rate of r always yields results identical to a non-dilated convolution with W′.²

Based on such equivalent transformations, we propose a Dilated Reparam Block, which uses a non-dilated small-kernel layer and multiple dilated small-kernel layers to enhance a non-dilated large-kernel conv layer. Its hyper-parameters include the size of the large kernel K, the sizes of the parallel conv layers k, and the dilation rates r. The shown case (Fig. 2) with four parallel layers is denoted by K=9, r=(1,2,3,4), k=(5,3,3,3). For a larger K, we may use more dilated layers with larger kernel sizes or dilation rates. The kernel sizes and dilation rates of the parallel branches are flexible, and the only constraint is (k − 1)r + 1 ≤ K. For example, with K=13 (the default setting in our experiments), we use five layers with k=(5,7,3,3,3) and r=(1,2,3,4,5), so the equivalent kernel sizes will be (5,13,7,9,11), respectively. To convert a Dilated Reparam Block into a large-kernel conv layer for inference, we first merge every BN into the preceding conv layer, convert every layer with dilation r > 1 with Eq. (1), and add up all the resultant kernels with appropriate zero-paddings. For example, the layer in Fig. 2 with k=3, r=3 is converted into a sparse 7×7 kernel and added to the 9×9 kernel with one-pixel zero padding on each side.

¹ We showcase a single-channel conv, and it is easy to generalize the transformation to multi-channel cases. See the Appendix for details.
² In common cases where the shape of the output equals that of the input, i.e., the padding of the former is (k−1)/2, note that the padding of the latter should be (k−1)r/2 since the size of the equivalent sparse kernel is (k − 1)r + 1.
Figure 2. Dilated Reparam Block uses dilated small-kernel conv layers to enhance a non-dilated large-kernel layer. Such dilated layers are equivalent to a non-dilated conv layer with a larger sparse kernel, as shown from the parameter perspective, so that the whole block can be equivalently transformed into a single large-kernel conv. This example shows K=9; we may use more dilated layers for a larger K.
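To make Eq. (1) and the merging procedure concrete, below is a minimal single-channel PyTorch sketch. It is our own illustration rather than the released implementation (BN fusion is omitted): each dilated branch's kernel is converted to its equivalent sparse kernel with conv_transpose2d, zero-padded to K×K, and summed into one non-dilated large kernel, followed by a numerical check for the K=9 case of Fig. 2.

```python
import torch
import torch.nn.functional as F

def to_dense_kernel(w, r):
    # Eq. (1): a (k, k) kernel used with dilation r is equivalent to a
    # ((k-1)*r+1, (k-1)*r+1) non-dilated kernel obtained by zero insertion.
    if r == 1:
        return w
    identity = torch.ones(1, 1, 1, 1)
    return F.conv_transpose2d(w.view(1, 1, *w.shape), identity, stride=r)[0, 0]

def merge_branches(kernels, dilations, K):
    # Zero-pad every equivalent kernel to K x K and add them up.
    merged = torch.zeros(K, K)
    for w, r in zip(kernels, dilations):
        dense = to_dense_kernel(w, r)
        pad = (K - dense.shape[0]) // 2
        merged = merged + F.pad(dense, [pad, pad, pad, pad])
    return merged

# The Fig. 2 example: a 9x9 kernel plus parallel branches with
# k=(5,3,3,3) and r=(1,2,3,4), all merged into a single 9x9 kernel.
kernels = [torch.randn(9, 9), torch.randn(5, 5), torch.randn(3, 3),
           torch.randn(3, 3), torch.randn(3, 3)]
dilations = [1, 1, 2, 3, 4]
big = merge_branches(kernels, dilations, K=9)

x = torch.randn(1, 1, 32, 32)
y_branches = sum(F.conv2d(x, w.view(1, 1, *w.shape), dilation=r,
                          padding=r * (w.shape[0] - 1) // 2)
                 for w, r in zip(kernels, dilations))
y_merged = F.conv2d(x, big.view(1, 1, 9, 9), padding=4)
print(torch.allclose(y_branches, y_merged, atol=1e-4))  # True
```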
3.2. Architectural Guidelines for Large Kernels

Vanilla architecture. We first construct the vanilla architecture to experiment on. As a common practice, the main body of the model is split into four stages connected by downsampling blocks. Specifically, the first downsampling block uses two stride-2 3×3 conv layers to transform the raw input into C-channel feature maps, where C is an architectural hyper-parameter, and the other three downsampling blocks each use one stride-2 3×3 conv layer performing 2× channel expansion, so that the numbers of channels of the four stages are C, 2C, 4C, and 8C, respectively. A stage comprises blocks whose vanilla design resembles ConvNeXt, i.e., a depthwise (DW) conv layer and a Feed-Forward Network (FFN) with a GRN unit [70], but we use BN instead of LayerNorm [1] after the conv layer, as BN can be equivalently merged into the conv layer to eliminate its inference costs; the BN after the FFN can also be equivalently merged into the preceding layer (i.e., the second linear layer in the FFN). The numbers of such blocks in the four stages are denoted by N (N1, N2, N3, N4), respectively. Following ConvNeXt-T, the vanilla architecture uses C=96 and N=(3,3,9,3). By default, the last three stages use a 13×13 Dilated Reparam Block as the DW layer, which means K=13, k=(5,7,3,3,3) and r=(1,2,3,4,5); the first stage uses a DW 3×3 conv as the DW layer.

Figure 3. Options of the extra structures to increase the depth.

Table 1. Models with different efficient extra structures to increase the depth. We report the ImageNet accuracy (Acc), ADE20K mIoU, and actual throughput (Img/s).
Extra structure | Params | FLOPs | Img/s | Acc | mIoU
None | 31.3M | 4.92G | 1954 | 81.2 | 45.1
(A) Bottleneck | 32.9M | 5.18G | 1716 | 81.5 | 46.3
(B) Two 1×1 | 32.9M | 5.17G | 1745 | 81.3 | 46.2
(C) Two DW 3×3 | 31.4M | 4.96G | 1659 | 81.3 | 45.4
(D) SE Block | 32.9M | 4.92G | 1863 | 81.6 | 46.5

Experimental settings and metrics. It has been emphasized in the literature [11] that large-kernel ConvNets should be evaluated on downstream tasks, as their full potential may not be accurately reflected by the ImageNet accuracy alone. Therefore, in addition to the ImageNet-1K accuracy after 100-epoch training, we transfer the trained model with UPerNet [74] to ADE20K to examine its performance on semantic segmentation and report the single-scale mIoU after a 160k-iteration standard finetuning process [7]. Besides the parameters and FLOPs, we test the actual throughput on an A100 GPU with a batch size of 128 and an input resolution of 224×224, measured in images per second (img/s). See the Appendix for detailed configurations.
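As a reference for the throughput metric, a minimal sketch of how images-per-second can be measured on a single GPU with a fixed batch size and resolution (our illustration; the paper's exact measurement script may differ):

```python
import time
import torch

@torch.no_grad()
def throughput_img_per_s(model, batch_size=128, resolution=224,
                         warmup=20, iters=50):
    # Forward-only throughput with a fixed batch size and input resolution.
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device="cuda")
    for _ in range(warmup):            # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()           # wait for all queued GPU work
    return batch_size * iters / (time.time() - start)
```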
Architectural Guideline 1 on Block Design: use efficient structures that perform both inter-channel communications and spatial aggregations to increase the depth. We first seek to insert some structures to universally boost the model's representational capacity, which requires nonlinearity and efficient trainable transformations. We naturally try a bottleneck composed of a 1×1 conv that reduces the channels to 1/4, a DW 3×3 conv, and another 1×1 conv to expand the channels back (Fig. 3). We use BN and ReLU after the conv layers as a common practice. Table 1 shows that the performance improves with acceptable costs (+1.2 mIoU with a 12% slowdown). The performance degrades as we remove the DW 3×3 conv so that only the two 1×1 conv layers remain, or replace the bottleneck structure with two DW 3×3 layers, suggesting that such structures require both spatial aggregation transformations and channel mixing. Motivated by this, and considering that SE Block [24] elegantly realizes both transformations in a more efficient way (i.e., global average pooling and nonlinear mapping of the pooled vectors), we try it, also with a 1/4 channel reduction, and observe better performance and higher throughput. We therefore use the SE Block as a substructure of our block design in the following explorations.
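For concreteness, option (D) above, i.e., global average pooling followed by a nonlinear mapping of the pooled vector with 1/4 channel reduction whose output re-scales the feature map, can be sketched as follows (our illustration; the actual implementation may differ in details):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation substructure with 1/4 channel reduction (option D).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.down = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.up = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)   # global average pooling
        s = self.relu(self.down(s))            # nonlinear mapping of pooled vector
        s = torch.sigmoid(self.up(s))          # per-channel gates
        return x * s                           # re-scale the feature map
```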
Architectural Guideline 2 on Re-parameterization: use dilated small kernels to re-parameterize a large kernel. For a fair comparison with the Dilated Reparam Block, we try two variants with the same numbers of parallel branches composed of non-dilated layers with A) the same kernel sizes or B) the same equivalent kernel sizes. For our default setting of K=13, r=(1,2,3,4,5), k=(5,7,3,3,3), the kernel sizes of the five branches will be k=(5,7,3,3,3) or (5,13,7,9,11) for the two variants, respectively. All the models end up with the same inference structure, but the training structures differ. Table 2 shows lower performance of the variants, suggesting that the large kernel benefits from the parallel dilated conv layers' ability to capture sparse patterns, rather than merely from the extra small kernels (variant A) or the combination of different receptive fields (variant B). We use the Dilated Reparam Block in the following explorations.³

³ While describing the architecture in this paper, using a K×K (K≥9) conv means a K×K Dilated Reparam Block, unless otherwise noted.
Table 2. Models with different forms of Structural Re-parameterization on the 13×13 conv layers.
Re-param | k | r | Acc | mIoU
None | N/A | N/A | 81.4 | 45.8
Dilated Reparam | 5,7,3,3,3 | 1,2,3,4,5 | 81.6 | 46.5
Same kernel size | 5,7,3,3,3 | 1,1,1,1,1 | 81.6 | 46.2
Same eq kernel size | 5,13,7,9,11 | 1,1,1,1,1 | 81.6 | 46.2

Table 3. Models with different kernel sizes in the four stages, denoted by S1 - S4. Numbers in parentheses are obtained with the UPerNet only taking the outputs of S4.
S1 S2 S3 S4 | Params | FLOPs | Img/s | Acc | mIoU
3 13 13 13 | 32.9M | 4.92G | 1863 | 81.6 | 46.5 (42.4)
3 11 11 11 | 32.6M | 4.86G | 1876 | 81.6 | 45.5 (41.9)
3 3 13 13 | 32.8M | 4.85G | 2006 | 81.7 | 46.1
3 13 3 13 | 32.4M | 4.81G | 2015 | 81.6 | 45.9
3 13 13 3 | 32.5M | 4.90G | 1884 | 81.4 | 45.8
3 15 15 15 | 33.3M | 4.99G | 1851 | 81.7 | 45.9 (42.7)
13 13 13 13 | 33.0M | 5.06G | 1547 | 81.6 | 44.9 (42.4)

Table 4. Different numbers of LarK and SmaK Blocks in Stage 3.
N3 | LarK | SmaK | Params | FLOPs | Img/s | Acc | mIoU
9 | 9 | 0 | 32.9M | 4.92G | 1863 | 81.6 | 46.5
27 | 27 | 0 | 56.7M | 9.31G | 1145 | 82.3 | 49.0
27 | 14 | 13, 3×3 | 55.9M | 9.15G | 1229 | 82.3 | 48.8
27 | 9 | 18, 3×3 | 55.6M | 9.10G | 1264 | 82.3 | 48.8
27 | 9 | 18, w/o 3×3 | 55.5M | 9.08G | 1289 | 82.2 | 47.8

Architectural Guideline 3 on Kernel Size: decide the kernel size according to the downstream task and usually use large kernels in middle- and high-level layers. As introduced above, the baseline model uses 3×3 conv in the first stage and 13×13 in the last three stages. Table 3 shows that replacing the large kernels in the last three stages with 3×3 or changing K from 13 to 11 degrades the models, especially in the ADE20K mIoU, which highlights the significance of large kernels. Interestingly, using 13×13 in Stage 1 or enlarging K from 13 to 15 makes almost no difference in the ImageNet accuracy but reduces the ADE20K mIoU.

Remark. We argue that this phenomenon does not mean larger kernels result in lower feature quality. It is due to the structural priors of UPerNet, which takes the features extracted by the low-level layers of the backbone and assumes they should only encode local information, so that combining them with the high-level features extracted from the last layers of the backbone results in better segmentation. With larger kernels in the lower stages, the low-level features are no longer confined to small local areas, so the UPerNet benefits less from combining them with the high-level features. We verify this explanation by making the UPerNet only use the high-level features (i.e., the outputs of Stage 4) to evaluate the quality of the eventual features alone. Under this setting, K=15 delivers the best mIoU (42.7), the model with large kernels in Stage 1 performs as well as the baseline (42.4), and K=11 performs the worst (41.9). Such observations confirm that large kernels, even when they are used inappropriately, do not damage the feature quality of the ConvNet but merely make the low-level features less favorable for certain downstream models that require local low-level features, suggesting we should decide the kernel size according to the specific downstream tasks and framework. In our specific use cases (i.e., representative image recognition tasks with common downstream frameworks), we employ 13×13 kernels in the middle- and high-level stages by default.

Architectural Guideline 4 on the Scaling Rule: while scaling up the depth, the added blocks should use small kernels. The scaling rule of existing large-kernel ConvNets follows the traditional ConvNets, i.e., stacking more large kernels to build up a deeper model, but we argue that a large-kernel ConvNet may not benefit from more large kernels. In this group of experiments (Table 4), we scale up N3 from 9 to 27, following ConvNeXt-S [40]. Considering that nine 13×13 blocks may have already built up a sufficient receptive field, we examine whether the added blocks should also use large kernels. Specifically, we refer to the block with a Dilated Reparam Block as the Large Kernel Block (LarK Block) and name a block that uses a DW 3×3 conv a Small Kernel Block (SmaK Block), so that the shallow model has 3 SmaK Blocks in Stage 1 and 3, 9, 3 LarK Blocks in the last three stages. While scaling up the depth of Stage 3, we tried the following options. A) All of the 27 blocks are LarK Blocks. B) We interleave SmaK Blocks with LarK Blocks so that Stage 3 has 14 LarK Blocks and 13 SmaK Blocks. C) We place two SmaK Blocks after each LarK Block so that the resultant model has the same 9 LarK Blocks as before but 18 extra SmaK Blocks. D) We remove the DW 3×3 layers in the SmaK Blocks. Table 4 shows that scaling up the depth brings significant improvements, which is expected, and that 9 LarK Blocks are sufficient. Though 27 LarK Blocks perform slightly better in the ADE20K mIoU, the inference speed is observably slowed down. Besides, the model without 3×3 conv in the SmaK Blocks shows significantly lower mIoU with only minor improvements in the throughput, suggesting such small kernels in SmaK Blocks are useful while scaling up the depth of a large-kernel ConvNet, as they increase the abstraction hierarchy of spatial patterns, though they may not effectively enlarge the ERF [11, 43]. This observation supports our motivation to decouple the effects of conv layers in enlarging the ERF and extracting more complicated spatial patterns, as discussed in Sec. 1.
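As a small illustration of these options (our own notation, not code from the paper), the Stage-3 block layouts compared in Table 4 can be generated as follows; option C yields the "9 + 18" layout used by the larger models in Table 5.

```python
def stage3_layout(option, num_lark=9):
    # Block-type sequences for the Stage-3 options compared in Table 4.
    if option == "A":                      # 27 LarK Blocks
        return ["LarK"] * 27
    if option == "B":                      # interleaved: 14 LarK + 13 SmaK
        return ["LarK" if i % 2 == 0 else "SmaK" for i in range(27)]
    if option in ("C", "D"):               # 9 LarK + 18 SmaK ("9 + 18")
        layout = []
        for _ in range(num_lark):
            layout += ["LarK", "SmaK", "SmaK"]
        return layout                      # option D drops the DW 3x3 inside SmaK
    raise ValueError(option)

print(stage3_layout("C").count("LarK"), stage3_layout("C").count("SmaK"))  # 9 18
```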
3.3. Architectural Specifications

Following our proposed guidelines, we instantiate a series of models (Table 5). For a fair comparison with ConvNeXt V2 [70], UniRepLKNet-A/F/P/N follows its configurations. We scale up the depth to build UniRepLKNet-T/S and scale up the width to construct UniRepLKNet-S/B/L/XL.
Table 5. Architectural hyper-parameters of UniRepLKNet instances, including the number of blocks in the four stages N1, N2, N3, N4 and the channels C of the first stage. Stage 1 uses SmaK Blocks, and Stages 2 and 4 use LarK Blocks only. For Stage 3, e.g., "9 + 18" means 9 LarK Blocks and 18 SmaK Blocks.
Model | N1 | N2 | N3 | N4 | C | Params
UniRepLKNet-A | 2 | 2 | 6 + 0 | 2 | 40 | 4.4M
UniRepLKNet-F | 2 | 2 | 6 + 0 | 2 | 48 | 6.2M
UniRepLKNet-P | 2 | 2 | 6 + 0 | 2 | 64 | 10.7M
UniRepLKNet-N | 2 | 2 | 8 + 0 | 2 | 80 | 18.3M
UniRepLKNet-T | 3 | 3 | 9 + 9 | 3 | 80 | 31.0M
UniRepLKNet-S | 3 | 3 | 9 + 18 | 3 | 96 | 55.6M
UniRepLKNet-B | 3 | 3 | 9 + 18 | 3 | 128 | 97.9M
UniRepLKNet-L | 3 | 3 | 9 + 18 | 3 | 192 | 218.3M
UniRepLKNet-XL | 3 | 3 | 9 + 18 | 3 | 256 | 386.4M
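For quick reference, the Table 5 hyper-parameters can be written as a plain lookup (a transcription of the table, not the official configuration format); the Stage-3 entry is split into (LarK, SmaK) counts.

```python
# ((N1, N2, (LarK, SmaK), N4), C) per UniRepLKNet variant, from Table 5.
UNIREPLKNET_CONFIGS = {
    "A":  ((2, 2, (6, 0),  2), 40),
    "F":  ((2, 2, (6, 0),  2), 48),
    "P":  ((2, 2, (6, 0),  2), 64),
    "N":  ((2, 2, (8, 0),  2), 80),
    "T":  ((3, 3, (9, 9),  3), 80),
    "S":  ((3, 3, (9, 18), 3), 96),
    "B":  ((3, 3, (9, 18), 3), 128),
    "L":  ((3, 3, (9, 18), 3), 192),
    "XL": ((3, 3, (9, 18), 3), 256),
}
```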
3.4. Generalizing UniRepLKNet beyond Image

To utilize the universal perception ability of UniRepLKNet, we preprocess the data of different modalities into B × C′ × H × W embedding maps, where B is the batch size and C′ is determined by the modality, and configure the input channel of the first layer of UniRepLKNet to C′. For simplicity, the other parts of the models are the same as the UniRepLKNet initially designed for images, without any modality-specific customization. By doing so, we directly apply a ConvNet typically used for image tasks to deal with data of other modalities. In other words, the UniRepLKNet for image tasks can be seen as a general UniRepLKNet with C′=3 and no such preprocessing. We introduce how to transform the data into such embedding maps as follows.

Time-series. Let L and D be the length and dimensions of a time-series sequence x_T ∈ R^{B×L×D}. We adopt the embedding layer in Corrformer [71] to split it into n nodes and then project it into a latent space R^{Bn×L×D′} (D′ and n are configurable hyper-parameters of the embedding layer). Then we simply reshape it into a single-channel embedding map:

x_T ∈ R^{B×L×D} → R^{Bn×L×(D/n)} → R^{Bn×L×D′} → R^{Bn×1×H×W} s.t. HW = LD′.   (2)

Audio. Let T and F be the numbers of time frames and frequency bins; we use x_A ∈ R^{B×T×F} to represent audio data. A sample is seen as a 1 × T × F embedding map that resembles a single-channel image, so C′=1, H=T, W=F:

x_A ∈ R^{B×T×F} → R^{B×1×T×F}.   (3)

Point cloud. Assume a sample comprises P points, each represented by its X/Y/Z coordinates; we use a series of conv layers to generate three-view projections [78]. We configure the resolution of the generated projections to be 224, so that H=W=224 and C′=3:

x_P ∈ R^{B×P×3} → R^{B×3×224×224}.   (4)

Video. We represent a video as NF frames, each frame being a 3 × h × w image. We reshape it by merging the frame dimension into the height and width dimensions, so that we obtain a representation that can be viewed as a single image created by laying out (i.e., concatenating) the NF frames. For example, in our experiments, we have NF=16 and h=w=224, so that H=W=896. Generally,

x_V ∈ R^{B×NF×3×h×w} → R^{B×3×H×W} s.t. HW/(hw) = NF.   (5)
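The audio and video mappings in Eqs. (3) and (5) are pure reshapes, so they can be sketched directly (the time-series and point-cloud branches additionally need the Corrformer embedding layer and the learned projection convs and are omitted). The 4×4 frame grid below is one way to lay out NF=16 frames; the paper only requires HW/(hw) = NF.

```python
import torch

def audio_to_map(x_a):
    # Eq. (3): a (B, T, F) spectrogram becomes a (B, 1, T, F) single-channel map.
    return x_a.unsqueeze(1)

def video_to_map(x_v, grid=4):
    # Eq. (5): (B, NF, 3, h, w) -> (B, 3, H, W) by tiling the NF = grid**2
    # frames into a grid x grid layout, so that H*W / (h*w) == NF.
    b, nf, c, h, w = x_v.shape
    assert nf == grid * grid
    x = x_v.reshape(b, grid, grid, c, h, w)
    x = x.permute(0, 3, 1, 4, 2, 5)               # (B, 3, grid, h, grid, w)
    return x.reshape(b, c, grid * h, grid * w)

frames = torch.randn(2, 16, 3, 224, 224)
print(video_to_map(frames).shape)                 # torch.Size([2, 3, 896, 896])
```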
4. UniRepLKNet for Image Recognition

ImageNet classification. Following ConvNeXt [40], we use the widely adopted 300-epoch recipe to train UniRepLKNet-A/F/P/N/T/S on ImageNet-1K; we pretrain UniRepLKNet-S/B/L/XL on ImageNet-22K using the 90-epoch recipe and fine-tune with ImageNet-1K for 30 epochs (see the Appendix for details). As our goal is to develop models that run with high actual speed, we evaluate the actual throughput on the same A100 GPU using a batch size of 128. Table 6 shows the top-1 accuracy on the ImageNet-1K validation set, where the results are sorted by the throughput. We split the results into seven segments for better readability. 1) UniRepLKNet-A/F outperforms ConvNeXt-V2-A/F by 0.8/0.6 in accuracy and runs 19%/17% faster, respectively. 2) UniRepLKNet-P/N outperforms FastViT-T12/S12 and ConvNeXt V2-P/N by clear margins. 3) UniRepLKNet-T outperforms multiple small-level competitors. 4) UniRepLKNet-S outperforms a series of small-level and even base-level models in both speed and accuracy and runs almost as fast as InternImage-T. 5) With ImageNet-22K pretraining, UniRepLKNet-S even approaches the accuracy of RepLKNet-31L and runs 3× as fast as the latter. UniRepLKNet-B outperforms CoAtNet-2 and DeiT III-B by clear margins. UniRepLKNet-L outperforms InternImage-L in both accuracy and throughput. 6) On the XL level, UniRepLKNet-XL outperforms its competitors in both accuracy and throughput, running more than 2× as fast as CoAtNet-3 and 3× as fast as DeiT III-L.

COCO object detection and instance segmentation. We transfer the pretrained UniRepLKNets as the backbones of Cascade Mask R-CNN [3, 20] and adopt the standard 3× (36-epoch) training configuration with MMDetection [5]. Table 7 shows that UniRepLKNet outperforms Swin, ConvNeXt, RepLKNet, and SLaK, which are representatives of ViTs, modern medium-kernel ConvNets, and existing large-kernel ConvNets, respectively, and shows comparable performance to InternImage [67], which is a recent powerful architecture with deformable convolution.

ADE20K semantic segmentation. We use the pretrained UniRepLKNets as the backbones of UPerNet [74] on ADE20K [80] and adopt the standard 160k-iteration training recipe with MMSegmentation [7]. Table 8 reports the mIoU on the validation set. Impressively, UniRepLKNet outperforms InternImage and the other models.
Table 6. ImageNet classification. Throughput is tested with an A100 GPU and a batch size of 128. "T"/"C" denote Transformer/ConvNet. "‡" indicates ImageNet-22K [9] pretraining.
Method | Type | Input size | Params (M) | FLOPs (G) | Throughput (img/s) | Acc (%)
UniRepLKNet-A | C | 224² | 4.4 | 0.6 | 5942 | 77.0
UniRepLKNet-F | C | 224² | 6.2 | 0.9 | 5173 | 78.6
ConvNeXt V2-A [70] | C | 224² | 3.7 | 0.5 | 5054 | 76.2
FastViT-T8 [63] | T | 256² | 3.6 | 0.7 | 5025 | 75.6
ConvNeXt V2-F [70] | C | 224² | 5.2 | 0.8 | 4329 | 78.0
UniRepLKNet-P | C | 224² | 10.7 | 1.6 | 3949 | 80.2
FastViT-T12 [63] | T | 256² | 6.8 | 1.4 | 3407 | 79.1
ConvNeXt V2-P [70] | C | 224² | 9.1 | 1.4 | 3339 | 79.7
FastViT-S12 [63] | T | 256² | 8.8 | 1.8 | 3162 | 79.8
UniRepLKNet-N | C | 224² | 18.3 | 2.8 | 2807 | 81.6
ConvNeXt V2-N [70] | C | 224² | 15.6 | 2.4 | 2405 | 81.2
UniRepLKNet-T | C | 224² | 31 | 4.9 | 1804 | 83.2
FastViT-SA24 [63] | T | 256² | 21 | 3.8 | 1670 | 82.6
PVTv2-B2 [66] | T | 224² | 25 | 4.0 | 1620 | 82.0
CoAtNet-0 [8] | T | 224² | 25 | 4.2 | 1613 | 81.6
DeiT III-S [62] | T | 224² | 22 | 4.6 | 1485 | 81.4
SwinV2-T/8 [39] | T | 256² | 28 | 6 | 1406 | 81.8
SLaK-T [36] | C | 224² | 30 | 5.0 | 1312 | 82.5
InternImage-T [67] | C | 224² | 30 | 5 | 1292 | 83.5
UniRepLKNet-S | C | 224² | 56 | 9.1 | 1265 | 83.9
ConvNeXt-S [40] | C | 224² | 50 | 8.7 | 1182 | 83.1
HorNet-T [51] | C | 224² | 23 | 3.9 | 1162 | 83.0
FastViT-SA36 [63] | T | 256² | 30 | 5.6 | 1151 | 83.6
CoAtNet-1 [8] | T | 224² | 42 | 8.4 | 969 | 83.3
SLaK-S [36] | C | 224² | 55 | 9.8 | 967 | 83.8
FastViT-MA36 [63] | T | 256² | 43 | 7.9 | 914 | 83.9
SwinV2-S/8 [39] | T | 256² | 50 | 12 | 871 | 83.7
RepLKNet-31B [11] | C | 224² | 79 | 15.3 | 859 | 83.5
PVTv2-B5 [66] | T | 224² | 82 | 11.8 | 802 | 83.8
UniRepLKNet-S‡ | C | 384² | 56 | 26.7 | 435 | 86.4
ConvNeXt-S‡ [40] | C | 384² | 50 | 25.5 | 415 | 85.8
UniRepLKNet-B‡ | C | 384² | 98 | 47.2 | 314 | 87.4
ConvNeXt-B‡ [40] | C | 384² | 89 | 45.1 | 304 | 86.8
UniRepLKNet-L‡ | C | 384² | 218 | 105.4 | 190 | 87.9
ConvNeXt-L‡ [40] | C | 384² | 198 | 101 | 185 | 87.5
CoAtNet-2‡ [8] | T | 384² | 75 | 49.8 | 163 | 87.1
RepLKNet-31L‡ [11] | C | 384² | 172 | 96.0 | 158 | 86.6
InternImage-L‡ [67] | C | 384² | 223 | 108 | 143 | 87.7
DeiT III-B‡ [62] | T | 384² | 87 | 55.5 | 138 | 86.7
UniRepLKNet-XL‡ | C | 384² | 386 | 187 | 131 | 88.0
ConvNeXt-XL‡ [40] | C | 384² | 350 | 179 | 129 | 87.8
HorNet-L‡ [51] | C | 384² | 202 | 102 | 127 | 87.7
InternImage-XL‡ [67] | C | 384² | 335 | 163 | 114 | 88.0
CoAtNet-3‡ [8] | T | 384² | 168 | 107 | 103 | 87.6
SwinV2-L/24‡ [39] | T | 384² | 197 | 115 | 88 | 87.6
CoAtNet-4‡ [8] | T | 384² | 275 | 190 | 58 | 87.9
DeiT III-L‡ [62] | T | 384² | 305 | 191 | 42 | 87.7

Table 7. Object detection on the COCO validation set. FLOPs are measured with 1280×800 inputs. "‡" indicates ImageNet-22K pretraining.
Method | Params (M) | FLOPs (G) | AP^box | AP^mask
UniRepLKNet-T | 89 | 749 | 51.7 | 44.9
Swin-T [37] | 86 | 745 | 50.4 | 43.7
ConvNeXt-T [40] | 86 | 741 | 50.4 | 43.7
SLaK-T [36] | - | - | 51.3 | 44.3
UniRepLKNet-S | 113 | 835 | 53.0 | 45.9
Swin-S [37] | 107 | 838 | 51.9 | 45.0
ConvNeXt-S [40] | 108 | 827 | 51.9 | 45.0
UniRepLKNet-S‡ | 113 | 835 | 54.3 | 47.1
UniRepLKNet-B‡ | 155 | 978 | 54.8 | 47.4
Swin-B‡ [37] | 145 | 982 | 53.0 | 45.8
ConvNeXt-B‡ [40] | 146 | 964 | 54.0 | 46.9
RepLKNet-31B‡ [11] | 137 | 965 | 52.2 | 45.2
UniRepLKNet-L‡ | 276 | 1385 | 55.8 | 48.4
Swin-L‡ [37] | 253 | 1382 | 53.9 | 46.7
ConvNeXt-L‡ [40] | 255 | 1354 | 54.8 | 47.6
RepLKNet-31L‡ [11] | 229 | 1321 | 53.9 | 46.5
InternImage-L‡ [67] | 277 | 1399 | 56.1 | 48.5
UniRepLKNet-XL‡ | 443 | 1952 | 56.4 | 49.0
InternImage-XL‡ [67] | 387 | 1782 | 56.2 | 48.8
ConvNeXt-XL‡ [40] | 407 | 1898 | 55.2 | 47.7

Table 8. Semantic segmentation on the ADE20K validation set. The FLOPs are measured with 512×2048 or 640×2560 inputs according to the crop size. "SS" and "MS" mean single- and multi-scale testing, respectively. "‡" indicates ImageNet-22K [9] pretraining.
Method | Crop size | Params (M) | FLOPs (G) | mIoU (SS) | mIoU (MS)
UniRepLKNet-T | 512² | 61 | 946 | 48.6 | 49.1
Swin-T [37] | 512² | 60 | 945 | 44.5 | 45.8
ConvNeXt-T [40] | 512² | 60 | 939 | 46.0 | 46.7
SLaK-T [36] | 512² | 65 | 936 | 47.6 | -
InternImage-T [67] | 512² | 59 | 944 | 47.9 | 48.1
UniRepLKNet-S | 512² | 86 | 1036 | 50.5 | 51.0
Swin-S [37] | 512² | 81 | 1038 | 47.6 | 49.5
ConvNeXt-S [40] | 512² | 82 | 1027 | 48.7 | 49.6
SLaK-S [36] | 512² | 91 | 1028 | 49.4 | -
InternImage-S [67] | 512² | 80 | 1017 | 50.1 | 50.9
UniRepLKNet-S‡ | 640² | 86 | 1618 | 51.9 | 52.7
UniRepLKNet-B‡ | 640² | 130 | 1850 | 53.5 | 53.9
Swin-B‡ [37] | 640² | 121 | 1841 | 50.0 | 51.7
ConvNeXt-B‡ [40] | 640² | 122 | 1828 | 52.6 | 53.1
RepLKNet-31B‡ [11] | 640² | 112 | 1829 | 51.5 | 52.3
UniRepLKNet-L‡ | 640² | 254 | 2507 | 54.5 | 55.0
Swin-L‡ [37] | 640² | 234 | 2468 | 52.1 | 53.5
RepLKNet-31L‡ [11] | 640² | 207 | 2404 | 52.4 | 52.7
ConvNeXt-L‡ [40] | 640² | 235 | 2458 | 53.2 | 53.7
InternImage-L‡ [67] | 640² | 256 | 2526 | 53.9 | 54.1
UniRepLKNet-XL‡ | 640² | 425 | 3420 | 55.2 | 55.6
ConvNeXt-XL‡ [40] | 640² | 391 | 3335 | 53.6 | 54.0
InternImage-XL‡ [67] | 640² | 368 | 3142 | 55.0 | 55.3

5. Universal Perception on other Modalities

Time-series. Following Corrformer [71], we conduct experiments on the Global Temperature and Wind Speed Forecasting challenge⁴ using the dataset collected from the National Centers for Environmental Information (NCEI). This huge-scale dataset contains hourly averaged wind speed and temperature data from 3,850 stations with different geographical scales and densities, spanning from 2019 to 2021. For a fair comparison with Corrformer, which was the previous state-of-the-art method, we use its embedding layer (as introduced in Sec. 3.4) and decoder and only replace its encoder transformer with UniRepLKNet-S. We also compare UniRepLKNet-S against a wide range of methods including statistical and numerical approaches. We make the following observations from Table 9. 1) In terms of the forecasting precision, UniRepLKNet delivers a new state-of-the-art performance, achieving the lowest errors of 7.602, 1.832, 3.865, and 1.301 for MSE and MAE in forecasting the global temperature and wind speed, respectively.

⁴ https://codeocean.com/capsule/0341365/tree/v1
Table 9. Time-series forecasting performance on the Global Temperature and Wind Speed Forecasting challenge. UniRepLKNet delivers a new state-of-the-art performance in Mean Squared Error (MSE) and Mean Absolute Error (MAE). GFS (https://www.ncei.noaa.gov/) stands for the Global Forecasting System.
Method | Type | Params | Temperature MSE ↓ | Temperature MAE ↓ | Wind speed MSE ↓ | Wind speed MAE ↓
Statistics-based
Holt–Winters [27] | - | - | 13.241 | 2.262 | 5.912 | 1.664
Prophet [59] | - | - | 11.626 | 2.946 | 9.691 | 2.382
GDBT [NeurIPS'17] [30] | - | - | 9.706 | 2.214 | 4.101 | 1.417
Numerical Simulation
GFS (reanalysis) | - | - | 14.933 | 2.287 | 9.993 | 2.340
ERA5 (reanalysis) [21] | - | - | 13.448 | 1.908 | 4.999 | 1.587
DeepAR [52] | - | - | 32.249 | 4.262 | 5.248 | 1.602
N-BEATS [44] | - | - | 9.203 | 2.117 | 4.124 | 1.390
Deep Learning Specialist
StemGNN [NeurIPS'20] [4] | GNN | 180M | 13.926 | 2.746 | 4.066 | 1.389
Pyraformer [ICLR'21] [35] | Transformer | 158M | 23.326 | 3.669 | 4.614 | 1.514
Corrformer [Nat. Mach. Intell.'23] [71] | Transformer | 155M | 7.709 | 1.888 | 3.889 | 1.304
Generalist
UniRepLKNet-S | ConvNet | 132M | 7.602 | 1.832 | 3.865 | 1.301

Table 11. Video recognition accuracy on Kinetics-400.
Method | Pretrain | Type | Acc (%) | Params
Specialist
SlowFast-101 [13] | IN-1K | ConvNet+RNN | 79.8 | 62.8M
MViTv2-B [33] | IN-1K | Transformer | 81.2 | 51.2M
TimeSFormer [2] | K400 | Transformer | 80.7 | 122M
Generalist
Meta-Transformer [78] | LAION-2B | Transformer | 47.3 | 86.9M
ImageBind [14] | CLIP Data | Transformer | 50.0 | 632M
UniRepLKNet-S | - | ConvNet | 54.8 | 55.5M

Table 12. Point cloud analysis on ModelNet-40.
Method | Type | mAcc (%) | OA (%)
PointNet [46] | MLP | 86.0 | 89.2
PointNet++ [47] | MLP | - | 91.9
PointConv [72] | ConvNet | - | 92.5
KPConv [60] | ConvNet | - | 92.9
DGCNN [68] | ConvNet | 90.2 | 92.9
UniRepLKNet-S | ConvNet | 90.3 | 93.2
6. Conclusion

We proposed four architectural guidelines for large-kernel ConvNets and an architecture named UniRepLKNet, which not only shows a leading performance in image recognition but also achieves remarkable results even on modalities such as audio and time-series data, outperforming multiple specialist models on those modalities. Such results not only signify a "comeback" for ConvNets in their original domain but also showcase large-kernel ConvNets' potential to "conquer" new territories, highlighting their adaptability and broad utility across different modalities and tasks.

References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021.
[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral temporal graph neural network for multivariate time-series forecasting. Advances in Neural Information Processing Systems, 33:17766–17778, 2020.
[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: OpenMMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[6] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. LargeKernel3D: Scaling up kernels in 3D sparse CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13488–13498, 2023.
[7] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[8] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[10] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.
[11] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11963–11975, 2022.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021.
[13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
[14] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
[15] Yuan Gong, Yu-An Chung, and James Glass. AST: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778, 2021.
[16] Yuan Gong, Yu-An Chung, and James Glass. PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3292–3306, 2021.
[17] Yuan Gong, Cheng-I Lai, Yu-An Chung, and James Glass. SSAST: Self-supervised audio spectrogram transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10699–10709, 2022.
[18] Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv preprint arXiv:2106.04263, 2021.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[21] H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi, J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers, et al. The ERA5 global reanalysis. Q. J. Roy. Meteor. Soc., 146:1999–2049, 2020.
[22] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[23] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3464–3473, 2019.
[24] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[25] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE Computer Society, 2017.
[26] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
[27] Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2014. Online at http://otexts.org/fpp, 2017.
[28] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[29] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[30] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.
[31] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[33] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[35] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.
[36] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Mocanu, and Zhangyang Wang. More ConvNets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022.
[37] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. arXiv preprint arXiv:2111.09883, 2021.
[38] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[39] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[40] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545, 2022.
[41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[42] Pinjun Luo, Guoqiang Xiao, Xinbo Gao, and Song Wu. LKD-Net: Large kernel convolution network for single image dehazing. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 1601–1606. IEEE, 2023.
[43] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems 29, pages 4898–4906, 2016.
[44] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2019.
[45] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2017.
[46] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[47] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[48] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
[49] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[50] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. arXiv preprint arXiv:2107.00645, 2021.
[51] Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. HorNet: Efficient high-order spatial interactions with recursive gated convolutions. Advances in Neural Information Processing Systems, 35:10353–10366, 2022.
[52] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[54] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16519–16529, 2021.
[55] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[56] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[57] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[58] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[59] Sean J Taylor and Benjamin Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.
[60] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. KPConv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
[61] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[62] Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, pages 516–533. Springer, 2022.
[63] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. FastViT: A fast hybrid vision transformer using structural reparameterization. arXiv preprint arXiv:2303.14189, 2023.
[64] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12894–12904, 2021.
[65] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[66] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
[67] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
[68] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. TOG, 2019.
[69] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
[70] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
[71] Haixu Wu, Hang Zhou, Mingsheng Long, and Jianmin Wang. Interpretable weather forecasting for worldwide stations with a unified deep model. Nature Machine Intelligence, pages 1–10, 2023.
[72] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In CVPR, 2019.
[73] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
[74] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
[75] Chengxing Xie, Xiaoming Zhang, Linze Li, Haiteng Meng, Tianlin Zhang, Tianrui Li, and Xiaole Zhao. Large kernel distillation network for efficient single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1283–1292, 2023.
[76] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. MetaFormer is actually what you need for vision. arXiv preprint arXiv:2111.11418, 2021.
[77] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[78] Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, and Xiangyu Yue. Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023.
[79] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
[80] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video,
Point Cloud, Time-Series and Image Recognition
Supplementary Material
Appendix A: General Transformation from Dilated Convolution to Non-dilated Large-Kernel Convolution

Since ignoring pixels of the input is equivalent to inserting extra zero entries into the conv kernel, a dilated conv layer with a small kernel can be equivalently converted into a non-dilated layer with a sparse larger kernel. Let k be the kernel size and r be the dilation rate of the dilated layer; by inserting zero entries, the kernel size of the corresponding non-dilated layer becomes (k − 1)r + 1, which is referred to as the equivalent kernel size for brevity.

As discussed in the paper, to eliminate the inference costs of the extra dilated conv layers in the Dilated Reparam Block, we propose to equivalently transform the whole block into a single non-dilated conv layer for inference. As discussed before, let k and r be the kernel size and dilation rate, respectively; the transformation from a dilated conv layer's kernel W ∈ R^{k×k} to a non-dilated layer's kernel W′ ∈ R^{((k−1)r+1)×((k−1)r+1)} can be elegantly realized by a transpose convolution with a stride of r and an identity kernel I ∈ R^{1×1}, which is the scalar 1 viewed as a kernel tensor. That is

W′ = conv_transpose2d(W, I, stride = r) .   (6)

In general cases with multi-channel conv layers, let the input channels, output channels, and number of groups be c_in, c_out, and g, respectively; we denote the kernel by a 4D tensor whose shape is c_out × (c_in/g) × k × k.
1) For a multi-channel depthwise (DW) layer, the transformation is easily generalized from 2D to 4D: the identity kernel I is viewed as a 4D tensor I ∈ R^{1×1×1×1}, and we still follow Eq. (6) to derive the equivalent kernel by transpose convolution.
2) For non-DW cases (i.e., g < c_in), the transformation can be seen as splitting the kernel into slices (each of which can be seen as a DW kernel), converting the slices respectively, and concatenating the resultant non-dilated slices up.
We present the code in PyTorch (Fig. 4) and a test case demonstrating the equivalency (Fig. 5).

ImageNet image classification. The training configurations for the ImageNet-1K-only results shown in Section 4 are presented in Table 14. These configurations are similar to common practices. For the experiments in Section 3, we use the same configurations, except that the training epochs are set to 100 and the drop path rate is set to 0.1. For the models pretrained on ImageNet-22K and then fine-tuned on ImageNet-1K, the configurations are shown in Table 15. Note that we follow the configurations adopted by ConvNeXt for a fair comparison with ConvNeXt-S/B, and the configurations used by InternImage for a fair comparison with InternImage-L/XL (the results with ImageNet-22K-pretrained InternImage-S/B were not reported).

COCO object detection. For fair comparisons, we follow common practices [37, 40] to initialize the backbone with pretrained weights and train the models using a 3× (36 epochs) schedule by default. The shorter side is resized to 480−800 pixels, while the longer side does not exceed 1,333 pixels. All the models are trained with a batch size of 16 and the AdamW [41] optimizer with an initial learning rate of 1 × 10⁻⁴. We will publicly release the training configuration file used in the MMDetection framework.

ADE20K semantic segmentation. We evaluate UniRepLKNet models on the ADE20K dataset [79] and initialize them with the pretrained classification weights. The learning rate is initialized with 1 × 10⁻⁴ and decayed with the polynomial decay schedule with a power of 1.0. Following previous methods [37, 40], the crop size is set to 512 for the ImageNet-1K-pretrained models and 640 for the ImageNet-22K-pretrained models. All segmentation models are trained with a batch size of 16 for 160k iterations. We will publicly release the training configuration file used in the MMSegmentation framework.
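The polynomial decay mentioned above follows the standard rule lr = base_lr · (1 − iter/max_iter)^power; a minimal sketch (ours, not the released configuration file):

```python
def poly_lr(base_lr, cur_iter, max_iter=160_000, power=1.0, min_lr=0.0):
    # Polynomial ("poly") learning-rate decay with power 1.0 over 160k iterations.
    factor = (1 - cur_iter / max_iter) ** power
    return (base_lr - min_lr) * factor + min_lr

print(poly_lr(1e-4, 0))        # 1e-4 at the start
print(poly_lr(1e-4, 80_000))   # 5e-5 halfway through training
```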
Figure 4. PyTorch code to convert a dilated conv layer's small kernel to a non-dilated layer's larger sparse kernel.
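The full listing of Fig. 4 is not reproduced above; the following is a sketch consistent with the transformation described in this appendix (the function name and the slice-and-reshape strategy are ours), together with a small equivalence check in the spirit of Fig. 5.

```python
import torch
import torch.nn.functional as F

def convert_dilated_to_dense_kernel(kernel, dilation):
    # kernel: (c_out, c_in // g, k, k) of a conv used with the given dilation.
    # Returns the equivalent non-dilated kernel of size (k - 1) * dilation + 1.
    if dilation == 1:
        return kernel
    c_out, c_in_per_group, k, _ = kernel.shape
    identity = torch.ones(1, 1, 1, 1, dtype=kernel.dtype, device=kernel.device)
    # Treat every (out, in) slice as a single-channel map; a stride-r transpose
    # convolution with the 1x1 identity kernel performs exactly the zero insertion.
    slices = kernel.reshape(c_out * c_in_per_group, 1, k, k)
    dense = F.conv_transpose2d(slices, identity, stride=dilation)
    big_k = (k - 1) * dilation + 1
    return dense.reshape(c_out, c_in_per_group, big_k, big_k)

# Equivalence check: dilated conv vs. non-dilated conv with the converted kernel.
x = torch.randn(2, 4, 32, 32)
w = torch.randn(8, 4, 3, 3)
r = 3
y_dilated = F.conv2d(x, w, dilation=r, padding=r * (3 - 1) // 2)
y_dense = F.conv2d(x, convert_dilated_to_dense_kernel(w, r),
                   padding=((3 - 1) * r) // 2)
print(torch.allclose(y_dilated, y_dense, atol=1e-4))  # True
```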
Table 14. Detailed training configurations of ImageNet-1K-only models. Apart from the configurations shown in the table, we use
random left-right flipping, random resized crop, color jitter of 0.4, Auto-augment, and no repeated augmentation for every model.
Table 15. Detailed training configurations of models pretrained with ImageNet-22K (IN-22K pt) and then finetuned on ImageNet-
1K (IN-1K ft). Apart from the configurations shown in the table, we use random left-right flipping, random resized crop, color jitter of
0.4, Auto-augment, and no repeated augmentation for every model.