
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

An Yang∗1, Junshu Pan∗1,2†, Junyang Lin∗1, Rui Men1, Yichang Zhang1, Jingren Zhou1, Chang Zhou1♣
1 DAMO Academy, Alibaba Group    2 Beihang University
{ya235025, panjunshu.pjs, junyang.ljy, ericzhou.zc}@alibaba-inc.com

arXiv:2211.01335v3 [cs.CV] 23 May 2023

Abstract

The tremendous success of vision-language foundation models has promoted the research and application of computer vision and multimodal representation learning. However, it is still difficult to effectively transfer such foundation models to language-specific scenarios. In this work, we propose Chinese CLIP with a two-stage pretraining method, which trains the model with locked-image tuning in the first stage and contrastive tuning in the second one. Specifically, we have developed 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters, and we have pretrained them on a collected large-scale dataset of Chinese image-text pairs. Our comprehensive experiments demonstrate that Chinese CLIP can achieve state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark. We have released our codes, models, and demos1.

∗ Co-first authors. ♣ Corresponding author. † Work done as an intern in DAMO Academy.
1 GitHub: https://github.com/OFA-Sys/Chinese-CLIP; ModelScope: https://www.modelscope.cn/models

Figure 1: Comparison of CLIP and Chinese CLIP models on the Chinese native retrieval benchmark MUGE (y-axis: Mean Accuracy (%); models: RN50, ViT-B/16, ViT-L/14, ViT-L/14@336, ViT-H/14). On the benchmark based on the native data (which are mostly crawled from the language-native websites, in contrast with the translated data from websites of other countries), CLIP performs far worse than our Chinese CLIP. Note that since CLIP ViT-H/14 is not released, we use the model from OpenCLIP (Ilharco et al., 2021) instead.

1 Introduction

Starting from the burst of pretraining in NLP, foundation models have attracted attention from multiple research communities. Foundation models that learn from large-scale unsupervised or weakly supervised data play as the basis of downstream models. A milestone of foundation models (Bommasani et al., 2021) in multimodal representation learning is CLIP (Radford et al., 2021). Different from the conventional generative pretraining, CLIP is a contrastive-learning-based model pretrained on a large-scale dataset of around 400 million image-text pairs collected from the web. Despite the simplicity of the method, CLIP not only achieved outstanding performance in vision-language retrieval but, more importantly, played as a vision foundation model and demonstrated state-of-the-art performance in zero-shot image classification across a series of datasets. CLIP, which builds a connection between vision and language, has been transforming the research in both multimodal representation learning and computer vision.

Be that as it may, it is difficult to efficiently transfer a cross-modal pretrained model to another language for several causes. First, learning to model the distribution of language-native vision and language data is significant for the transfer. Though CLIP performs as a strong foundation model in most scenarios, we find that it is hard for CLIP with machine translation to perform well on the Chinese-native cross-modal retrieval benchmark.
Figure 1 demonstrates large performance gaps between the original CLIP and our Chinese CLIP at all model scales. We assume that it is crucial for both encoders to learn from the language-native images and texts. Second, the performance of previous methods for Chinese multimodal pretraining has been inhibited by several factors. Pretraining from scratch requires collecting a large-scale, high-quality, language-specific image-text pair dataset similar to Web Image Text (WIT) for OpenAI CLIP (Fei et al., 2021; Xie et al., 2022). Though the fast transfer of CLIP to Chinese data can be realized by using CLIP initialization and Locked-image Tuning (Zhai et al., 2022), the vision encoder still cannot learn the information of images from the language-specific domains (Gu et al., 2022).

Therefore, we propose Chinese CLIP, a language-specific vision-language foundation model pretrained on publicly available Chinese image-text pair data. Additionally, we still use the same architecture as OpenAI CLIP. To realize efficient transfer of a cross-modal foundation model to Chinese data, we develop a two-stage pretraining method, which is also adaptive to other vision-language foundation models, e.g., ALIGN, Florence, etc. Here in this work, we use CLIP as an example. To be specific, we first initialize both encoders with pretrained models, namely vision encoders from CLIP and text encoders from RoBERTa-wwm-Chinese (Cui et al., 2020). In Stage 1, we freeze the image encoder and only optimize the text encoder with LiT, and in Stage 2, we train both encoders with contrastive tuning. In this way, the new model can inherit from the foundation models through initialization and LiT, and effectively transfer to language-specific data through contrastive tuning.

Figure 2: An illustration of pretraining Chinese CLIP (the example caption in the figure is 一只金丝猴的照片, "Photo of a golden monkey"). To leverage the advantages of the existing pretrained models, we initialize the image encoder with the OpenAI CLIP models, and the text encoder with the Chinese RoBERTa models. In Stage 1, we freeze the weights of the image encoder to avoid weight optimization, and in Stage 2, we unfreeze it and optimize both encoders.

We evaluate Chinese CLIP on 3 Chinese cross-modal retrieval datasets, including MUGE2, Flickr30K-CN (Lan et al., 2017), and COCO-CN (Li et al., 2019c). Experimental results demonstrate that both the large-size and huge-size Chinese CLIP reach state-of-the-art performance on the 3 datasets in the setups of both zero-shot learning and finetuning. Additionally, we evaluate the capability of zero-shot image classification on the track "Image Classification in the Wild" of the ELEVATER benchmark (Li et al., 2022b). On the classification datasets, Chinese CLIP demonstrates competitive performance in comparison with state-of-the-art methods and outperforms the Chinese baselines. Furthermore, we provide NVIDIA TensorRT and ONNX models for deployment, which run around 2 to 10 times faster than PyTorch models for inference.

2 https://tianchi.aliyun.com/muge

In brief, our contributions are:

• We propose Chinese CLIP, a simple implementation of CLIP pretrained on our collected large-scale Chinese image-text pair data, and we propose a two-stage pretraining method to achieve high pretraining efficiency and improved downstream performance.

• Chinese CLIP achieves state-of-the-art performance in cross-modal retrieval in the setups of zero-shot learning and finetuning, and competitive performance in zero-shot image classification.

2 Method

CLIP (Radford et al., 2021), based on simple vision-language contrastive pretraining on large-scale weakly supervised data, is a significant foundation model in multimodal representation learning. It can transfer to cross-modal retrieval directly, and its image encoder can play as a vision backbone. In this work, we propose to build a language-specific CLIP model by pretraining a vision-language model on large-scale Chinese multimodal data. In the following, we provide the details of the method design and implementation of our Chinese CLIP.
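To make the training objective concrete, the following is a minimal PyTorch sketch of the symmetric image-text contrastive (InfoNCE) loss that CLIP-style two-tower models optimize. It is an illustration written for this summary, not the released Chinese CLIP code, and the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text features."""
    # Normalize both towers' outputs so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarity matrix; matched pairs lie on the diagonal.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    targets = torch.arange(image_features.size(0), device=image_features.device)
    # Cross-entropy in both the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits_per_image, targets) +
            F.cross_entropy(logits_per_text, targets)) / 2
```

The same objective is reused both for pretraining and, later, for finetuning on downstream retrieval data.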
2.1 Data

One key to CLIP's success should be the large-scale dataset for pretraining. Based on the experiments of a CLIP reimplementation (Ilharco et al., 2021), scaling up data and lengthening the training process can consistently improve the model performance in zero-shot learning. This year, the most recent multimodal pretrained models Wukong (Gu et al., 2022) and R2D2 (Xie et al., 2022) were pretrained on a public dataset of 100 million image-text pairs and an in-house dataset of 250 million samples, where only a subset of 23 million samples was released. To facilitate reimplementation, we aim at pretraining Chinese CLIP on as much publicly available data as possible, and thus we focus on collecting high-quality public datasets. We extract the Chinese data (with the mark "zh") from the latest LAION-5B (Schuhmann et al., 2021), and collect the data from the Wukong dataset. However, due to the problem of unavailable links, we can only collect around 108 million samples and 72 million samples from LAION-5B and Wukong respectively. We additionally add the translated data from the classic English multimodal datasets, including Visual Genome (Krishna et al., 2017) and MSCOCO (Chen et al., 2015), where test sets are removed. Finally, we construct a dataset for Chinese multimodal pretraining with around 200 million image-text pairs.3

3 We add around 20 million high-quality internal image-text pairs to provide more diversity.

Below we illustrate the procedure for data preprocessing. For the part of the data from LAION-5B, we remove the samples with CLIP scores lower than 0.26 computed by mCLIP (Carlsson et al., 2022). Besides, we remove the samples with captions containing words in our internal blacklist. The blacklist contains words related to advertising, image filenames, etc. We remove those samples whose captions are too short (fewer than 5 characters) or too long (more than 50 characters). For the images, we resize them to the resolution of 224 × 224 for most cases and 336 × 336 for the ViT-L/14@336px.
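As a hedged illustration of the filtering rules above (not the actual preprocessing script), a per-sample filter could look like the following. The blacklist entries shown are placeholders, since the real blacklist is internal, and the mCLIP-score filter only applies to the LAION-5B portion, where scores are available.

```python
from typing import Optional

# Illustrative placeholders only; the internal blacklist is not released.
PLACEHOLDER_BLACKLIST = {"促销", "广告", ".jpg", ".png"}

def keep_sample(caption: str, mclip_score: Optional[float],
                min_len: int = 5, max_len: int = 50, min_score: float = 0.26) -> bool:
    """Return True if an image-text pair passes the caption-length, blacklist,
    and mCLIP-score filters described in Section 2.1."""
    if not (min_len <= len(caption) <= max_len):
        return False
    if any(term in caption for term in PLACEHOLDER_BLACKLIST):
        return False
    # Score filter is applied only to samples (e.g. LAION-5B) that come with a score.
    if mclip_score is not None and mclip_score < min_score:
        return False
    return True
```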
2.2 Pretraining Method

There are multiple design choices for pretraining the Chinese CLIP models. One of the simplest methods should be pretraining from scratch, where both the image and text encoders are randomly initialized. However, we assume that its performance will be limited by the quantity and quality of the current pretraining data. To leverage the advantages of existing pretrained models, we initialize the models with weights from the pretrained checkpoints from the official release of CLIP 4 for the image encoder, and RoBERTa-wwm-ext and RBT3 5 for the text encoder. To adapt the model to the introduced pretraining data, one option is to pretrain it with "contrastive tuning", similar to the way to transfer CLIP to downstream retrieval data. In comparison with contrastive tuning, Locked-image Tuning (LiT) (Zhai et al., 2022) demonstrated improved performance in downstream transfer.

4 https://github.com/openai/CLIP License: MIT.
5 https://github.com/ymcui/Chinese-BERT-wwm License: Apache 2.0.

In this work, we propose a two-stage pretraining method, as shown in Figure 2. The core idea is to first utilize LiT to enable the text encoder to read out high-quality representations from the foundation vision model from OpenAI CLIP, and then transfer the whole model to the domain of the introduced pretraining data. It is not sufficient to pretrain Chinese CLIP with solely LiT, as the image encoder should learn the information of the images of the Chinese datasets and model the distribution of such data. Before the two-stage pretraining, we first initialize both encoders with pretrained models. In Stage 1, we "lock" the image encoder by freezing its parameters during pretraining. We only pretrain the text encoder for vision-language alignment, based on the assumption that the vision backbone with pretrained weights is already a powerful vision foundation model (Zhai et al., 2022; Gu et al., 2022). We pretrain it until there is no salient performance improvement in downstream tasks, even if we prolong the pretraining progress. Then we switch to Stage 2, where we "unlock" the image encoder by enabling its optimization. In Stage 2, we continue pretraining without any parameter frozen, so that the image encoder can learn to model the distribution of the image data from Chinese websites. In the ablation study, we discuss the influence of the initialization of the pretrained checkpoints and pretraining methods on the downstream performance. Experimental results show that the two-stage pretraining method can outperform either pretraining from scratch or directly finetuning from the pretrained models.
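A minimal sketch of how the two stages differ in code, assuming a CLIP-like module whose image tower is exposed as `model.visual`; the attribute name is an assumption about the interface, not the released implementation.

```python
def configure_stage(model, stage: int):
    """Stage 1 locks the image tower (LiT-style); Stage 2 unfreezes it for contrastive tuning."""
    freeze_image_tower = (stage == 1)
    for param in model.visual.parameters():   # `model.visual` is an assumed attribute name
        param.requires_grad = not freeze_image_tower
    if freeze_image_tower:
        # Keep e.g. BatchNorm running statistics of a ResNet tower fixed while it is locked.
        model.visual.eval()
    else:
        model.visual.train()
    return model
```

In practice the optimizer for Stage 2 would be rebuilt (or its parameter groups extended) so that it also covers the newly unfrozen image-tower parameters.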
Tasks Zero-shot Finetuning
Metrics R@1 R@5 R@10 MR R@1 R@5 R@10 MR
Tiny-size Model
CN-CLIPRN50 42.6 68.6 77.9 63.0 48.6 75.1 84.0 69.2
Base-size Models
WukongViT-B/32 33.4 59.3 69.7 54.1 39.2 66.9 77.4 61.2
R2D2ViT-B - - - - 47.4 75.1 83.5 68.7
CN-CLIPViT-B/16 52.1 76.7 84.4 71.1 58.4 83.6 90.0 77.4
Large-size Models
WukongViT-L/14 42.7 69.0 78.0 63.2 52.7 77.9 85.6 72.1
R2D2ViT-L/14 49.5 75.7 83.2 69.5 60.1 82.9 89.4 77.5
CN-CLIPViT-L/14 56.3 79.8 86.2 74.1 63.3 85.6 91.3 80.1
CN-CLIPViT-L/14@336px 59.0 81.4 87.8 76.1 65.3 86.7 92.1 81.3
Huge-size Models
CN-CLIPViT-H/14 63.0 84.1 89.2 78.8 68.9 88.7 93.1 83.6

Table 1: Experimental results on MUGE-Retrieval. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.

Tasks Text-to-Image Image-to-Text


Setups Zero-shot Finetuning Zero-shot Finetuning
Metrics R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Tiny-size Model
CN-CLIPRN50 48.8 76.0 84.6 66.7 89.4 94.1 60.0 85.9 92.0 84.2 96.7 98.0
Base-size Models
WukongViT-B/32 45.7 73.8 82.2 67.6 89.6 94.2 66.2 88.7 94.3 83.9 97.6 99.0
R2D2ViT-B - - - 78.3 94.6 97.0 - - - 92.6 99.1 99.8
CN-CLIPViT-B/16 62.7 86.9 92.8 79.1 94.8 97.4 74.6 93.5 97.1 93.5 99.0 99.5
Large-size Models
WukongViT-L/14 51.7 78.9 86.3 77.4 94.5 97.0 76.1 94.8 97.5 92.7 99.1 99.6
R2D2ViT-L/14 60.9 86.8 92.7 84.4 96.7 98.4 77.6 96.7 98.9 95.6 99.8 100.0
CN-CLIPViT-L/14 68.0 89.7 94.4 82.7 96.7 98.6 80.2 96.6 98.2 96.1 99.5 99.9
CN-CLIPViT-L/14@336px 69.0 90.7 95.4 84.4 97.1 98.7 83.3 97.2 98.5 96.6 99.8 100.0
Huge-size Models
CN-CLIPViT-H/14 71.2 91.4 95.5 83.8 96.9 98.6 81.6 97.5 98.8 95.3 99.7 100.0

Table 2: Experimental results on Flickr30K-CN. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.

3 Evaluation

To comprehensively probe the effects of Chinese CLIP, we follow the conventional practice that we first evaluate its basic capabilities of cross-modal retrieval, i.e., text-to-image retrieval and image-to-text retrieval, in different domains, including e-commerce and the general domain. Additionally, as the contrastive-learning-based pretraining builds a foundation vision model that is semantically connected with natural language, we follow Radford et al. (2021) and evaluate its capabilities of zero-shot classification. Specifically, we validate Chinese CLIP on the classification datasets of the ELEVATER benchmark (Li et al., 2022b), which is known as "Image Classification in the Wild (ICinW)".

3.1 Cross-modal Retrieval

3.1.1 Datasets and Metrics

We validate Chinese CLIP on 3 cross-modal retrieval datasets, namely MUGE-Retrieval, Flickr30K-CN (Lan et al., 2017), and COCO-CN (Li et al., 2019c). MUGE-Retrieval is an image-text retrieval dataset, where data are extracted from Chinese E-commerce websites. Flickr30K-CN and COCO-CN are built from the classical datasets Flickr30K and MSCOCO-1K, whose texts are translated into Chinese.
Tasks Text-to-Image Image-to-Text
Setups Zero-shot Finetuning Zero-shot Finetuning
Metrics R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Tiny-size Model
CN-CLIPRN50 48.1 81.3 90.5 66.8 91.1 97.0 51.6 81.2 90.5 68.4 93.3 97.8
Base-size Models
WukongViT-B/32 49.2 79.4 87.9 67.0 91.4 96.7 48.3 77.8 88.8 65.8 90.3 96.6
R2D2ViT-B - - - 75.1 94.2 98.1 - - - 76.1 95.3 98.5
CN-CLIPViT-B/16 62.2 86.6 94.9 77.0 97.1 99.0 57.0 84.1 93.6 77.4 96.2 98.9
Large-size Models
WukongViT-L/14 53.4 80.2 90.1 74.0 94.4 98.1 55.2 81.0 90.6 73.3 94.0 98.0
R2D2ViT-L/14 56.4 85.0 93.1 79.1 96.5 98.9 63.3 89.3 95.7 79.3 97.1 98.7
CN-CLIPViT-L/14 64.0 89.2 94.4 78.9 96.3 99.0 60.4 84.2 92.9 80.2 96.7 99.2
CN-CLIPViT-L/14@336px 64.7 89.6 94.6 80.1 96.7 99.2 63.4 87.2 94.4 81.2 97.2 99.1
Huge-size Models
CN-CLIPViT-H/14 69.2 89.9 96.1 81.5 96.9 99.1 63.0 86.6 92.9 83.5 97.3 99.2

Table 3: Experimental results on COCO-CN. We report the performance of both baselines and Chinese CLIP models
on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning. Since
machine translated COCO is included in our pretraining dataset, here the numbers of Chinese CLIP zero-shot
performances are shown in gray.

Our evaluation includes setups of zero-shot learning and finetuning. For zero-shot learning, we use Chinese CLIP models to compute the similarity scores between images and texts and return the top-K most similar candidates. For finetuning, we finetune the Chinese CLIP models for cross-modal retrieval with contrastive tuning. The evaluation is the same as that in zero-shot learning. The evaluation metrics are Recall@K, where K = {1, 5, 10}, and Mean Recall (MR, i.e., the average of Recall@K). For comparison, we choose the base-size and large-size Wukong and R2D2 as the baselines, which are the previous SOTA models in Chinese multimodal representation learning. Following these baselines, we report validation performance on MUGE and test performance on Flickr30K-CN and COCO-CN. Note that in the setup of finetuning, R2D26 is essentially an end-to-end model of retrieval and ranking.

6 Since the original paper of R2D2 does not provide the patch size of their base-size model, here we denote the model as R2D2ViT-B.
3.1.2 Results

Table 1 reports the model performance on MUGE-Retrieval. For the base-size model, CN-CLIPViT-B/16 outperforms the baselines on all metrics and in both setups of zero-shot learning and finetuning. Specifically, for the base-size models, CN-CLIPViT-B/16 surpasses WukongViT-B/32 by 17.0 MR in zero-shot learning and surpasses R2D2ViT-B by 8.7 MR in finetuning. Besides, the tiny model CN-CLIPRN50 can outperform the base-size WukongViT-B/32 by 8.9 MR in zero-shot learning and 8.0 MR in finetuning.

For the large-size models, CN-CLIPViT-L/14 can outperform both baselines in all metrics, and CN-CLIPViT-L/14@336px, pretrained on images of a larger resolution, can achieve the state-of-the-art performance. CN-CLIPViT-L/14@336px outperforms R2D2ViT-L/14 by 6.6 MR in zero-shot learning and 3.8 MR in finetuning. When scaling to CN-CLIPViT-H/14, the performance is further improved. Compared with the best large-size model CN-CLIPViT-L/14@336px, CN-CLIPViT-H/14 surpasses it by 2.7 MR in zero-shot learning and 2.3 MR in finetuning.

Tables 2 and 3 report the model performance on Flickr30K-CN and COCO-CN. We focus on the evaluation of R@1. In both datasets, CN-CLIP achieves better performance than the baselines. For the base-size models, in the setup of zero-shot learning of Flickr30K-CN, CN-CLIPViT-B/16 surpasses WukongViT-B/32 by 17.0 R@1 in text-to-image retrieval and 8.4 R@1 in image-to-text retrieval, and in the finetuning setup, CN-CLIPViT-B/16 surpasses R2D2ViT-B by 0.8 R@1 in image retrieval and 0.9 R@1 in text retrieval. Similarly, in the finetuning setup of COCO-CN, CN-CLIPViT-B/16 surpasses R2D2ViT-B by 1.9 R@1 in image retrieval and 1.3 R@1 in text retrieval. For the tiny-size CN-CLIPRN50, it again achieves or surpasses the performance of WukongViT-B/32 in several metrics of Flickr30K-CN and COCO-CN. Specifically, CN-CLIPRN50 surpasses WukongViT-B/32 by 3.1 R@1 in the zero-shot learning of Flickr30K-CN image retrieval and by 2.6 R@1 in the finetuning of COCO-CN text retrieval.

For the large-size models, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-L/14 surpasses WukongViT-L/14 by 16.3 R@1 in text-to-image retrieval and 4.1 R@1 in image-to-text retrieval. CN-CLIPViT-L/14@336px further improves over CN-CLIPViT-L/14 by 1.0 R@1 in image retrieval and 3.1 R@1 in text retrieval. In the finetuning setup, CN-CLIPViT-L/14 surpasses R2D2ViT-L/14 by 0.5 R@1 in text retrieval. CN-CLIPViT-L/14@336px achieves equal performance with R2D2ViT-L/14 in image retrieval and surpasses it by 1.0 R@1 in text retrieval. Similarly, in the finetuning setup of COCO-CN, CN-CLIPViT-L/14 surpasses R2D2ViT-L/14 by 0.9 R@1 in text retrieval. CN-CLIPViT-L/14@336px further surpasses R2D2ViT-L/14 by 1.0 R@1 in image retrieval and 1.9 R@1 in text retrieval.

On Flickr30K-CN and COCO-CN, scaling from CN-CLIPViT-L/14 to CN-CLIPViT-H/14 improves the performance in almost all the metrics. Specifically, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-H/14 surpasses CN-CLIPViT-L/14 by 3.2 R@1 in image retrieval and 1.4 R@1 in text retrieval. Moreover, in the finetuning setup of COCO-CN, CN-CLIPViT-H/14 even surpasses CN-CLIPViT-L/14@336px with its larger image resolution by 1.4 R@1 in image retrieval and 2.3 R@1 in text retrieval. We also compare our CN-CLIPViT-H/14 with a huge CLIP-like model, T-Bletchley7. This model has 2.5 billion parameters and is pretrained on billions of multilingual image-caption pairs. In the finetuning setup of COCO-CN, with smaller sizes of model parameters and pretraining dataset, CN-CLIPViT-H/14 still surpasses T-Bletchley by 3.9 MR.

7 https://www.microsoft.com/en-us/research/blog/turing-bletchley-a-universal-image-language-representation-model-by-microsoft/

3.1.3 Ablation Study

Here we provide an ablation study on our proposed two-stage training method. To validate its significance and effectiveness, we design several setups for the ablation study. Our experiments are conducted on CN-CLIPViT-B/16. To evaluate the influence of initialization, we pretrain a model from scratch, and to examine the influence of LiT, we pretrain a model without freezing the image encoder. For a better demonstration, we report the curves of model performance of zero-shot retrieval on different datasets in terms of pretraining progress, indicated by the number of processed samples.

Figure 3 shows the performance on different tasks, namely MUGE text-to-image retrieval, and text-to-image and image-to-text retrieval on Flickr30K-CN and COCO-CN. In comparison with pretraining with pretrained model initialization, pretraining from scratch performs much worse, though it shows consistent performance improvement in terms of pretraining progress. As to the importance of LiT, we observe different phenomena on different datasets. On MUGE, a dataset of samples originally collected from Chinese websites, we find that pretraining without LiT might be the best solution, though its performance gap with two-stage pretraining is quite small. However, on the other datasets, i.e., Flickr30K-CN and COCO-CN, whose samples are translated from the English datasets, we find that our two-stage pretraining performs significantly better than pretraining without LiT. Furthermore, we observe a common phenomenon that in the two-stage pretraining, switching from Stage 1 to Stage 2 can effectively boost the model performance to a higher level. This reflects the importance of adapting the pretrained model to the data distribution of the Chinese multimodal data, especially those concerned with visual information.

3.2 Zero-shot Image Classification

3.2.1 Open-Domain Image Classification Benchmark in Chinese

Contrastive pretraining on image-text pairs builds a connection between vision and natural language. Natural language supervision instead of crowd-sourced labeling endows models with the capability of zero-shot image classification by computing similarities between the given image and the text descriptions of the labels in the candidate set. Recent progress in this field is the ELEVATER benchmark (Li et al., 2022b). The track ICinW for open-domain image classification consists of a series of image classification datasets, including ImageNet (Deng et al., 2009), CIFAR (Krizhevsky et al., 2009), MNIST (Deng, 2012), etc.
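The zero-shot classification procedure can be sketched as follows; `encode_image`, `encode_text`, and the tokenizer are assumed interfaces of a CLIP-like model, and the Chinese prompt templates shown are only examples.

```python
import torch
import torch.nn.functional as F

EXAMPLE_TEMPLATES = ["一张{}的照片", "{}的特写图像"]  # example prompts, e.g. "a photo of a {}"

@torch.no_grad()
def zero_shot_classify(model, image, class_names, tokenizer, templates=EXAMPLE_TEMPLATES):
    """Score an image against prompt texts built from the label names and return the argmax class."""
    texts = [t.format(name) for name in class_names for t in templates]
    text_feats = F.normalize(model.encode_text(tokenizer(texts)), dim=-1)
    # Prompt ensembling: average the embeddings of all prompts belonging to each class.
    text_feats = text_feats.view(len(class_names), len(templates), -1).mean(dim=1)
    text_feats = F.normalize(text_feats, dim=-1)
    image_feat = F.normalize(model.encode_image(image), dim=-1)
    return (image_feat @ text_feats.t()).argmax(dim=-1)  # predicted class indices
```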
Figure 3: Comparison of base-size Chinese CLIP models with different training methods on MUGE, Flickr30K-CN and COCO-CN. Panels (a)–(e) plot zero-shot mean recall against the number of image-text pairs pretrained for (a) MUGE image retrieval, (b) Flickr30K-CN image retrieval, (c) Flickr30K-CN text retrieval, (d) COCO-CN image retrieval, and (e) COCO-CN text retrieval, comparing CN-CLIP Base Stage 1, Stage 2, without the two-stage method, and from scratch.

Model CIFAR10 CIFAR100 DTD EuroSAT FER FGVC KITTI MNIST PC VOC INet
Original benchmark
DeCLIP 90.9 66.8 44.9 39.9 23.3 9.0 39.7 13.6 55.3 80.6 73.7
GIT 88.5 61.1 42.9 43.4 41.4 6.7 22.1 68.9 50.0 80.2 -
ALIGN 94.9 76.8 66.1 52.1 50.8 25.0 41.2 74.0 55.2 83.0 76.4
OpenCLIP 93.5 76.2 56.4 53.7 50.3 20.8 28.8 70.9 50.5 82.3
CLIP 94.9 77.0 56.0 63.0 48.3 33.3 11.5 79.0 62.3 84.0 76.2
Translated benchmark
BriVL 72.3 35.9 18.8 25.5 - - - - - - 24.3
Wukong 95.4 77.1 40.9 50.3 - - - - - - 55.0
CN-CLIP 96.0 79.7 51.2 52.0 55.1 26.2 49.9 79.4 63.5 84.9 59.6

Table 4: Experimental results of the zero-shot image classification performance of models on ICinW.

In order to evaluate Chinese CLIP on the datasets, we first transform the datasets for Chinese models by translating labels and prompts into Chinese.

3.2.2 Experimental Results

Table 4 reports the performance of both English models and Chinese models. The baselines pretrained on English data include DeCLIP (Li et al., 2021b), ALIGN (Jia et al., 2021), CLIP (Radford et al., 2021), and OpenCLIP (Ilharco et al., 2021), and the baselines pretrained on Chinese data include BriVL (Fei et al., 2021) and Wukong (Gu et al., 2022). We report the results of the variant with the best downstream performance for the models.

We first focus on the comparison with the Chinese baselines. To be specific, on all datasets including ImageNet classification, Chinese CLIP surpasses both baselines significantly, and the relative improvements on some datasets are over 100%. Besides, we also compare Chinese CLIP with the foundation models, e.g., CLIP and ALIGN, which are pretrained on English data. It can be found that Chinese CLIP outperforms CLIP or ALIGN on CIFAR-10, CIFAR-100, FER-2013, KITTI-Distance, MNIST, PatchCamelyon, and Pascal-VOC-2007. Also, on the classification datasets for general concepts or objects common in both western and eastern culture, Chinese CLIP consistently achieves better performance. This indicates that Chinese CLIP is capable of categorizing images into general prototypes.

However, as to the classification concerned with proper nouns, e.g., FGVC-Aircraft, it is difficult for all models to achieve high accuracy. We assume that the related images and texts are not common in the pretraining datasets, and it is also hard for the models to understand the names of airplanes without finetuning. Specifically, for Chinese models, the translation or even transliteration can significantly affect the performance of Chinese CLIP. It encourages building a benchmark of "Image Classification in the Wild for Chinese Models".

3.2.3 Analysis

Sensitivity to Handcrafted Prompts While the benchmark ELEVATER provides specific prompts for each dataset, we find that this is not always the best option, in comparison with our baseline, the translation of the prompts provided by OpenAI CLIP. The baseline with around 90 prompts performs the best on average. However, for some datasets, specific prompts designed with human knowledge can boost the performance significantly. A typical case is the classification of airplanes. We test CN-CLIPViT-L/14 with our specified prompts that are related to the knowledge of aircraft, e.g., "label, a photo of an airplane", "label, a zoomed image of a fighter", etc., and the translation of the OpenAI prompts. Experimental results show that the model can achieve an accuracy of 16.0 with the specified prompts but only 13.8 with the OpenAI prompts.

Inability to Understand Negation Previous studies (Khandelwal and Sawant, 2019; Hosseini et al., 2021) demonstrate that even strong NLP pretrained models often make mistakes in negation problems. We explore CLIP's capability to understand negation by conducting experiments on KITTI-Distance (Fritsch et al., 2013) and PatchCamelyon (Veeling et al., 2018). KITTI-Distance provides 4 options for models to judge, including "next to a car", "near a car", "at a distance away from a car", and "no car". The last one is concerned with negation. We compare the model performance using the text "no cars" and "others" for the last label. We observe that it is hard for the model to understand negation. By changing the label from "others" to "no cars", the performance drops by 48.1% in accuracy (49.9 vs. 25.9). Similarly, in the experiments on PatchCamelyon, the performance drops from 63.5 to 50.2 by changing labels from "mainly red" and "green block in the middle" to "no green block in the middle" and "green block in the middle". This shows the limitation of the training of CLIP in learning negation. The texts in the pretraining datasets are mostly descriptions of the images, which indicate their objects or features but often do not indicate the absence of objects.

3.3 Deployment

For deployment, we develop ONNX-based and TensorRT-based models based on our PyTorch-based pretrained Chinese CLIP models. As expected, we observe that the inference efficiency increases significantly while there is almost no performance sacrifice. Specifically, the inference efficiency of TensorRT-based models is around 2 to 10 times faster than the PyTorch-based models. More statistics are listed in Appendix A.6.
Camelyon (Veeling et al., 2018). KITTI-Distance
Diffusion (Rombach et al., 2022), etc. The success
provides 4 options for models to judge, including
of multimodal pretraining encouraged the transfer
“next to a car”, “near a car”, “at a distance away
of the existing methods to Chinese pretraining, in-
from a car”, and “no car”. The last one is concerned
cluding generative pretrained models (Lin et al.,
with negation. We compare the model performance
2021a; Fei et al., 2021; Yang et al., 2021; Lin et al.,
using the text “no cars” and “others” for the last
2021b; Wang et al., 2022a) and contrastive pre-
label. We observe that it is hard for the model to
trained models (Fei et al., 2021; Gu et al., 2022;
understand negation. By changing the label from
Xie et al., 2022; Chen et al., 2022b).
“others” to “no cars”, the performance drops by
48.1% in accuracy (49.9 vs. 25.9). Similarly, in the
5 Conclusion
experiments on PatchCamelyon, the performance
drops from 63.5 to 50.2 by changing labels from In this work, we propose Chinese CLIP, a Chinese-
“mainly red” and “green block in the middle” to specific vision-language foundation model. Specif-
“no green block in the middle” and “green block ically, we construct a pretraining dataset of around
in the middle”. This shows the limitation of the 200 million samples, and pretrain a series of Chi-
5 Conclusion

In this work, we propose Chinese CLIP, a Chinese-specific vision-language foundation model. Specifically, we construct a pretraining dataset of around 200 million samples, and pretrain a series of Chinese CLIP models with the proposed two-stage pretraining method, which improves both pretraining efficiency and effectiveness. Our comprehensive evaluation shows that Chinese CLIP can reach state-of-the-art performance on multiple cross-modal retrieval datasets in zero-shot learning and finetuning. Furthermore, we demonstrate that Chinese CLIP models can also achieve competitive performance in zero-shot image classification across 10 datasets.

Limitations

A number of issues reflect the limitations of this work but also point out some directions for our future research. In this section, we generally discuss some limitations concerning the scale of data and model.

Data The core of CLIP pretraining is the simple but effective large-scale contrastive pretraining on extremely large-scale data. Though we have utilized around 200 million samples, compared with recent studies (Yuan et al., 2021; Chen et al., 2022a) the scale of our pretraining data is relatively small. Thus one of our next-step studies is scaling up the quantity of the pretraining data to evaluate the performance improvement with data scaling. Furthermore, we still find it hard to decide what a "high-quality" dataset for CLIP is. In the previous studies (Jia et al., 2021; Li et al., 2021b), the preprocessing methods are mostly simple to avoid the loss of data. However, there are still many samples where the image and text are not matched properly, which may provide negative information to the pretraining. In our future research, we plan to use the pretrained Chinese CLIP model to compute a score for each image-text pair in a larger dataset, filter those whose scores are under a specified threshold, and pretrain new models with the new data. This is one of the possible solutions to explore the relationship between data quality and pretraining effectiveness. Also, such cycling might bring continuous performance enhancement in downstream tasks.

Model Recently we have witnessed that in many domains the scaling of model size can lead to consistent performance improvement (Gordon et al., 2021; Wei et al., 2022), and in this work, we also find that the scaling of model size for Chinese CLIP can achieve steady performance improvement in different downstream tasks, including retrieval and classification. Recent studies have scaled ViT and also CLIP-like models to a much larger scale than our largest CN-CLIPViT-H/14, e.g., the 3B Swin-v2 (Liu et al., 2022), the 4B ViT-e (Chen et al., 2022a), etc. In the future, we will continue exploring scaling up models in line with scaling up data in order to build a more effective Chinese CLIP. Another issue of model scaling connected with real-world application is how to build effective small models. Experimental results show that our smallest Chinese CLIP CN-CLIPRN50 performs much worse than the ViT variants. However, in real-world applications, effective small models that are available for deployment are usually more welcome. Thus it is necessary to explore distillation for CLIP so that the capability of large models can be transferred to small models for application.

Ethics Statement

The proposed model is a contrastive-learning-based vision-language foundation model, which generates features for images and texts. Those features can be representations of visual and linguistic information, and they can support applications such as search engines, recommender systems, etc. Besides, this model can play as a foundation model supporting recent image generation models, e.g., diffusion models (Ramesh et al., 2022). This may create risks, as AI-generated contents may reflect harmful information, such as hate, bias, pornography, etc. In most cases, these cases should be attributed to the training of the image generation models. Still, we cannot avoid the negative effects of the CLIP representations on the generation. In the future, we will study how to filter pretraining data to avoid the potential risks.

References

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer.

Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. 2022. Cross-lingual and multilingual clip. In Proceedings of the Language Resources and Evaluation Conference, pages 6848–6854, Marseille, France. European Language Resources Association.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022a. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: universal image-text representation learning. In ECCV 2020, volume 12375 of Lecture Notes in Computer Science, pages 104–120. Springer.

Zhong-Yong Chen, Guangyi Liu, Bohan Zhang, Fulong Ye, Qinghong Yang, and Ledell Yu Wu. 2022b. Altclip: Altering the language encoder in clip for extended language capabilities. ArXiv, abs/2211.06679.

Gong Cheng, Junwei Han, and Xiaoqiang Lu. 2017. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 657–668, Online. Association for Computational Linguistics.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255. IEEE Computer Society.

Li Deng. 2012. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics.

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. 2010. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis., 88(2):303–338.

Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, et al. 2021. Wenlan 2.0: Make ai imagine via a multimodal foundation model. arXiv preprint arXiv:2110.14378.

Li Fei-Fei, Rob Fergus, and Pietro Perona. 2004. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE.

Jannik Fritsch, Tobias Kuehnl, and Andreas Geiger. 2013. A new performance measure and evaluation benchmark for road detection algorithms. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pages 1693–1700. IEEE.

Mitchell A. Gordon, Kevin Duh, and Jared Kaplan. 2021. Data and parameter scaling laws for neural machine translation. In EMNLP 2021, pages 5915–5922. Association for Computational Linguistics.

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Hang Xu, Xiaodan Liang, Wei Zhang, Xin Jiang, and Chunjing Xu. 2022. Wukong: 100 million large-scale chinese cross-modal pre-training dataset and a foundation framework. arXiv preprint arXiv:2202.06767.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2019. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226.

Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R Devon Hjelm, Alessandro Sordoni, and Aaron Courville. 2021. Understanding by understanding not: Modeling negation in language models. arXiv preprint arXiv:2105.03519.

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. CoRR, abs/2004.00849.

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. 2021. Openclip. If you use this software, please cite it as below.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918.

Aditya Khandelwal and Suraj Sawant. 2019. Negbert: a transfer learning approach for negation detection and scope resolution. arXiv preprint arXiv:1911.04211.

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33:2611–2624.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73.

Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images.

Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017. Fluency-guided cross-lingual image captioning. Proceedings of the 25th ACM international conference on Multimedia.

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. 2022a. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005.

Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu, et al. 2022b. Elevater: A benchmark and toolkit for evaluating language-augmented visual models. arXiv preprint arXiv:2204.08790.

Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. CoRR, abs/1908.06066.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022c. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven Chu-Hong Hoi. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS 2021, pages 9694–9705.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. Visualbert: A simple and performant baseline for vision and language. ArXiv, abs/1908.03557.

Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, and Jieping Xu. 2019c. Coco-cn for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21:2347–2360.

Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.

Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. 2021b. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.

Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. 2021a. M6: A chinese multimodal pretrainer. CoRR, abs/2103.00823.

Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. 2021b. M6-10T: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining. CoRR, abs/2110.03888.

Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, and Hongxia Yang. 2020. Interbert: Vision-and-language interaction for multi-modal pretraining. CoRR, abs/2003.13198.

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. 2022. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS 2019, pages 13–23.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2012. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE Computer Society.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2021. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383.

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. 2011. The german traffic sign recognition benchmark: A multi-class classification competition. In IJCNN 2011, pages 1453–1460. IEEE.

Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. 2018. Rotation equivariant cnns for digital pathology. In International Conference on Medical image computing and computer-assisted intervention, pages 210–218. Springer.

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022a. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR, abs/2202.03052.

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2022b. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.

Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021a. Vlmo: Unified vision-language pretraining with mixture-of-modality-experts. CoRR, abs/2111.02358.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021b. Simvlm: Simple visual language model pretraining with weak supervision. CoRR, abs/2108.10904.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Chunyu Xie, Heng Cai, Jianfei Song, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Henrique Morimitsu, Lin Yao, Dexin Wang, Dawei Leng, et al. 2022. Zero and r2d2: A large-scale chinese cross-modal benchmark and a vision-language framework. arXiv preprint arXiv:2205.03860.

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, and Fei Huang. 2021. E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804.

An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, and Hongxia Yang. 2021. Exploring sparse expert models and beyond. CoRR, abs/2105.15082.

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. FILIP: fine-grained interactive language-image pre-training. CoRR, abs/2111.07783.

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A new foundation model for computer vision. CoRR, abs/2111.11432.

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In CVPR 2021, pages 5579–5588. Computer Vision Foundation / IEEE.
A Appendix

A.1 Model Architecture Details

We develop 5 Chinese CLIP models of different sizes, spanning from around 77 to 958 million parameters. We include 1 ResNet-50 model CN-CLIPRN50 and 4 ViT models, i.e., CN-CLIPViT-B/16, CN-CLIPViT-L/14, CN-CLIPViT-L/14@336px and CN-CLIPViT-H/14, where models are pretrained on images of the resolution of 224 × 224 without specification. Table 5 presents the details of the model architecture. The smallest model CN-CLIPRN50 consists of a ResNet-50 for the image encoder and a RBT3 for the text encoder. The base-size model CN-CLIPViT-B/16 consists of a ViT-B/16@224px for the image encoder and a RoBERTa-wwm-Base for the text encoder. The large-size model CN-CLIPViT-L/14 consists of a ViT-L/14@224px for the image encoder and a RoBERTa-wwm-Base for the text encoder, while CN-CLIPViT-L/14@336px consists of a ViT-L/14@336px and a RoBERTa-wwm-Base. Specifically, we pretrain CN-CLIPViT-L/14@336px by continuing pretraining on the pretrained CN-CLIPViT-L/14. For the adaptation to a larger resolution, we initialize the image positional embedding by applying interpolation to the positional embedding of CN-CLIPViT-L/14, following Ilharco et al. (2021). The huge-size model CN-CLIPViT-H/14 consists of a ViT-H/14 for the image encoder and RoBERTa-wwm-Large for the text encoder. More implementation details are presented in Appendix A.

We provide more details of their model architectures in Table 6 and Table 7. We keep the architecture of ResNet-50, ViT-B/16, and ViT-L/14 backbones conformed with OpenAI CLIP and the architecture of ViT-H/14 same with LAION CLIP8. This enables us to initialize the Chinese CLIP image encoders with their weights. The text encoders are Chinese Roberta models (Cui et al., 2020). Specifically, our most lightweight tiny-size Chinese CLIP uses the architecture of the 3-layer RBT3 model. The base-size and large-size Chinese CLIP models use the 12-layer architecture of RoBERTa-wwm-Base. For the huge-size CN-CLIP, we use the 24-layer architecture of RoBERTa-wwm-Large. The vocabulary size of the text tokenizer is 21,128.

8 https://laion.ai/blog/large-openclip/
Model #Params (All) Backbone (I) #Params (I) Backbone (T) #Params (T) Resolution
CN-CLIPRN50 77M ResNet-50 38M RBT3 39M 224 × 224
CN-CLIPViT-B/16 188M ViT-B/16 86M RoBERTa-wwm-Base 102M 224 × 224
CN-CLIPViT-L/14 406M ViT-L/14 304M RoBERTa-wwm-Base 102M 224 × 224
CN-CLIPViT-L/14@336px 407M ViT-L/14 304M RoBERTa-wwm-Base 102M 336 × 336
CN-CLIPViT-H/14 958M ViT-H/14 632M RoBERTa-wwm-Large 326M 224 × 224

Table 5: Hyperparameters of Chinese CLIP models of different sizes.

Model Embedding-dimension Vision-Transformer(layers width heads) Text-Transformer(layers width heads)
CN-CLIPViT-B/16 512 12 768 12 12 768 12
CN-CLIPViT-L/14 768 24 1,024 16 12 768 12
CN-CLIPViT-L/14@336px 768 24 1,024 16 12 768 12
CN-CLIPViT-H/14 1,024 32 1,280 16 24 1,024 24

Table 6: Detailed architecture hyperparameters of ViT-based CN-CLIP models.

A.2 Pretraining Details

Initialization As mentioned in Section 2.2, we initialize the image encoders of CN-CLIPRN50, CN-CLIPViT-B/16 and CN-CLIPViT-L/14 using the OpenAI CLIP weights. The image encoder of CN-CLIPViT-H/14 is initialized with LAION CLIP. Besides the ResNet or ViT parameters, the temperature and the visual output projection parameters are also initialized with the pretrained CLIP weights. For the text encoder, we initialize the parameters using the released Chinese RoBERTa weights of the corresponding model scale, with their pooler weights discarded. The text output projection weight is randomly initialized from a normal distribution.

Stage 1 The pretraining hyperparameters of Stage 1 are shown in Table 8, which are shared by CN-CLIPRN50, CN-CLIPViT-B/16, CN-CLIPViT-L/14 and CN-CLIPViT-H/14. The values of the hyperparameters are generally similar to those in OpenAI CLIP (Radford et al., 2021). As to data augmentation, we use random resized cropping and AutoAugment (Cubuk et al., 2019) on input images. We leverage all-gather communications across GPU workers to compute the contrastive loss on the global batch. The above 4 models are pretrained for around 20, 44, 64, and 26 epochs in this stage respectively, with the image encoder frozen. The running mean and variance of the batch normalization layers are not updated in this stage for CN-CLIPRN50. The optimal number of pretraining epochs is determined by measuring the mean recall on the 3 downstream zero-shot retrieval tasks during training. Mixed-precision training is enabled. In this stage, we pretrain for 1.6 days using 64 NVIDIA V100 GPUs for CN-CLIPRN50, 4.5 days using 128 NVIDIA V100 GPUs for CN-CLIPViT-B/16, 11.5 days using 128 NVIDIA V100 GPUs for CN-CLIPViT-L/14, and 3.8 days using 184 NVIDIA A100 GPUs for CN-CLIPViT-H/14.

Stage 2 In Stage 2, we unfreeze the image encoder and update all the model parameters. Except for the peak learning rate, batch size and training epochs, all other hyperparameters mentioned in Stage 1 are kept unchanged. We decrease the learning rate to 2e-5 for finer optimization. For CN-CLIPRN50, CN-CLIPViT-B/16 and CN-CLIPViT-L/14, the batch size is shrunk to 16,384, 16,384 and 4,608 respectively due to the limitation in GPU memory. When scaling to CN-CLIPViT-H/14, we implement gradient checkpointing, which enables a larger batch size of 32,768. These 4 models are pretrained for around 44, 15, 7 and 7 epochs in Stage 2, respectively. In this stage, we pretrain CN-CLIPRN50 for 5.8 days using 64 NVIDIA V100 GPUs, CN-CLIPViT-B/16 for 3.0 days using 128 NVIDIA V100 GPUs, CN-CLIPViT-L/14 for 8.0 days using 128 NVIDIA V100 GPUs, and CN-CLIPViT-H/14 for 2.2 days using 184 NVIDIA A100 GPUs.

To pretrain a model at a larger resolution, we apply interpolation to the image positional embedding of CN-CLIPViT-L/14 to adapt it to the larger input size and continue pretraining with images of resolution 336 × 336. We start from CN-CLIPViT-L/14 and continue pretraining for 2 epochs. This pretraining only costs 128 NVIDIA A100 GPUs for 0.7 days.
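As a rough illustration of the all-gather strategy described above, the snippet below sketches a symmetric contrastive loss computed over the global batch. It is a simplified sketch under our own assumptions (features already L2-normalized, the distributed process group already initialized), not the released training code; in Stage 1 the image tower would additionally be frozen, e.g. by setting requires_grad to False on its parameters and keeping it in eval mode so that batch-normalization statistics are not updated.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def global_batch_clip_loss(image_feat, text_feat, logit_scale):
    """Symmetric InfoNCE loss over the global batch.

    image_feat / text_feat: L2-normalized local features of shape [B, D].
    logit_scale: the exponentiated learnable temperature (a scalar tensor).
    Gathered features from other workers only serve as extra negatives;
    gradients flow through the local shard, which is re-inserted below.
    """
    world, rank = dist.get_world_size(), dist.get_rank()
    gathered_img = [torch.zeros_like(image_feat) for _ in range(world)]
    gathered_txt = [torch.zeros_like(text_feat) for _ in range(world)]
    dist.all_gather(gathered_img, image_feat)
    dist.all_gather(gathered_txt, text_feat)
    # re-insert the local tensors so they keep their autograd graph
    gathered_img[rank] = image_feat
    gathered_txt[rank] = text_feat
    all_img = torch.cat(gathered_img, dim=0)   # [world * B, D]
    all_txt = torch.cat(gathered_txt, dim=0)

    logits_per_image = logit_scale * image_feat @ all_txt.t()   # [B, world * B]
    logits_per_text = logit_scale * text_feat @ all_img.t()
    # the positive for local sample i sits at global column rank * B + i
    batch = image_feat.size(0)
    labels = torch.arange(batch, device=image_feat.device) + rank * batch
    return 0.5 * (F.cross_entropy(logits_per_image, labels) +
                  F.cross_entropy(logits_per_text, labels))
```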
Model Embedding dimension ResNet (blocks / width) Text Transformer (layers / width / heads)
CN-CLIPRN50 1,024 (3, 4, 6, 3) 2,048 3 768 12

Table 7: Detailed architecture hyperparameters of ResNet-based CN-CLIPRN50 .

Hyperparameters Value
Batch size 32,768
Maximum text length 50
Peak learning rate 1e-4
Learning rate schedule Cosine
Maximum temperature 100
Weight decay 1e-3
Warmup iterations 5,000
Adam β1 0.9
Adam β2 0.999 (ResNet), 0.98 (ViT)
Adam ϵ 1e-8 (ResNet), 1e-6 (ViT)

Table 8: Common pretraining hyperparameters in the first stage.

A.3 Finetuning Details

As reported in Tables 1, 2 and 3, we mainly finetune CN-CLIP on 3 cross-modal retrieval datasets: MUGE, Flickr30K-CN, and COCO-CN. Most finetuning experiments are conducted on 32 NVIDIA A100 GPUs. The finetuning strategy and loss are consistent with those of the pretraining process. For time efficiency and full utilization of the computation resources, we set the batch size as large as possible. We implement gradient checkpointing in the finetuning process of CN-CLIPViT-L/14@336px and CN-CLIPViT-H/14 for a larger batch size. Table 9 shows the specific settings of batch size, peak learning rate, maximum epochs, and warmup iterations in the finetuning process. We set the other hyperparameters to be the same as those in pretraining by default. We save the model parameters at the end of each epoch. For MUGE, we report the best results on the validation set. For Flickr30K-CN and COCO-CN, we choose the checkpoint with the best performance on the validation set and report the results on the test set.

A.4 Cross-modal Retrieval with Longer Texts

The results reported in Section 3.1.2 demonstrate the excellent cross-modal retrieval capability of Chinese CLIP. Note that the average text lengths of MUGE, Flickr30K-CN, and COCO-CN are 7.4, 19.7, and 16.8, respectively. We also conduct finetuning experiments on the ICR (Xie et al., 2022) dataset, with an average text length of 45.3. Experimental results are shown in Table 10. Since the texts in the ICR dataset are longer, we set the maximum text length to 128 for finetuning. The results show that Chinese CLIP achieves state-of-the-art performance in cross-modal retrieval tasks with longer texts.

A.5 Details About Experiments on Zero-shot Image Classification

We present the data statistics and metrics of the 20 image classification datasets of the track ICinW in the ELEVATER benchmark in Table 11. For the adaptation of Chinese CLIP to the English-native benchmark, we apply a series of preprocessing strategies. Specifically, we translate the text descriptions of the labels and the templates for manual prompts to Chinese. For example, the labels in CIFAR-10 include “car, dog, ...”, and we manually translate the words into Chinese. There are also particular cases, such as the labels in FGVC-Aircraft (Maji et al., 2013), which are difficult to translate or transliterate. We search the names on Google and figure out the best Chinese name for each label. Be that as it may, we cannot guarantee that we have the best Chinese translation, and more importantly, it is still hard for the Chinese pretrained model to understand some of the concepts, which may lead to unsatisfactory performance in the related downstream tasks. As to the templates, for some datasets we use our translation of the templates provided by the ELEVATER toolkit,9 and for the others we use the translation of the templates from OpenAI CLIP.

We present the experimental results of all Chinese CLIP models on zero-shot image classification in Table 12. It can be found that scaling the model size consistently brings improvements in model performance. The predictable improvements from scaling Chinese CLIP indicate that we can further scale up the model for better performance in future work. However, the tiny-size CN-CLIPRN50 still performs markedly worse than the ViT variants, which are significantly larger. This shows that there is still much room for the small model to improve, and the knowledge transfer of CLIP from large models to small models should be an important research topic in multimodal representation learning.

9 https://github.com/Computer-Vision-in-the-Wild/Elevater_Toolkit_IC
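To illustrate how the translated labels and templates are used, here is a minimal zero-shot classification sketch in the spirit of the procedure above. The model/tokenizer interface, the two example templates, and the label names are hypothetical placeholders; the prompt-ensembling step simply averages the normalized text embeddings of all templates per class.

```python
import torch

# Illustrative translated templates and labels (not the full prompt sets).
TEMPLATES = ["一张{}的照片。", "一幅{}的图片。"]   # "a photo of a {}.", "a picture of a {}."
LABELS_ZH = ["汽车", "狗", "猫", "飞机"]           # "car", "dog", "cat", "airplane"

@torch.no_grad()
def build_classifier(model, tokenize, device="cuda"):
    """Average the normalized text embeddings of all prompt templates per class."""
    weights = []
    for name in LABELS_ZH:
        texts = tokenize([t.format(name) for t in TEMPLATES]).to(device)
        emb = model.encode_text(texts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        weights.append(emb.mean(dim=0))
    w = torch.stack(weights, dim=0)
    return w / w.norm(dim=-1, keepdim=True)            # [num_classes, D]

@torch.no_grad()
def zero_shot_predict(model, images, classifier):
    """images: preprocessed image batch; returns predicted class indices."""
    img = model.encode_image(images)
    img = img / img.norm(dim=-1, keepdim=True)
    return (img @ classifier.t()).argmax(dim=-1)
```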
Model Batch size (MUGE / Flickr / COCO) Peak learning rate (MUGE / Flickr / COCO) Maximum epochs (MUGE / Flickr / COCO) Warmup iterations (MUGE / Flickr / COCO)
CN-CLIPRN50 24,576 24,576 24,576 5e-5 6e-5 5e-5 60 30 40 100 20 6
CN-CLIPViT-B/16 12,800 7,680 12,800 2e-5 5e-5 5e-5 20 16 30 40 20 6
CN-CLIPViT-L/14 4,096 4,096 4,096 3e-5 2e-5 6e-5 20 16 18 100 60 9
CN-CLIPViT-L/14@336px 8,192 8,192 8,192 2e-5 2e-5 4e-5 20 18 18 100 20 2
CN-CLIPViT-H/14 20,480 4,096 5,120 2e-5 6e-6 2e-5 20 18 18 20 6 10

Table 9: Detailed finetuning hyperparameters of CN-CLIP models.
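Both the Stage-2 pretraining of CN-CLIPViT-H/14 and the finetuning of the two largest models rely on gradient checkpointing to fit larger batches into GPU memory. The wrapper below is a minimal sketch of the idea using torch.utils.checkpoint; the class name and the assumption that the transformer blocks live in an nn.ModuleList are ours, not the released code.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlocks(torch.nn.Module):
    """Recompute block activations during backward instead of storing them,
    trading extra compute for a smaller activation memory footprint."""

    def __init__(self, blocks: torch.nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for blk in self.blocks:
            if self.training:
                x = checkpoint(blk, x)   # blk(x) is recomputed in backward
            else:
                x = blk(x)
        return x
```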

Model Text-to-Image (R@1 / R@5 / R@10) Image-to-Text (R@1 / R@5 / R@10) MR
R2D2ViT-B 42.2 69.4 77.8 43.4 69.8 78.4 63.5
R2D2ViT-L/14 60.7 82.0 86.9 61.5 82.9 87.7 77.0
CN-CLIPViT-B/16 55.4 79.0 85.2 56.6 79.5 85.6 73.5
CN-CLIPViT-L/14 61.6 83.6 89.0 62.5 83.9 89.1 78.3

Table 10: Finetuning results on ICR dataset. We report the performance of baselines, CN-CLIPViT-B/16 and
CN-CLIPViT-L/14 on text-to-image and image-to-text retrieval.
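For reference, the R@K numbers in Tables 10 and 14 are standard retrieval recalls, and MR is the average of the reported recall values. Below is a minimal sketch of computing text-to-image recall@K from a similarity matrix; it assumes a one-to-one ground-truth pairing, which is a simplification of the actual benchmark protocols.

```python
import torch

def recall_at_k(sim: torch.Tensor, ks=(1, 5, 10)):
    """sim: [num_texts, num_images] similarity matrix, where text i is assumed
    to match image i. Returns recall@K in percent for each K."""
    ranks = sim.argsort(dim=-1, descending=True)            # ranked image ids per text
    gt = torch.arange(sim.size(0)).unsqueeze(-1)
    hit_pos = (ranks == gt).float().argmax(dim=-1)          # rank of the positive image
    return {f"R@{k}": (hit_pos < k).float().mean().item() * 100 for k in ks}

# MR is then the mean of the reported recalls, e.g.
# scores = recall_at_k(sim); mr = sum(scores.values()) / len(scores)
```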

A.6 Deployment

Chinese CLIP can be deployed as ONNX-based10 and TensorRT-based11 models, enabling faster text and vision representation generation (especially for online inference). In this section, we provide more details on the model conversion, as well as the performance improvement.

Specifically, we employ the ONNX module in PyTorch with the ONNXMLTOOLS12 package to convert Chinese CLIP PyTorch models to ONNX-based models in FP16 precision. With the support of the ONNXRUNTIME-GPU13 package, the ONNX-based models are able to run inference on NVIDIA GPUs. The TensorRT package enables the TensorRT-based models to be obtained from the ONNX-based models and provides the GPU inference runtime. Our TensorRT-based models are also in FP16 precision.

We benchmark the PyTorch-implemented Chinese CLIP models against the converted ONNX-based and TensorRT-based models using a server with a single NVIDIA T4 GPU. The server contains 16 Intel Xeon (Skylake) Platinum 8163 CPU cores with 64GB memory. For each model, we run inference for the vision and text representations for 100 batches and compute the average time. Simulating the scenario of online deployment, we use a batch size of 1. All the models run inference in FP16 precision. Table 13 shows the comparison of inference time costs. For almost all the model scales, the ONNX-based and TensorRT-based models have better inference speed than the natively PyTorch-implemented Chinese CLIP models, especially at the smaller model sizes. For vision representation inference, the TensorRT-based models are around 1.3 (CN-CLIPViT-H/14) to 9.5 (CN-CLIPRN50) times as fast as the PyTorch-based models. For text representation inference, the TensorRT-based models are around 6.2 (CN-CLIPViT-H/14) to 8.2 (CN-CLIPViT-L/14) times as fast as the PyTorch counterparts.

We also evaluate the quality of the ONNX-based and TensorRT-based model representations by measuring their zero-shot performance on the MUGE retrieval dataset. Table 14 provides the experimental zero-shot results, which show that the converted ONNX-based and TensorRT-based models keep the quality of the vision and text representations well, with no more than 0.1 MR degradation in retrieval performance.

10 https://onnx.ai/
11 https://developer.nvidia.com/tensorrt
12 https://github.com/onnx/onnxmltools
13 https://onnxruntime.ai/docs/install
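As a rough sketch of what this conversion path can look like, the snippet below exports an image tower to ONNX with the ONNX module in PyTorch and runs it with onnxruntime-gpu. The function names, file path, input/output names, and opset version are illustrative assumptions; the FP16 conversion (e.g., via onnxmltools) and the TensorRT engine build (e.g., with the trtexec tool) are omitted.

```python
import torch
import onnxruntime as ort

def export_image_encoder(image_tower: torch.nn.Module,
                         path: str = "cn_clip_image_encoder.onnx",
                         resolution: int = 224) -> None:
    """Export an image tower (e.g., the visual branch of a CLIP-style model) to ONNX."""
    image_tower.eval()
    dummy = torch.randn(1, 3, resolution, resolution)
    torch.onnx.export(
        image_tower, (dummy,), path,
        input_names=["image"], output_names=["image_feature"],
        dynamic_axes={"image": {0: "batch"}},   # allow variable batch sizes
        opset_version=14,
    )

def run_onnx_on_gpu(path: str, images: torch.Tensor):
    """Run the exported graph with onnxruntime-gpu and return the features."""
    sess = ort.InferenceSession(path, providers=["CUDAExecutionProvider"])
    return sess.run(None, {"image": images.numpy()})[0]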
Dataset #Labels Test Size Metric
Caltech-101 (Fei-Fei et al., 2004) 101 6,084 Mean-per-class
CIFAR-10 (Krizhevsky et al., 2009) 10 10,000 Accuracy
CIFAR-100 (Krizhevsky et al., 2009) 100 10,000 Accuracy
Country-211 (Radford et al., 2021) 211 21,100 Accuracy
DTD (Cimpoi et al., 2014) 47 1,880 Accuracy
EuroSAT (Helber et al., 2019) 10 5,000 Accuracy
FER-2013 (Radford et al., 2021) 7 3,589 Accuracy
FGVC-Aircraft (Maji et al., 2013) 100 3,333 Mean-per-class
Food-101 (Bossard et al., 2014) 101 25,250 Accuracy
GTSRB (Stallkamp et al., 2011) 43 12,630 Accuracy
Hateful-Memes (Kiela et al., 2020) 2 500 ROC AUC
KITTI-Distance (Fritsch et al., 2013) 4 711 Accuracy
MNIST (Deng, 2012) 10 10,000 Accuracy
Oxford Flowers-102 (Nilsback and Zisserman, 2008) 102 6,149 Mean-per-class
Oxford-IIIT Pets (Parkhi et al., 2012) 37 3,669 Mean-per-class
PatchCamelyon (Veeling et al., 2018) 2 32,768 Accuracy
Rendered-SST2 (Radford et al., 2021) 2 1,821 Accuracy
RESISC-45 (Cheng et al., 2017) 45 25,200 Accuracy
Stanford-Cars (Krause et al., 2013) 196 8,041 Accuracy
Pascal VOC-2007 (Everingham et al., 2010) 20 4,952 11-point mAP

Table 11: Details of the image classification datasets in the ELEVATER benchmark.

Dataset CN-CLIPRN50 CN-CLIPViT-B/16 CN-CLIPViT-L/14 CN-CLIPViT-L/14@336px CN-CLIPViT-H/14
Caltech-101 77.3 84.9 88.5 88.8 90.6
CIFAR-10 72.7 92.0 94.9 94.1 96.0
CIFAR-100 40.6 64.4 75.1 73.5 79.7
Country-211 7.7 15.2 21.0 25.4 25.3
DTD 36.9 43.6 44.2 43.8 51.2
EuroSAT 27.0 46.9 56.9 50.7 52.0
FER-2013 21.9 47.2 54.6 55.1 49.2
FGVC-Aircraft 5.4 12.8 16.0 17.1 26.2
Food-101 39.8 62.4 69.4 73.9 74.6
GTSRB 22.3 28.4 37.3 35.5 38.5
Hateful-Memes 50.3 56.2 53.4 52.8 54.7
KITTI-Distance 30.2 33.5 49.9 49.8 39.1
MNIST 50.2 67.6 69.8 65.0 79.4
Oxford Flowers-102 30.7 52.2 62.5 64.8 68.4
Oxford-IIIT Pets 48.7 73.0 81.6 83.1 83.5
PatchCamelyon 47.7 54.0 63.5 62.9 52.4
Rendered-SST2 50.1 52.3 61.4 62.9 61.0
RESISC-45 49.3 58.7 65.2 65.8 66.9
Stanford-Cars 27.3 42.3 49.8 54.1 71.8
VOC-2007 82.1 83.3 84.5 84.9 84.9
Average 40.9 53.5 60.0 60.2 62.3

Table 12: Experimental results of the zero-shot image classification performance of models on ICinW.

Model Scale Inference Time per Sample (ms): Vision Representation (PyTorch / ONNX / TensorRT), Text Representation (PyTorch / ONNX / TensorRT)
CN-CLIPRN50 12.93 5.04 1.36 3.64 0.95 0.58
CN-CLIPViT-B/16 11.12 4.92 3.58 12.47 3.42 1.54
CN-CLIPViT-L/14 21.19 17.10 13.08 12.45 3.48 1.52
CN-CLIPViT-L/14@336px 47.11 48.40 31.59 12.24 3.25 1.54
CN-CLIPViT-H/14 35.10 34.00 26.98 23.98 6.01 3.89

Table 13: Inference speed comparisons among PyTorch, ONNX and TensorRT Chinese CLIP models.
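The latency numbers in Table 13 come from averaging over repeated single-sample batches. A minimal sketch of such a timing loop for an ONNX Runtime session is shown below; the input name, dtype, and warm-up policy are assumptions, and the PyTorch and TensorRT timings would follow the same pattern.

```python
import time
import numpy as np
import onnxruntime as ort

def avg_latency_ms(sess: ort.InferenceSession, input_name: str = "image",
                   shape=(1, 3, 224, 224), n_batches: int = 100) -> float:
    """Average per-sample inference time (ms) over n_batches batches of size 1."""
    x = np.random.rand(*shape).astype(np.float32)   # use float16 for an FP16 graph
    sess.run(None, {input_name: x})                 # warm-up run
    start = time.perf_counter()
    for _ in range(n_batches):
        sess.run(None, {input_name: x})
    return (time.perf_counter() - start) / n_batches * 1000.0
```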
Model Scale Framework Zero-shot Performance (R@1 / R@5 / R@10 / MR)
PyTorch 42.6 68.6 77.9 63.0
CN-CLIPRN50 ONNX 43.0 68.4 78.1 63.2
TensorRT 42.8 68.5 78.0 63.1
PyTorch 52.1 76.7 84.4 71.1
CN-CLIPViT-B/16 ONNX 52.0 76.8 84.3 71.1
TensorRT 52.0 76.8 84.2 71.0
PyTorch 56.3 79.8 86.2 74.1
CN-CLIPViT-L/14 ONNX 56.4 80.0 86.3 74.2
TensorRT 56.3 79.9 86.5 74.2
PyTorch 59.0 81.4 87.8 76.1
CN-CLIPViT-L/14@336px ONNX 59.2 81.4 87.6 76.1
TensorRT 59.2 81.7 87.5 76.1
PyTorch 63.0 84.1 89.2 78.8
CN-CLIPViT-H/14 ONNX 63.1 84.1 89.0 78.8
TensorRT 63.1 84.2 89.1 78.8

Table 14: Zero-shot results on MUGE-Retrieval dataset among PyTorch, ONNX and TensorRT Chinese CLIP
models.

Figure 4: Retrieval results of the query “a cat with glasses” in Chinese.


Figure 5: Retrieval results of the query “Spring Festival couplet” in Chinese.
