SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage (ICCV 2023)
Song Park* Sanghyuk Chun* Byeongho Heo Wonjae Kim Sangdoo Yun
NAVER AI Lab
* Equal contribution
Abstract

We need billion-scale images to achieve more general- […]

[Figure 1: ImageNet-1k top-1 accuracy versus training-data storage size for SeiT and the comparison storage-reduction methods; only scattered axis ticks and data labels survived extraction.]

1. Introduction
[…] deep vision models are vulnerable to imperceptible high-frequency perturbations [23, 42, 17] or unreasonably local areas [22, 6, 59], implying that deep vision models attend too much to imperceptible details instead of the true properties of objects. Therefore, we can expect to still achieve a high-performing vision model with a reduced image dataset that removes these imperceptible details.

There are two major directions for storage-efficient vision model training. The first direction aims to reduce the total number of data points by discarding less important samples [45, 47, 32] or by synthesizing more "condensed" images than natural images [78, 77]. However, this approach shows a significant performance drop compared to the full dataset (the blue and yellow lines in Fig. 1) or cannot be applied to large-scale datasets due to its high complexity. Also, as the sampled or synthesized images are still normal images, these methods still suffer from an inefficient compression ratio for expressing imperceptible details. Furthermore, these methods need to compute the importance score or the sample-wise gradient of each sample by training models on the full dataset, which makes them inapplicable to unseen datasets or newly incoming data streams.

The other approach reduces the size of each image while keeping the total number of images, for example, by learning a more efficient compression method [7, 8]. However, neural compression methods have mostly been studied on extremely small-scale datasets (e.g., 24 images [15] or 100 images [4]), and their generalizability to large-scale datasets is still an open problem. Moreover, the goal of neural compression is to compress an image and recover the original image as perfectly as possible, not to extract the most discriminative features for object recognition tasks. Reflecting these limitations, no neural compression method has been used to compress large-scale datasets like ImageNet [55] for training deep vision models.

Due to the difficulty of using neural compression in practice, practitioners have attempted to reduce storage usage by controlling image quality. For example, the LAION dataset [58, 57] stores each image at 256 × 256 resolution, which takes up only 36% of the storage of ImageNet images (469 × 387 resolution on average). Similarly, adjusting the JPEG compression quality can reduce the overall storage. As shown in Fig. 1 (green and purple lines), these approaches work well in practice compared to sampling-based methods. However, these methods have a limited compression ratio: if the compression ratio becomes less than 25%, the performance drops significantly. By adjusting the image resolution to a 4% compression ratio and the JPEG quality to a 7% compression ratio, we achieve 63.3% and 67.8% top-1 accuracies, respectively. In contrast, our approach achieves 74.0% top-1 accuracy with only a 1% compression ratio.

All shortcomings of the previous methods originate from the fact that too many imperceptible bits are assigned to store a digital image, which is misaligned with our target task. Our approach overcomes this limitation by storing images as tokens rather than pixels, using pre-trained vision tokenizers such as the VQGAN [21] or ViT-VQGAN tokenizer [72]. Introducing Storage-efficient Vision Training (SeiT), we convert each image to 32 × 32 tokens. The number of possible values each token can take (the codebook size) is 391, so each tokenized image takes only 1.15KB to store (assuming that 391 values can be expressed in 9 bits). Storing the entire ImageNet dataset, which occupies 140GB in pixel-based storage, thus costs less than 1.5GB. We train Vision Transformer (ViT) models on our tokenized images with minimal modifications. First, a 1024-length tokenized image is converted to a 32 × 32 × 32 tensor using the pre-trained 32-dimensional codebook vectors from ViT-VQGAN. Next, we apply random resized crop (RRC) to the tensor to get a 32 × 28 × 28 tensor. Then, to convert the tensor into a form that ViT can handle, we introduce the Stem-Adapter module, which converts the RRC-ed tensor into a tensor of size 768 × 14 × 14, the same as the first-layer input of ViT after the stem layer. Because image-based augmentations are not directly applicable to tokens, we propose simple token-specific augmentations, including Token-EDA (inspired by easy data augmentation (EDA) [71] for language), Emb-Noise, and Token-CutMix (inspired by CutMix [73]). In our experiments, we achieve 74.0% top-1 accuracy with 1.36GB of token storage, where the full image storage requires 140GB to achieve 81.8% [65].
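For concreteness, the per-image and total storage estimates quoted above work out as follows; the image count of 1,281k is the ImageNet-1k training-set size reported in Table 2, and the rounding is approximate:

\[
1024 \ \text{tokens} \times 9 \ \text{bits} = 1152 \ \text{B} \approx 1.15 \ \text{KB per image},
\qquad
1152 \ \text{B} \times 1{,}281\text{k images} \approx 1.48 \ \text{GB} < 1.5 \ \text{GB}.
\]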
SeiT has several advantages over previous storage-efficient methods. First, as we use a frozen pre-trained tokenizer that only requires forward operations to extract tokens from images, we do not need any additional optimization for compressing a dataset, such as importance-score-based sampling [32], image synthesis methods [78, 77], or neural compression [7, 8]. Hence, SeiT is directly applicable to newly incoming data streams. Second, unlike previous works that use pre-trained feature extractors (e.g., HOG [19] or Faster R-CNN [51, 2]), SeiT can use the same architecture as pixel-based approaches with only minimal modifications to the stem layer, as well as their carefully tuned optimization settings, such as DeiT [65]. This becomes a huge advantage when using SeiT as an efficient pre-training method; we can achieve 82.6% top-1 accuracy by fine-tuning the token pre-trained model with images. Moreover, applying input augmentations to feature-extractor-based approaches is not straightforward, limiting their generalizability. Finally, SeiT achieves a significant compression ratio, namely about 1% for ImageNet.

We show the effectiveness of SeiT in three image classification scenarios: (1) the storage-efficient ImageNet-1k benchmark, (2) storage-efficient large-scale pre-training, and (3) continual learning.
The overview of the storage-efficiency results is shown in Fig. 1: SeiT outperforms the comparison methods by a significant gap at the same storage size, reaching 74.0% accuracy on ImageNet with under 1% of the original storage, whereas the comparison methods need 40% (uniform sampling, C-score sampling [32]), 6% (adjusting image resolution), and 8% (adjusting JPEG quality) of the original storage to achieve similar performance. We also demonstrate that SeiT can be applied to large-scale pre-training for an image-based approach; we pre-train a ViT-B/16 model on the tokenized ImageNet-21k (occupying only 14.1GB) and fine-tune the ViT model on the full-pixel ImageNet-1k. By using slightly more storage (156GB vs. 140GB), our storage-efficient pre-training strategy shows 82.8% top-1 accuracy, whereas full-pixel ImageNet-1k training shows 81.8%. Finally, we observe that our token-based approach significantly outperforms its image-based counterpart in the continual learning scenario [52] by storing more data samples in the same amount of memory than full-pixel images.

Contributions. (1) We compress an image into 1024 discrete tokens using a pre-trained visual tokenizer. By applying a simple lossless compression to the tokens, we achieve only 0.97% of the storage size of images stored in pixels. (2) We propose the Stem-Adapter module and augmentation methods for tokens, such as Token-RRC, Token-CutMix, Emb-Noise, and Token-EDA, to enable ViT training with minimal changes to the protocol and hyperparameters of existing ViT training. (3) Our storage-efficient training pipeline, named Storage-efficient Vision Training (SeiT), shows great improvements in the low-storage regime. With only 1% of the storage size, SeiT achieves 74.0% top-1 ImageNet-1k validation accuracy. (4) We additionally show that SeiT can be applied to a storage-efficient pre-training strategy and to continual learning tasks.

2. Related Works

Importance sampling for efficient training. Sampling-based methods [45, 47, 14, 32] aim to identify a compact yet representative subset of the training dataset that satisfies the original objectives for efficient model training. This is usually achieved by exploring the early training stage [47], constructing a proxy model [14], or utilizing the consistency score (C-score) [32]. However, the empirical performance gap between sampling-based methods and the baseline of random selection is insignificant, particularly on large-scale datasets like ImageNet-1k (see Fig. 1). We believe that preserving the diversity of data points in a dataset is crucial, and therefore we endeavor to maintain the number of data points instead of pruning them.

Dataset distillation. Dataset distillation [70] aims to generate a compact dataset by transferring the knowledge of the training dataset into a smaller dataset. Recent works [78, 77, 38, 56, 54] have shown that the synthesized data can be effective for efficient model training, especially in scenarios such as continual learning [52]. However, due to their high complexity, they have not yet demonstrated successful cases on large-scale datasets such as ImageNet-1k. We recommend the survey paper [39] for curious readers.

Neural compression. Image compression algorithms have improved with the use of neural network training to minimize quality loss under lossy compression. The representative learned image compression methods are based on VAEs [7, 8]. The compressor encodes an image into discrete latent codes, and the codes can be decoded back into the image with small losses. Recent studies [12, 33] have utilized the self-attention mechanism [68] with heavy CNN architectures to demonstrate superior compression power compared to conventional methods such as JPEG. However, learned image compression targets high-quality images with complex and detailed contexts, which are distant from ImageNet samples. Thus, it is challenging to apply these methods to compress ImageNet for ViT training.

Learning with frozen pre-extracted features. Using extracted visual features for a model has been widely adopted in the computer vision field. It shows reasonable performance at a low computational cost compared to pixel-based visual encoders. For example, the Youtube-8M [1] dataset consists of frame features extracted from Inception [61] instead of raw pixels, allowing efficient video model training [44, 10] with frozen frame features. Pre-extracted features have also been widely used for tasks that need higher-level knowledge than pixel-level understanding. For example, frozen CNN features [40] or bottom-up and top-down (BUTD) [63, 2] features [34] have been a popular choice for vision-and-language models that aim to understand complex fused knowledge between two modalities, e.g., visual question answering [3, 24]. These approaches show slightly worse performance than end-to-end training from raw inputs without pre-extracted features [35, 49], but they show high training efficiency in terms of computation.

However, these methods need feature-specific modules to handle the frozen features and specialized optimization techniques rather than the standard optimization methods of pixel-based approaches. Furthermore, some fundamental augmentations, such as random resized crop (RRC), are not applicable to the frozen features, resulting in inferior generalizability. SeiT has major advantages over these methods in that it uses almost the same training method as pixel-based ViT (e.g., DeiT [65]), and yet it can significantly reduce the storage space.

3. Token-based Storage-Efficient Training

In this section, we propose Storage-efficient Vision Training (SeiT).
Figure 2. Tokenization. The input image is resized to 256 × 256 and then divided into non-overlapping n² patches. The patches are fed into the ViT-VQGAN encoder, which produces a sequence of d_c-dimensional vectors from the patches. Finally, the tokens are generated by mapping each vector to the nearest code in a pre-trained codebook. We used 32 for both n and d_c in this paper.

Format | Encoding               | Storage size | Avg. size per image
Pixels | uint8 (uncompressed)   | 1471.2 GB    | 1.14 MB
Pixels | JPEG (baseline)        | 140.0 GB     | 109.3 kB
Tokens | uint16 (uncompressed)  | 2.50 GB      | 2.0 kB
Tokens | Ours (8 bits encoding) | 1.54 GB      | 1.26 kB
Tokens | Ours + Huffman coding  | 1.36 GB      | 1.11 kB
Tokens | Theoretical optimum    | 1.32 GB      | 1.08 kB

Table 1. Storage size of the ImageNet-1k training dataset for different formats and encodings. uint8 and uint16 denote the uncompressed version of each data format. The theoretical optimum is estimated by assuming the token population is uniform.
SeiT aims to learn a high-performing vision classifier at scale (e.g., ImageNet-1k [55]) with a small storage size (e.g., under 1% of the original size), minimal changes to the training strategy (e.g., the highly optimized training strategy of [65]), and a minimal sacrifice of accuracy. SeiT consists of two parts: (1) preparing the compressed token dataset and (2) training a model using the tokens.

3.1. Preparing the token dataset

We extract tokens using the ImageNet-trained ViT-VQGAN tokenizer [72] because it shows the best reconstruction quality among the tokenizers trained only on ImageNet-1k (see Appendix). As shown in Fig. 1 and the Appendix, our approach performs even better if a stronger tokenizer trained with an extra dataset, e.g., the OpenImages-trained VQGAN tokenizer [21], is used. In the main paper, however, we use the ViT-VQGAN tokenizer for a fair comparison with other storage-efficient methods in terms of the training dataset.

Fig. 2 shows the overview of the dataset preparation pipeline. We first resize the entire ImageNet dataset to 256 × 256. Then, each resized image is divided into non-overlapping 8 × 8 image patches. Finally, we encode each patch into a 32-dimensional vector and assign a code index by finding the nearest codeword in the pre-trained codebook. Here, we only use 391 codewords out of the 8192 original codewords because we found that only 391 codewords are used for the ImageNet training dataset. As a result, each image is converted to 32 × 32 tokens, where each token belongs to [0, ..., 390]. We also store the codebook of ViT-VQGAN (a 32 × 391 matrix) to re-use the knowledge of the codebook for better performance.
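As a rough illustration of this nearest-codeword assignment, the sketch below assumes the frozen encoder returns one 32-dimensional vector per 8 × 8 patch (a 32 × 32 grid for a 256 × 256 image) and that the codebook is available as a (391, 32) matrix; `encoder` is a stand-in, not the actual ViT-VQGAN API.

```python
import torch

@torch.no_grad()
def tokenize(images: torch.Tensor, encoder, codebook: torch.Tensor) -> torch.Tensor:
    """Map a batch of 256x256 images to (B, 32, 32) token-index maps."""
    feats = encoder(images)                               # assumed output: (B, 1024, 32) patch vectors
    cb = codebook.unsqueeze(0).expand(feats.size(0), -1, -1)
    dists = torch.cdist(feats, cb)                        # (B, 1024, 391) distances to codewords
    tokens = dists.argmin(dim=-1)                         # nearest codeword index per patch
    return tokens.view(images.size(0), 32, 32)
```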
In theory, as our token indices belong to [0, 390], the optimal bit length to store the tokens is log2(391) = 8.61 bits¹ by the source coding theorem [16]. Therefore, the optimal storage size of an image will be 1.08 kB². However, in practice, we cannot store tokens in 8.61 bits because the commonly used data types use the Byte as the minimal unit, e.g., 1 Byte (uint8) or 2 Bytes (uint16). To compress the required bits per token to less than 2 Bytes, we propose a simple yet efficient encoding for the tokens. First, we assign each token index following the token popularity, i.e., the most frequent token is assigned index 0 and the least frequent token is assigned index 390. Then, we break up token indices larger than 255 into two elements as follows:

    i → [i]        if i < 255
    i → [255, i]   if i ≥ 255        (1)

We store multiple tokens in a file to reduce the required storage as much as possible. However, because our encoding makes the length of each token variable, the naive decoding process needs O(n) complexity, where n is the number of tokens encoded by Eq. (1). We solve this problem by simply storing the start index of each image. The index storage only requires 9.8 MB for the entire ImageNet training dataset, but it makes the decoding process O(1) and parallelizable. Pseudo-code for the proposed encoding and decoding is given in Appendix A.1.

Our simple encoding strategy reduces the overall storage size by 40% compared to the naive uint16 data type, as shown in Table 1. Here, as the original baseline storage also employs a compression algorithm, namely JPEG (see the first and second rows of Table 1), we also apply a simple compression algorithm, Huffman coding [29]. After applying Huffman coding to our token storage, we achieve a nearly optimal storage size per image (1.11 kB vs. 1.08 kB). We empirically observe that the cost of the entire decoding process, including Huffman decoding, is almost negligible: while full-pixel processing requires 0.84s per 100 images, our approach only needs 0.07s. As a result, full-pixel and SeiT take 5m 40s and 5m 12s for one training epoch, respectively. In the remaining part of this paper, we use the compressed version of our token dataset unless specified otherwise.

¹ Following the empirical population of the tokens, the "empirical" optimal bit length is 8.54, computed as H(p) = −Σᵢ pᵢ log₂ pᵢ. However, in the rest of the paper, we assume the population is uniform for simplicity.
² We have 1.08 kB = bits per token (8.61) × token length (1024) / bits per Byte (8). If we follow the actual distribution, it becomes 1.07 kB.
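A minimal sketch of the byte encoding and O(1)-indexable decoding described above is given below (the paper's own pseudo-code is in its Appendix A.1 and is not reproduced here). Storing the escaped value as i − 255 so that both elements fit into a uint8 is an assumption; Eq. (1) writes the pair as [255, i]. Huffman coding, which is applied on top of this byte stream, is omitted.

```python
import numpy as np

def encode_tokens(tokens: np.ndarray) -> np.ndarray:
    """Encode a 1-D array of popularity-ranked token indices (0..390) into uint8 bytes."""
    out = []
    for i in tokens:
        if i < 255:
            out.append(int(i))                 # common case: one byte
        else:
            out.extend((255, int(i) - 255))    # escape marker + offset (assumed offset form)
    return np.asarray(out, dtype=np.uint8)

def decode_tokens(codes: np.ndarray) -> np.ndarray:
    """Inverse of encode_tokens."""
    out, k = [], 0
    while k < len(codes):
        if codes[k] == 255:
            out.append(255 + int(codes[k + 1]))
            k += 2
        else:
            out.append(int(codes[k]))
            k += 1
    return np.asarray(out, dtype=np.int64)

# Storing per-image start offsets makes random access O(1) and parallelizable.
images = [np.random.randint(0, 391, size=32 * 32) for _ in range(4)]
encoded = [encode_tokens(t) for t in images]
starts = np.cumsum([0] + [len(e) for e in encoded])   # start index of each image
blob = np.concatenate(encoded)

j = 2
recovered = decode_tokens(blob[starts[j]:starts[j + 1]])
assert np.array_equal(recovered, images[j])
```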
[Figure 3: the token augmentation and Stem-Adapter pipeline — tokens are loaded from storage, Token-EDA is applied, the tokens are converted to one-hot form, random resized crop and CutMix are applied, Emb-Noise is added, and the resulting tensor is reshaped and fed to the Stem-Adapter.]
3.2. Training classifiers with tokens

Training a classifier with tokenized images is not trivial. For example, an input token map has 32 × 32 dimensions, but a conventional image input has 3 × 224 × 224. Furthermore, strong image-level augmentations (e.g., RandAugment [18], Gaussian blur [66]) have become crucial for large-scale vision classifiers; however, these operations cannot be directly applied to the tokens. One possible direction is to decode tokens into pixel-level images during every forward computation. However, this would impose an additional computational load on the network. Instead, we propose simple yet effective token-level augmentations and a simple Stem-Adapter module to train a vision classifier directly on the tokens with minimal modifications and small sacrifices.

3.2.1 Token Augmentations

Token-EDA. We utilize EDA [71], designed for language models, to augment our token data. EDA originally involves four methods: Synonym Replacement (SR), Random Insertion (RI), Random Swap (RS), and Random Deletion (RD). However, we only adopt SR and RS because the others do not maintain the number of tokens, which is not compatible with the ViT training strategy. For SR, we define the synonyms of a token as the five tokens that have the closest Euclidean distance in the ViT-VQGAN codebook space. Then, each token is randomly replaced with one of its synonyms with probability p_s during training. For RS, we randomly select two same-sized squares from a 32 × 32 token map and swap the tokens inside them with each other, with probability p_r. We use 0.25 for both p_s and p_r in SeiT.

Token-RRC and Token-CutMix. In addition to EDA, we apply Random Resized Crop (RRC) and CutMix [73] to tokens. For RRC, we adopt a standard ImageNet configuration with a scale of (0.08, 1) and an aspect ratio of (3/4, 4/3). To enable interpolation, we first convert the original 32 × 32 tokens to one-hot form. Then, we apply the random cropping to these one-hot tokens, which are subsequently resized to 28 × 28 using bicubic interpolation. After RRC, the one-hot tokens are converted to a 32 × 28 × 28 tensor using the pre-trained codebook vectors from ViT-VQGAN, where 32 is the size of a pre-trained code vector. Note that tokens that are no longer in one-hot form due to interpolation are converted to mixed codebook vectors following their values. CutMix is then applied to these tensors, whereby a patch is randomly selected from one tensor and replaced with a patch from another while maintaining the channel dimension.

Adding channel-wise noise. We also developed Emb-Noise, a token augmentation method that mimics color-changing image augmentations such as color jittering. Inspired by the fact that each channel in an image represents a specific color, we first generate noise of length 32 and add it to each channel of the converted 32 × 28 × 28 tensor, and then apply full-size i.i.d. noise (i.e., noise of size 32 × 28 × 28) to the tensor. All of the noise is sampled from a normal distribution. We empirically demonstrate that this method brings a significant performance improvement despite its simplicity. Moreover, we found that adding channel-wise noise to the tokens of ViT-VQGAN, the tokenizer we used, effectively changes the colors of the decoded images, unlike adding Gaussian noise over all dimensions. Example images decoded by ViT-VQGAN are presented in Appendix A.2.
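As a rough illustration (not the authors' implementation), the sketch below implements Token-EDA synonym replacement and Emb-Noise under stated assumptions: the codebook is a (391, 32) tensor, the five nearest Euclidean neighbors with p_s = 0.25 follow the text, but the Emb-Noise standard deviations are not given above and are placeholders; Token-RRC and Token-CutMix are omitted.

```python
import torch

def token_eda_sr(tokens: torch.Tensor, codebook: torch.Tensor, p_s: float = 0.25) -> torch.Tensor:
    """Token-EDA Synonym Replacement on a batch of (B, 32, 32) token-index maps."""
    dists = torch.cdist(codebook, codebook)             # (391, 391) pairwise distances
    dists.fill_diagonal_(float("inf"))                  # exclude the token itself
    synonyms = dists.topk(5, largest=False).indices     # (391, 5) nearest neighbors
    pick = synonyms[tokens, torch.randint(0, 5, tokens.shape)]
    replace = torch.rand(tokens.shape) < p_s
    return torch.where(replace, pick, tokens)

def emb_noise(x: torch.Tensor, channel_std: float = 0.1, iid_std: float = 0.1) -> torch.Tensor:
    """Emb-Noise on the (B, 32, 28, 28) embedding tensor obtained after Token-RRC."""
    b, c, _, _ = x.shape
    channel_noise = torch.randn(b, c, 1, 1) * channel_std     # one value per channel (color-jitter analogue)
    return x + channel_noise + torch.randn_like(x) * iid_std  # plus full-size i.i.d. noise

codebook = torch.randn(391, 32)
tokens = token_eda_sr(torch.randint(0, 391, (2, 32, 32)), codebook)
embeddings = torch.randn(2, 32, 28, 28)   # stand-in for the post-RRC codebook embeddings
embeddings = emb_noise(embeddings)
```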
3.2.2 Stem-Adapter module

As the tokens have a smaller size than images, they cannot be directly used as the input of networks. We introduce a Stem-Adapter that converts the augmented tensor into the input of ViT/16 with minimal modifications to the network. Specifically, the Stem-Adapter module converts the 32 × 28 × 28 pre-processed tokens into 768 × 14 × 14, the same as the input of the transformer blocks of ViT after the stem layer. We implement the Stem-Adapter module as a convolutional layer with a kernel size of 4 and a stride of 2. This allows the module to capture the spatial relationships of adjacent tokens and produce a tensor that can be used as input to ViT. The comparison among different Stem-Adapter architectures is included in Section 4.3.
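A minimal sketch of such a Stem-Adapter is given below. The padding of 1 is an assumption — the text specifies only a kernel size of 4 and a stride of 2, and padding 1 is what maps 28 × 28 to 14 × 14 — and the flattening to a (196, 768) token sequence mirrors the standard ViT stem.

```python
import torch
import torch.nn as nn

class StemAdapter(nn.Module):
    """Map the (B, 32, 28, 28) augmented token tensor to the ViT-B/16 post-stem shape."""
    def __init__(self, code_dim: int = 32, embed_dim: int = 768):
        super().__init__()
        # kernel 4, stride 2 as described; padding 1 assumed so that 28x28 -> 14x14
        self.proj = nn.Conv2d(code_dim, embed_dim, kernel_size=4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)   # (B, 196, 768) patch tokens for the transformer

x = torch.randn(2, 32, 28, 28)
print(StemAdapter()(x).shape)  # torch.Size([2, 196, 768])
```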
Method | Reduction factor | Dataset storage size | # of images | Avg. size per image | Top-1 Acc.
Full-pixels | 100% | 140.0 GB | 1,281 k | 109 kB | 81.8
Uniform random sampling | 70% | 95.7 GB | 897 k | 107 kB | 78.2
Uniform random sampling | 40% | 54.6 GB | 512 k | 107 kB | 74.0
Uniform random sampling | 20% | 27.2 GB | 256 k | 107 kB | 59.8
C-score [32] based sampling | 60% | 80.6 GB | 769 k | 105 kB | 77.5
C-score [32] based sampling | 40% | 53.3 GB | 512 k | 104 kB | 73.3
C-score [32] based sampling | 20% | 26.3 GB | 256 k | 103 kB | 65.1
Adjusting image resolution | 30% | 16.0 GB | 1,281 k | 13 kB | 78.6
Adjusting image resolution | 20% | 9.6 GB | 1,281 k | 8 kB | 75.2
Adjusting image resolution | 10% | 5.3 GB | 1,281 k | 4 kB | 63.3
Adjusting JPEG quality factor (an integer scale between 1-100 representing particular compression levels) | quality 10 | 14.0 GB | 1,281 k | 11 kB | 78.1
Adjusting JPEG quality factor | quality 5 | 11.0 GB | 1,281 k | 9 kB | 74.6
Adjusting JPEG quality factor | quality 1 | 9.3 GB | 1,281 k | 7 kB | 67.8
SeiT (ImageNet-1k-5M [74], the full dataset) | - | 7.5 GB | 5,830 k | 1 kB | 78.6
SeiT (ImageNet-1k-5M, 60% randomly sampled) | 60% | 4.5 GB | 3,498 k | 1 kB | 75.9
SeiT (ImageNet-1k, the full dataset) | - | 1.4 GB | 1,281 k | 1 kB | 74.0

Table 2. Main results. ImageNet-1k results using various data storage reduction methods are shown. We compare SeiT against reduction factors that achieve comparable performance and storage size. Note that the numbers for all reduction factors are included in Appendix B.2.
4. Experiments

In this section, we conduct various experiments to demonstrate the effectiveness of token-based training. First, we compare SeiT with four image compression methods on ImageNet-1k [55]. Next, we explore the potential of SeiT for large-scale pre-training by employing the ImageNet-21k dataset. We also provide ablation studies on the proposed token augmentation methods and the Stem-Adapter module to determine the effectiveness of each proposed element. Lastly, we evaluate a continual learning scenario on the ImageNet-100 [64] dataset to demonstrate the benefits of tokens in a limited-memory environment. The fine-grained classification results can be found in the Appendix.

4.1. ImageNet-1k classification

ImageNet-1k classification performances are summarized in Table 2 and Fig. 1. Random sampling (yellow in Fig. 1) had the most significant negative impact on performance as storage capacity decreased. Sampling by C-score [32] (blue) also resulted in a noticeable performance drop, but it performed better than random sampling when storage capacity was reduced to 10% of the original. Although both sampling-based methods led to a considerable performance drop even with a small decrease in storage, JPEG-based compression methods (green) maintained their performance until storage reached 50% of the original. When the quality was set above 50, the performance remained nearly the same as the original, even with 24.3% of the original storage usage. However, when the quality was set to 1, the performance dropped dramatically to 67.8%. Adjusting the resolution (purple) achieved better results than reducing the quality as storage became smaller, while reducing the quality performed better than reducing the resolution at relatively large storage. Despite the overall performance decline of image-based methods in low-storage environments, SeiT achieved 74.0% accuracy while using only 1% of the original storage. Furthermore, by employing ImageNet-1k-5M [74], we were able to use more token storage and achieve 78.6% accuracy at 5% of the ImageNet-1k storage size, where JPEG-based methods demonstrated performances lower than 75%. These results highlight the effectiveness of SeiT in improving performance in low-storage scenarios.

We also evaluate the SeiT model and the image-trained model on robustness benchmarks, such as added Gaussian noise or Gaussian blur, ImageNet-R [26], and adversarial attacks [42, 17], in Appendix B.5. We observe that even without strong pixel-level augmentations, SeiT shows smaller performance drops than the pixel-trained counterparts under corruptions and distribution shifts. SeiT also shows significant robustness to gradient-based attacks compared to others.

4.2. Storage-efficient token pre-training
We extract tokens from the ImageNet-21k dataset and pre-train a ViT-B/16 model on the tokenized ImageNet-21k to determine the effectiveness of tokens for large-scale pre-training. We then fine-tune the pre-trained model with both the tokenized ImageNet-1k and the full-pixel ImageNet-1k, respectively (details are in Appendix B.1). Additionally, we extend our storage-efficient pre-training to three stages, namely 21k token pre-training → 1k token pre-training → 1k image fine-tuning, following BEiT v2 [48] (details are in Appendix B.4). The results are shown in Table 3.

PT (IN-21k) | FT (IN-1k) | Storage size | Storage ratio | Acc.
- | Pixels | 140 GB | 100.0% | 81.8†
Tokens | Tokens | 16 GB | 11.1% | 81.1
Tokens | Pixels | 154 GB | 110.0% | 82.6
Tokens | Tokens → Pixels | 156 GB | 111.4% | 82.8

Table 3. Impact of storage-efficient pre-training (PT) and fine-tuning (FT). We show the scenario of storage-efficient PT: we pre-train a model with the tokenized ImageNet-21k, which has more data points, and fine-tune the model on the pixel or token ImageNet-1k dataset. † is from the original paper. "Tokens → Pixels" denotes the three-staged FT: Token 21k PT, Token 1k PT, and Pixel FT.

The use of large-scale tokens for pre-training improved not only the performance of the ImageNet-1k benchmark using tokens but also the performance with full-pixel images. Pre-training with ImageNet-21k tokens led to a 2.5% performance gain compared to using ImageNet-1k-5M tokens, using only 8GB more storage. Furthermore, our pre-training strategy improved full-pixel ImageNet-1k performance by 1.0% using only 11.4% more storage compared to the original full-pixel ImageNet-1k training. It requires only 27% of the storage of the sampling-based image pre-training strategy with a similar accuracy (410% of IN-1k, showing 82.5% accuracy), as shown in Table 4.

# PT images | ×1.35 | ×1.70 | ×2.05 | ×2.40 | ×3.10
IN-1k FT Acc | 79.1 | 81.4 | 81.0 | 80.9 | 82.5

Table 4. Sampling-based pixel PT. We show the IN-1k FT accuracies of different PTs obtained by subsampling ImageNet-1k-5M [74]. The pixel-based PT-FT strategy shows accuracy comparable to SeiT when 410% of the storage size is used (82.5 and 82.8, respectively).

4.3. Ablation study

We present an analysis of the proposed augmentation methods, Stem-Adapter architectures, and results on convolutional networks. Table 5 reports the impact of the proposed augmentations for tokens. We found that employing Token-CutMix not only stabilized the overall training procedure but also resulted in the largest performance gain (8.1%) compared to excluding it. The newly proposed methods for tokens, Emb-Noise and Token-EDA, also showed performance improvements of 1.4% and 0.3%, respectively. Interestingly, these methods not only work effectively when used individually but also achieve higher performance when used in combination (74.0%).

Token-CutMix | Token-EDA | Emb-Noise | Acc. (ViT-B)
✗ | ✗ | ✗ | 63.8
✓ | ✗ | ✗ | 71.9
✓ | ✓ | ✗ | 72.2
✓ | ✗ | ✓ | 73.3
✓ | ✓ | ✓ | 74.0

Table 5. Impact of the proposed augmentations. ImageNet-1k validation accuracies for combinations of the proposed token augmentations are shown.

We also assessed the impact of the Stem-Adapter architecture on performance in Table 6. We compared two different Stem-Adapter architectures with our design choice. Note that we used a smaller learning rate of 0.0005 for the linear Stem-Adapter because of its unstable convergence with a larger learning rate, and an input size of 14 × 14 to match the number of input patches with the convolutional Stem-Adapters. The results validate that our decision to use Conv 4×4 as the Stem-Adapter for ViT models yields the highest performance among the considered candidates.

Stem-Adapter | Linear | Conv 2×2 | Conv 4×4
Accuracy | 58.6 | 73.1 | 74.0

Table 6. Stem-Adapter architectures. We compare three Stem-Adapter architectures for ViT-B/16 on ImageNet-1k. Note that the stride of the convolution layers is set to 2.

We also investigated the applicability of SeiT to convolutional networks. The benchmark results with different architectures on ImageNet-1k are presented in Table 7. Note that token-based training only requires 1.4GB of storage, which is merely 1% of the storage required for pixel-based training. To match the size of the features after the stem layer, we used a deconvolutional Stem-Adapter for ResNet [25] models. Our findings indicate that SeiT can also be used for storage-efficient training of convolutional models.

Finally, we show the impact of the tokenizer in Appendix B.5. In summary, we observe that SeiT works well with various tokenizers, e.g., ViT-VQGAN [72] and VQGAN [21] variants. We chose ViT-VQGAN considering the trade-off between performance and storage size, and because it is solely trained on ImageNet-1k without external datasets.

4.4. Continual learning

To demonstrate the effectiveness of SeiT in memory-limited settings, we compare SeiT with full-pixel datasets in a continual learning scenario. Specifically, we employed the Reduced ResNet-18 architecture on the ImageNet-100 dataset [64] and evaluated the results following Experience Replay [52].
[Table 7 — columns: Network; pixel-based training (Acc., Storage); token-based training (Acc., Storage). Data rows not recovered. Caption fragment: "... accuracies of three architectures, including ViT-S, ResNet-50, and ResNet-18, on the ImageNet-1k benchmark."]

Figure 4. Comparisons on the continual learning task. We train two Experience Replay (ER) [52] models on the ImageNet-100 [64] dataset using the pixel dataset and the token dataset. (a) By varying the memory size while the number of tasks is fixed to 10. (b) By varying the number of tasks while fixing the memory size. (c) By increasing the epochs per task. Note that except for (c), we set the epochs per task to 1 following the original setting [52].

We observed that, when using the same memory size, SeiT is significantly more memory-efficient than images, with a storage capacity 147 times that of images. As a result, the total memory required to store the entire dataset as tokens was less than 500MB. Fig. 4 illustrates the comparison between using the token dataset and the full-pixel dataset in three different settings.

The left figure shows the performance of the token dataset and the full-pixel dataset as the memory size increases while the number of tasks is fixed to ten. SeiT outperforms the pixel dataset and shows a negligible performance drop even when the memory size decreases, as it stores sufficient data even with memory sizes below 100MB.

The center figure presents the results of changing the number of tasks with a fixed memory size of 574MB (≈ 1k images). In this case, both the token and full-pixel datasets exhibited decreased performance as the number of tasks increased. However, the performance degradation of the token dataset was less severe than that of the full-pixel dataset.

Finally, with both the memory size and the number of tasks fixed, we varied the number of times the dataset was viewed per task (the right figure). When there was only one task, the full-pixel dataset outperformed the token dataset as the number of epochs increased, consistent with the other classification benchmark results. However, when there were ten tasks, the full-pixel dataset had lower performance than the token dataset even with increased epochs, due to insufficient stored data.

4.5. Implementation details

We used a pre-trained ViT-VQGAN Base-Base [72] model for extracting tokens from the images. Extracting the tokens of the entire ImageNet-21k dataset took 1.1 hours using 64 A100 GPUs with a batch size of 2048. We conducted the ImageNet-1k benchmark experiments using the ViT-B/16 model [20, 65] with an input size of 224 × 224. For token ImageNet-1k training, we replaced the patch embedding layer of the ViT-B/16 model with the proposed Stem-Adapter module and added a global pooling layer before the final norm layer for tokens. We used a learning rate of 0.0015 with cosine scheduling and a weight decay of 0.1. The model was trained for 300 epochs with a batch size of 1024. We followed the training recipe proposed in DeiT [65] for the remaining settings except for the data augmentations. We also followed the DeiT recipe for the full-pixel ImageNet-1k training but made a few adjustments to handle the reduced datasets. We used a smaller learning rate of 0.0009 with a batch size of 1024 compared to the original value of 0.001, as we found that the original learning rate did not converge well on smaller datasets. Also, we increased the number of warm-up epochs and the total training iterations when the number of data points decreased, to ensure a fair comparison. For large-scale token pre-training and token fine-tuning, we adopted the simple augmentation strategies suggested in DeiT-III [66]; we excluded Token-EDA and replaced RRC with a simple random crop. Following the DeiT-III training recipe, we pre-trained the model with the tokenized ImageNet-21k dataset for 270 epochs and then fine-tuned the model for 100 epochs on both the token and full-pixel datasets, using learning rates of 0.00001 with a 4096 batch size and 0.0005 with a 1024 batch size, respectively. We provide the more detailed hyper-parameter settings of our experiments in Appendix B.1.
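A minimal sketch of the token ImageNet-1k optimization setup described above (learning rate 0.0015, cosine scheduling, weight decay 0.1, 300 epochs, batch size 1024). The optimizer family (AdamW) and the omission of warm-up are assumptions here — the text states only that it follows the DeiT recipe — and `model` is a placeholder for ViT-B/16 with the Stem-Adapter stem.

```python
import torch

def build_optimizer(model: torch.nn.Module, epochs: int = 300):
    """Optimizer and schedule matching the stated hyper-parameters (warm-up omitted)."""
    opt = torch.optim.AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.1)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```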
5. Conclusion

In this paper, we propose Storage-efficient Vision Training (SeiT), which stores images as tokens. In practice, we store each image in about 1kB as a 32 × 32 token sequence and propose an efficient and fast encoding and decoding strategy for the token data type. We also propose token augmentations and the Stem-Adapter to train vision transformers with minimal modifications from the highly optimized pixel-based training. Our experiments show that, compared to other storage-efficient training methods, SeiT shows significantly large gaps; with the same amount of storage, SeiT shows the best performance among the comparison methods. Our method also shows benefits in other practical scenarios, such as storage-efficient large-scale pre-training and continual learning at scale.
References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[4] Nicola Asuni and Andrea Giachetti. Testimages: a large-scale archive for testing visual devices and basic image processing algorithms. In STAG, pages 63–70, 2014.
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274–283. PMLR, 2018.
[6] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.
[7] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
[8] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
[9] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[10] Shweta Bhardwaj, Mukundhan Srinivasan, and Mitesh M Khapra. Efficient video classification using fewer frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 354–363, 2019.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[12] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7939–7948, 2020.
[13] Sanghyuk Chun, Seong Joon Oh, Sangdoo Yun, Dongyoon Han, Junsuk Choe, and Youngjoon Yoo. An empirical evaluation on robustness and uncertainty of regularization methods. ICML Workshop on Uncertainty and Robustness in Deep Learning, 2019.
[14] Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829, 2019.
[15] Eastman Kodak Company. Kodak lossless true color image suite (photocd pcd0992).
[16] Thomas M Cover. Elements of information theory, chapter 5. John Wiley & Sons, 1999.
[17] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206–2216. PMLR, 2020.
[18] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
[19] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), volume 1, pages 886–893. IEEE, 2005.
[20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[21] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
[22] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019.
[23] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[24] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[26] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.
[27] Grant Van Horn, Oisin Mac Aodha, Yang Song, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The inaturalist species classification and detection dataset. arXiv preprint arXiv:1707.06642, 2017.
[28] Grant Van Horn, Oisin Mac Aodha, Yang Song, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The inaturalist challenge 2018 dataset. arXiv preprint arXiv:1707.06642, 2018.
[29] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
[30] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017.
[31] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[32] Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. Characterizing structural regularities of labeled data in overparameterized models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5034–5044. PMLR, 18–24 Jul 2021.
[33] Jun-Hyuk Kim, Byeongho Heo, and Jong-Seok Lee. Joint global and local hierarchical priors for learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5992–6001, 2022.
[34] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems, 31, 2018.
[35] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
[36] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[37] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
[38] Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. In International Conference on Machine Learning (ICML), 2022.
[39] Shiye Lei and Dacheng Tao. A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603, 2023.
[40] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems, 29, 2016.
[41] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
[42] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[43] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018.
[44] Feng Mao, Xiang Wu, Hui Xue, and Rong Zhang. Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018.
[45] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6950–6960. PMLR, 13–18 Jul 2020.
[46] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729, 2008.
[47] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34, 2021.
[48] Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
[49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[50] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[51] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[52] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
[53] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[54] Andrea Rosasco, Antonio Carta, Andrea Cossu, Vincenzo Lomonaco, and Davide Bacciu. Distilled replay: Overcoming forgetting through synthetic samples. In Continual Semi-Supervised Learning: First International Workshop, CSSL 2021, Virtual Event, August 19–20, 2021, Revised Selected Papers, pages 104–117. Springer, 2022.
[55] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
[56] Mattia Sangermano, Antonio Carta, Andrea Cossu, and Davide Bacciu. Sample condensation in online continual learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 01–08. IEEE, 2022.
[57] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[58] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In NeurIPS Data-Centric AI Workshop, 2021.
[59] Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Michael Poli, and Sangdoo Yun. Which shortcut cues will dnns choose? a study from the parameter-space perspective. In International Conference on Learning Representations (ICLR), 2022.
[60] Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, and Laurens Van Der Maaten. Revisiting weakly supervised pre-training of visual perception models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 804–814, 2022.
[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[62] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.
[63] Damien Teney, Peter Anderson, Xiaodong He, and Anton Van Den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4223–4232, 2018.
[64] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794. Springer, 2020.
[65] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
[66] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 516–533. Springer, 2022.
[67] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
[68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[69] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019.
[70] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
[71] Jason Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196, 2019.
[72] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.
[73] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
[74] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling imagenet: from single to multi-labels, from global to localized labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2340–2350, 2021.
[75] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, 2012.
[76] Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41, 2020.
[77] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, 2021.
[78] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In International Conference on Learning Representations, 2021.