
FastVLM: Efficient Vision Encoding for Vision Language Models

Pavan Kumar Anasosalu Vasu⋆† Fartash Faghri⋆ Chun-Liang Li⋆ Cem Koc⋆
Nate True Albert Antony Gokul Santhanam James Gabriel Peter Grasch
Oncel Tuzel⋆ Hadi Pouransari⋆†
Apple
arXiv:2412.13303v2 [cs.CV] 15 May 2025

{panasosaluvasu,fartash,chunliang_li,cem_koc,otuzel,mpouransari}@apple.com



⋆ Core authors; † Project lead

Abstract

Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between resolution, latency, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152×1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU and DocVQA, using the same 0.5B LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller. Code and models are available at https://github.com/apple/ml-fastvlm.

[Figure 1 plot: Avg-5 VLM Evals (%) vs. Time To First Token (ms) for FastViTHD (ours), ViT-L/14, SigLIP-SO400M, ConvNeXt-XXL, and ConvNeXt-L; panels (a) Qwen2-0.5B and (b) Vicuna-7B.]

Figure 1. FastVLM is more than 3× faster than prior work. Comparison of commonly used vision encoders for VLMs with (a) Qwen2 [86] 0.5B LLM and (b) Vicuna 7B [98] LLM. All the vision encoders are CLIP [69] pretrained. For a fair comparison, all models are trained using the LLaVA-1.5 [53] setup with the vision encoders made trainable for resolution adaptation; see Sec. 4 for more details. Marker size for each model corresponds to the number of parameters of the vision encoder. The x-axis is the sum of vision encoder latency and LLM prefilling time. All models are benchmarked on an M1 MacBook Pro.
1. Introduction

Vision Language Models (VLMs) enable visual understanding alongside textual inputs. VLMs are often built by passing visual tokens from a pretrained vision backbone to a pretrained Large Language Model (LLM) through a projection layer. Previous works [53, 54] have explored various training and fine-tuning strategies for these three components: the vision backbone, the projection layer, and the LLM, which is typically a decoder-only transformer [84] model.

Several studies [28, 61, 66] highlight image resolution as a key factor in VLM performance, especially for text- and chart-rich data. However, increasing image resolution presents multiple challenges. First, pretrained vision encoders may not support high-resolution images, as this would make pretraining inefficient. To address this, one approach is to continuously pretrain the vision backbone to adapt it for high resolutions [6]. Alternatively, tiling strategies, such as Sphinx [52], S2 [72], and AnyRes [53], divide images into subregions, with each subregion processed independently by the backbone.

A further challenge is the runtime computational cost associated with high-resolution inference. Both single high-resolution inference and multiple inferences at lower resolution (the tiling strategy) result in significant latency when generating visual tokens. Additionally, high-resolution images naturally produce more tokens, which increases the LLM prefilling time (the LLM forward pass time on all tokens in the context, including visual tokens), thereby further increasing the time-to-first-token (TTFT), which is the sum of the vision encoder latency and the LLM prefilling time.
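Written out explicitly (this simply restates the definition above, with N_vis denoting the number of visual tokens and N_text the number of text tokens in the context):

    TTFT = T_vision(image resolution) + T_prefill(N_vis + N_text, LLM size)

Both terms grow with image resolution: the first directly through the encoder's cost, and the second through the number of visual tokens the encoder emits.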
In this work, we study VLM design and training from a runtime efficiency perspective. We explore the optimization landscape as image resolution increases, aiming to improve the accuracy-latency trade-off, where latency includes both the vision encoder inference time and the LLM prefilling time. Using extensive experiments with different LLM sizes and resolutions, we establish the Pareto-optimal curve for a specific vision backbone, showing the best accuracy achievable within a given runtime budget (TTFT) based on different choices of resolution and LLM size.

We start by exploring the use of a hybrid convolutional-transformer architecture, FastViT [82], pretrained with MobileCLIP [83], as a vision backbone for the VLM setup (Section 3.1). We demonstrate the potential of this hybrid backbone, which generates visual tokens over 4× faster than a ViT model while achieving higher overall VLM accuracy with multi-scale features (Section 3.1.1).

However, further architectural optimization is possible when the primary goal is a high-resolution VLM (rather than embedding generation as in MobileCLIP-pretrained FastViT). We introduce a new hybrid vision encoder, FastViTHD, specifically designed for efficient VLM performance on high-resolution images (Section 3.2), and use it as the vision backbone to obtain FastVLM through visual instruction tuning. FastVLM demonstrates a significantly improved accuracy-latency trade-off over VLMs based on ViTs, convolutional encoders, and our previously discussed hybrid FastViT for different input image resolutions and LLM sizes (Figs. 1a, 1b and 4). In particular, FastVLM outperforms several prior works while being smaller, faster, and trained with less data (Table 6). Compared to LLaVA-OneVision [45] operating at the highest possible resolution (1152×1152), FastVLM obtains comparable performance with the same 0.5B LLM, but with 85× faster TTFT and a 3.4× smaller vision encoder.

The following is a summary of our contributions:
• We show that hybrid vision backbones outperform ViTs in VLMs, and introduce additional architectural interventions, such as multi-scale vision features, to further improve VLM performance while maintaining efficiency.
• We design and pretrain a new hybrid architecture, FastViTHD, optimized for efficient VLM performance with high-resolution input for FastVLM. In a controlled experimental setup, where only the vision backbone is changed, we show that FastViTHD outperforms its ViT-based and convolution-based counterparts when used in VLMs: achieving 3.2× faster TTFT and 3.6× smaller size than SigLIP-SO400M [94], and 2.3× faster TTFT and 1.7× smaller size than ConvNeXT [28]. We further demonstrate that FastVLM scales effectively as more visual instruction tuning data becomes available.
• We systematically study the VLM accuracy-latency trade-off by considering both the vision backbone latency and the LLM prefilling time on actual hardware benchmarks. Our results demonstrate an improved resolution-latency-accuracy trade-off achieved by FastVLM, measured on-device rather than estimated.

2. Related Works

Large Multimodal Models. With the emergence of large language models [68, 77, 79, 86, 98] and large pretrained vision models, such as CLIP [69], trained on web-scale image-text datasets, several multimodal architectures have been proposed to encode images aligned with a large language model (LLM) to enable the interpretation of visual signals. Earlier works like Frozen [80] and Florence [1, 2] used a cross-attention mechanism where the image embeddings are fused with text embeddings in intermediate layers of the LLM. More recently, auto-regressive architectures have gained popularity, where the image embedding is fed alongside text as input to an LLM. Some prominent works that use this architecture are LLaVA [53–55], mPLUG-Owl [88–90], InstructBLIP [20], BLIP-3 [85], SPHINX [52], MiniGPT-4 [99], VILA [50], MM1 [66], Qwen-VL [4], InternVL [15, 16] and Cambrian-1 [78]. Recently, Fuyu [5] and EVE [23] introduced a simplified architecture that passes raw images directly to the LLM decoder. Chameleon [76] introduced early-fusion mixed-modal models where images are tokenized using a pretrained codebook. While skipping the image encoder is an intriguing
approach, the performance of this new class of models lags behind architectures that use a pretrained image encoder.

Efficient Image Encoding. CLIP [69] pretrained vision transformers [24] are widely used for encoding images in vision-language models, with popular choices including SigLIP [94], EVA-CLIP [75], InternViT [15] and DFN-CLIP [26]. To enhance performance, recent works [36, 73, 78] employ ensembles of vision encoders trained with different objectives. These works are orthogonal to our work as they can benefit from using an efficient vision encoder among the ensemble of vision encoders. Since ViT-based architectures are a popular choice for VLMs, inefficiencies arise from the number of visual tokens output by the encoder, prompting methods like LLaVA-PruMerge [70] and Matryoshka-based token sampling [7, 32] to dynamically prune tokens. Other approaches [9, 18–20] reduce tokens using perceiver-style resamplers or pooling techniques. Rather than using an isotropic architecture like ViT and then designing custom resamplers and projectors, hierarchical architectures can be a simpler design choice. Hierarchical backbones like ConvNeXT [57] and FastViT [82] produce fewer tokens as they downsample the input tensor at every stage of compute. Recently, ConvLLaVA [28] was introduced, which uses a pure-convolutional vision encoder to encode images for a VLM. In our work, we introduce an improved convolution-transformer hybrid architecture for VLMs and discuss the Pareto-optimal operating points when this architecture is scaled to higher input resolutions.

3. Architecture

In this section, we first explore the adoption of the FastViT hybrid vision encoder for vision-language modeling. We then introduce architectural interventions to improve performance on VLM tasks. We present FastViTHD, a new hybrid vision encoder designed for an efficient high-resolution VLM. We provide comprehensive ablations to demonstrate the optimality of FastViTHD over FastViT and prior works for different LLMs and input resolutions. Figure 2 illustrates the overall architecture of FastVLM and FastViTHD. The training setup for all results in this section follows the same configuration as LLaVA-1.5 [53] with Vicuna-7B [98] as the LLM decoder, unless mentioned otherwise. See Sec. 4 for more details.

3.1. FastViT as VLM Image Encoder

VLMs such as LLaVA have three main components: an image encoder, a vision-language projector, and a large language model (LLM). Both the performance and runtime efficiency of a VLM highly depend on its vision backbone. Encoding images at high resolution is essential for achieving strong performance across various VLM benchmarks, especially for text-rich tasks. Therefore, a vision encoder with scalable resolution is particularly beneficial for VLMs.

Image Encoder | Input Res. | #Visual Tokens | Latency Enc. (ms)↓ | GQA | TextVQA | POPE | DocVQA | SeedBench-I | Avg-5
ViT-L/14 | 336 | 576 | 127.4 | 62.0 | 58.2 | 85.9 | 28.1 | 66.1 | 60.1
ViT-L/14† | 336 | 576 | 127.4 | 63.5 | 59.2 | 86.3 | 28.7 | 68.6 | 61.2
FastViT | 256 | 64 | 3.0 | 60.2 | 51.6 | 82.9 | 15.8 | 61.5 | 54.4
FastViT | 768 | 576 | 34.5 | 62.7 | 62.3 | 86.5 | 34.4 | 67.1 | 62.6

Table 1. FastViT has higher accuracy than ViT-L/14 at nearly 4× lower latency. To scale resolution up to 768, FastViT is made trainable during Stage-2 training of the LLaVA-1.5 setup. †To have a fair comparison, we also report the performance of ViT-L/14 finetuned during Stage-2 training of LLaVA-1.5. All latencies are reported in milliseconds. See Sec. 4 for details.

Image Encoder | Multi-Scale | Pool Type | GQA | TextVQA | POPE | DocVQA | SeedBench-I | Avg-5
FastViT | - | - | 62.7 | 62.3 | 86.5 | 34.4 | 67.1 | 62.6
FastViT | ✓ | AvgPool | 63.0 | 62.2 | 86.2 | 35.1 | 66.9 | 62.7
FastViT | ✓ | DWConv | 63.0 | 62.5 | 86.8 | 34.7 | 67.4 | 62.9

Table 2. Pushing FastViT VLM performance using multi-scale features and pooling strategies. These modifications slightly improve FastViT. Training setup is LLaVA-1.5 with Vicuna 7B.

We identify hybrid vision encoders (convolutional layers followed by transformer blocks) as an ideal candidate for VLMs, as their convolutional component enables native resolution scaling, and their transformer blocks further refine high-quality visual tokens for consumption by the LLM. We use a CLIP-pretrained hybrid vision encoder, specifically the MCi2 image encoder from MobileCLIP [83], which has 35.7M parameters and is based on the FastViT architecture. For simplicity, we refer to this encoder as "FastViT" throughout the rest of the paper. As shown in Tab. 1, using FastViT at its CLIP-pretrained resolution (256×256) alone does not yield a strong VLM. The main advantage of a hybrid encoder like FastViT lies in its favorable image resolution scaling characteristics, meaning it generates 5.2× fewer tokens than the ViT architecture with a patch size of 14. The token reduction gives a significant advantage to VLMs, as it reduces the prefilling time and time-to-first-token (TTFT) of the transformer decoders. When the input resolution of FastViT is scaled to 768×768, it produces the same number of visual tokens as ViT-L/14 with an input resolution of 336×336 but achieves better performance on VLM benchmarks. This performance gap is even more pronounced on text-rich benchmarks like TextVQA and DocVQA, despite both architectures producing the same number of visual tokens. Moreover, even with the same token count at higher resolution, it encodes images much faster due to efficient convolution layers.
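As a concrete illustration of the token counts discussed above, the numbers in Tab. 1 follow from simple patch/stride arithmetic (a small sketch; the output stride of 32 for FastViT is the value implied by its token counts, and 14 is the ViT-L/14 patch size):

```python
def num_visual_tokens(resolution: int, output_stride: int) -> int:
    """One visual token per cell of the encoder's output grid (square input assumed)."""
    side = resolution // output_stride
    return side * side

# ViT-L/14 at 336 px: 336/14 = 24  ->  24*24 = 576 tokens.
print(num_visual_tokens(336, 14))   # 576
# FastViT (output stride 32): 256 px -> 64 tokens, 768 px -> 576 tokens, matching Tab. 1.
print(num_visual_tokens(256, 32))   # 64
print(num_visual_tokens(768, 32))   # 576
# At the same input resolution the ratio is (32/14)^2, i.e. roughly 5.2x fewer tokens.
print((32 / 14) ** 2)               # ~5.22
```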
3.1.1. Multi-Scale Features
Figure 2. Overview of the FastVLM architecture. FastVLM consists of our novel vision encoder, FastViTHD, trained using the same setup as LLaVA. The FastViTHD architecture is designed for low latency at high resolution, by utilizing additional self-attention layers, and downsampling to generate 4× fewer tokens than FastViT, and 16× fewer tokens than ViT-L/14 at resolution 336. (Schematic: a convolutional stem, three RepMixer stages and two self-attention stages separated by stride-2 patch-embedding layers, learned pooling with channel-wise concatenation of multi-scale features, a learned projection into the language model, and the LLM consuming the visual tokens together with the tokenized instruction/question.)

Typical convolutional and hybrid architectures split up the computations into 4 distinct stages with a downsampling operation between them. While the VLM relies on features from the penultimate layer, features in earlier stages of the network extract information at different granularity. Aggregating information from multiple scales can complement high-level features from the penultimate layer.

The architecture for multi-scale feature extraction is shown in Fig. 2. We ablate between 2 designs to pool features from different stages, i.e. AvgPooling and 2D depthwise convolutions. From Tab. 2, we find that using depthwise convolutions results in better performance.
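The aggregation described above can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the authors' implementation: the module name, channel counts, and the use of strided depthwise convolutions to bring earlier stages down to the final grid size are assumptions based on Fig. 2 and Tab. 2.

```python
import torch
import torch.nn as nn

class MultiScalePool(nn.Module):
    """Pool features from earlier stages to the last stage's grid and concatenate channel-wise."""

    def __init__(self, stage_channels=(96, 192, 384, 768), strides=(8, 4, 2, 1)):
        super().__init__()
        # One depthwise conv per stage; its stride covers the remaining downsampling
        # needed to reach the final feature-map resolution.
        self.pools = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=3, stride=s, padding=1, groups=c)
             for c, s in zip(stage_channels, strides)]
        )

    def forward(self, feats):
        # feats: list of [B, C_i, H_i, W_i] tensors from progressively deeper stages.
        pooled = [pool(f) for pool, f in zip(self.pools, feats)]
        return torch.cat(pooled, dim=1)  # [B, sum(C_i), H_last, W_last]

# Toy usage with a 256 px input and stage strides of 4/8/16/32 (spatial sizes 64/32/16/8):
feats = [torch.randn(1, c, s, s) for c, s in zip((96, 192, 384, 768), (64, 32, 16, 8))]
out = MultiScalePool()(feats)
print(out.shape)  # torch.Size([1, 1440, 8, 8])
```

Swapping the depthwise convolutions for average pooling gives the AvgPool variant ablated in Tab. 2; the concatenated features are then passed to the projector.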
Image Encoder | Encoder Size (M)↓ | Input Res. | Latency Enc. (ms)↓ | Zero-Shot ImageNet | Avg Perf. Retrieval | Avg Perf. on 38 tasks
ViT-L/14 [27] | 304 | 224 | 47.2 | 79.2 | 60.8 | 66.3
ViTamin-L [12] | 333 | 224 | 38.1 | 80.8 | 60.3 | 66.7
ConvNeXt-L | 200 | 320 | 34.4 | 76.8 | 64.8 | 63.9
FastViTHD | 125 | 224 | 6.8 | 78.3 | 67.7 | 66.3

Table 3. FastViTHD achieves competitive results on CLIP benchmarks at significantly lower latency. We follow the setup described in [12] to report average retrieval performance and the setup described in [27] to report average performance on 38 tasks. All models are benchmarked on an M1 MacBook Pro.

[Figure 3 plot: image encoding latency (ms, log scale) vs. input image resolution (px) for FastViT-Naive Scaling, ConvNeXt-L, and FastViTHD (ours).]

Figure 3. Novel scaling strategy of FastViTHD lowers latency at various image resolutions. FastViT-Naive, a naive scaling of the FastViT architecture, and our proposed FastViTHD have the same number of parameters. ConvNeXt-L is provided for reference. All models are benchmarked on an M1 MacBook Pro and trained with the LLaVA-1.5 setup and Vicuna 7B. Note that the y-axis is in log scale.

3.2. FastViTHD: High Resolution Encoder for VLM

While FastViT with the introduced model interventions performs well as an image encoder that is 8.7× smaller than ViT-L/14, previous studies [15, 47] have demonstrated that increasing the scale of the image encoder improves its generalization capabilities.

Hybrid architectures [17, 92] typically scale the number of self-attention layers and width in a 4-stage design, but this has drawbacks. From Fig. 3, simply scaling up the number of self-attention layers in stages 3 and 4 of FastViT, as done in prior works, is suboptimal and slower than ConvNeXT-L. To mitigate this, we introduce an extra stage with a downsampling layer, ensuring self-attention operates on tensors downsampled by a factor of 32, rather than 16 as in recent models like ViTamin; see Fig. 2. More details on the naive scaling approach can be found in Sec. B. Our design reduces image encoding latency and generates 4× fewer tokens for the compute-intensive LLM decoder, thereby decreasing the time-to-first-token (TTFT). The architecture schematic is shown in Fig. 2, and we call this model FastViTHD.

The model architecture consists of 5 stages, as shown in Fig. 2, with the first three stages utilizing RepMixer [82] blocks and the last two stages employing multi-headed self-attention [24] blocks. The model depth at each stage is [2, 12, 24, 4, 2], and the embedding dimensions for each stage are [96, 192, 384, 768, 1536]. The MLP expansion ratio for the ConvFFN layers is set to 4.0. The model has 125.1M parameters, which is 3.5× larger than the largest FastViT variant from MobileCLIP, but is still smaller than popular ViT alternatives.
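For reference, the stage configuration spelled out above can be collected in a small illustrative summary. The field names below are ours, not the authors'; the values follow the text, and the overall output stride is inferred from the reported token counts rather than stated explicitly.

```python
# Illustrative summary of FastViTHD as described above (field names are ours).
FASTVITHD_CONFIG = {
    "num_stages": 5,
    "block_types": ["RepMixer", "RepMixer", "RepMixer", "Attention", "Attention"],
    "depths": [2, 12, 24, 4, 2],
    "embed_dims": [96, 192, 384, 768, 1536],
    "convffn_expansion_ratio": 4.0,
    "params_millions": 125.1,
    # Inferred from Tab. 4: e.g. a 1024 px input yields (1024 / 64)**2 = 256 tokens,
    # i.e. the final token grid sits at an overall output stride of 64 (4x fewer
    # tokens than FastViT's stride-32 grid at the same resolution).
    "output_stride": 64,
}
```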
We follow the CLIP pretraining setup of [83] using the DataCompDR-1B dataset to pretrain FastViTHD before employing it for FastVLM training. Table 3 shows that FastViTHD, despite being 2.4× smaller and 6.9× faster than ViT-L/14, achieves comparable average performance across 38 multi-modal zero-shot tasks [27]. In comparison to ViTamin [12], a hybrid transformer architecture built for VLMs, FastViTHD delivers superior average retrieval performance while being 2.7× smaller and 5.6× faster. In Tab. 4, we compare FastViTHD with other CLIP-pretrained hierarchical backbones, i.e. ConvNeXT-L and ConvNeXT-XXL, for VLM tasks after LLaVA-1.5 training. FastViTHD performs as well as ConvNeXT-XXL while being 6.8× smaller and 3.3× faster.
Image Encoder | Input Res. | Latency Enc. (ms)↓ | #Visual Tokens | GQA | TextVQA | POPE | DocVQA | SeedBench-I | Avg-5
FastViTHD | 256 | 10.1 | 16 | 60.6 | 53.1 | 82.3 | 17.4 | 63.7 | 55.5
C.N-L | 320 | 34.4 | 100 | 61.9 | 55.5 | 85.3 | 21.3 | 64.6 | 57.7
C.N-XXL | 256 | 89.9 | 64 | 62.7 | 56.3 | 85.3 | 21.6 | 65.6 | 58.3
FastViTHD | 512 | 33.5 | 64 | 63.0 | 59.3 | 86.4 | 25.7 | 67.1 | 60.4
FastViTHD | 768 | 122.6 | 144 | 62.4 | 62.9 | 87.7 | 32.9 | 68.2 | 62.8
C.N-L | 512 | 71.9 | 256 | 61.8 | 61.0 | 86.3 | 30.8 | 66.8 | 61.3
C.N-XXL | 512 | 397.1 | 256 | 62.3 | 65.1 | 87.7 | 36.2 | 68.4 | 63.9
FastViTHD | 1024 | 235.6 | 256 | 63.1 | 64.4 | 88.1 | 35.6 | 68.5 | 63.9

Table 4. FastViTHD achieves higher accuracy than ConvNeXT while having lower latency at a higher resolution. The models are grouped based on the total number of visual tokens produced for the LLM to process. "C.N" stands for ConvNeXT. Training setup is LLaVA-1.5 with Vicuna 7B.

[Figure 4 plot: Avg-5 VLM Evals (%) vs. time to first token (ms, log scale) for FastViT and FastViTHD paired with Qwen2-0.5B/1.5B/7B at various input resolutions.]

Figure 4. FastViTHD improves the Pareto-optimal curve for accuracy versus time to first token compared with FastViT. Comparison of FastViT and FastViTHD backbones paired with Qwen2 [86] family (chat variant) LLMs of varying sizes and different image resolutions (annotated for each point). The Pareto-optimal curve is highlighted for the two vision backbones. Training setup is LLaVA-1.5. Note that the x-axis is in log scale.

[Figure 5 plot: latency (ms, log scale) vs. input image resolution (px), split into vision latency and LLM prefilling.]

Figure 5. Vision latency dominates at high resolution. Breakdown of FastVLM's time to first token for varying image resolutions. Vision encoder is FastViTHD and LLM is Qwen2-1.5B.

[Figure 6 plot: Avg-5 VLM Evals (%) vs. time to first token (ms, log scale) for static resolution versus dynamic resolution (AnyRes) with various tile grids.]

Figure 6. Dynamic input resolution (AnyRes) is only optimal at the highest resolution when using fewer tiles (2×2). The vision encoder is FastViTHD. The tile grid size is specified in parenthesis. Training setup is LLaVA-1.5 with Vicuna 7B. Note that the x-axis is in log scale.

3.2.1. Vision Encoder - Language Decoder Interplay

The accuracy-latency trade-off in a VLM is influenced by several factors. On one hand, the overall performance of the VLM depends on (1) the input image resolution, (2) the quantity and quality of visual tokens, and (3) the capability of the LLM. On the other hand, the total latency (time to first token generation) of a VLM is determined by (1) the latency of the vision encoder and (2) the prefilling time of the LLM. The latter is affected by both the number of tokens produced by the vision encoder and the size of the LLM.

Due to the complex optimization landscape of VLMs, claims regarding the optimality of a vision encoder must be verified across various pairs of (Resolution, LLM). Here, we empirically demonstrate the optimality of FastViTHD over FastViT. For each vision encoder, we consider three LLMs, Qwen2 [86]-0.5B/1.5B/7B, along with a range of input image resolutions. For each (Resolution, LLM) pair, we conduct LLaVA-1.5 [53] pretraining and visual instruction tuning, and evaluate the resulting model over a range of tasks. The results are presented in Fig. 4.

First, we observe that for a vision encoder, the Pareto-optimal curve (highlighted in Fig. 4), which represents the maximum achievable performance for a given runtime budget (TTFT), consists of varying sizes of LLMs. Specifically, pairing high resolution with a small LLM is suboptimal as a small LLM cannot effectively utilize that many tokens, and TTFT will be dominated by the latency of the vision
encoder (see Fig. 5).

Second, the Pareto-optimal curve for FastViTHD in Fig. 4 is significantly better than that of FastViT. For a given runtime budget, considering all possible (Resolution, LLM) pairs, we achieve significantly better performance (an improvement of over 2.5 points on the Average-5 metric) with FastViTHD. Similarly, FastViTHD can reach a target VLM performance up to 3× faster. It is important to note that in previous sections, we demonstrated that a FastViT-based VLM already represents a significant improvement over ViT-based VLMs, and yet FastViTHD provides substantial gains over FastViT.

3.2.2. Static vs. Dynamic Input Resolution

There are two ways to scale input resolution: adjusting the model's input resolution directly, or tiling the image and setting the encoder's resolution to the tile size. The tiled inference (AnyRes) was introduced in prior works [52, 54] to enable ViT models to process high-resolution images. Since FastViTHD is designed to run inference efficiently at high input resolutions, we analyze the optimal operating point for various resolutions using the two strategies. From Fig. 6, we see that setting the model's input resolution directly to the desired resolution offers the best accuracy-latency trade-off, with dynamic resolution benefiting only at extreme resolutions like 1536×1536, due to memory bandwidth limitations. If dynamic resolution is desired, using a setting with fewer tiles exhibits a better accuracy-latency trade-off. Further discussion on this setup is presented in Sec. C.1.
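The token budgets behind the two strategies compared above can be worked out directly. The sketch below assumes FastViTHD's output stride of 64 (inferred from Tab. 4) and ignores any extra low-resolution base-image tokens that some AnyRes variants prepend:

```python
def tokens_static(resolution: int, output_stride: int = 64) -> int:
    """Visual tokens when the encoder's input resolution is set directly."""
    return (resolution // output_stride) ** 2

def tokens_tiled(tile: int, grid: int, output_stride: int = 64) -> int:
    """Visual tokens under AnyRes-style tiling: a grid x grid arrangement of tiles,
    each encoded independently at `tile` px."""
    return grid * grid * (tile // output_stride) ** 2

print(tokens_static(1024))    # 256
print(tokens_static(1536))    # 576
print(tokens_tiled(768, 2))   # 576   (1536x1536 covered as a 2x2 grid of 768 px tiles)
print(tokens_tiled(1024, 2))  # 1024  (the 2x2 grid with 1024 px tiles used for FastVLM's dynamic-resolution rows)
```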
Model | Input Res. | #Visual Tokens | GQA | SQA | TextVQA | POPE | VQAv2 | SeedBench
ViT-L/14 M3 [7] | 336 | 9 | 58.0 | - | - | 83.4 | - | 55.4
ViT-L/14 MQT [32] | 336 | 16 | 57.6 | 67.5 | - | 80.8 | 71.1 | -
FastViTHD | 256 | 16 | 60.6 | 69.2 | 53.1 | 82.3 | 74.7 | 58.8
ViT-L/14 PruMerge [70] | 336 | 40 | - | 68.5 | 56.0 | 76.3 | 72.0 | -
ViT-L/14 PruMerge+ [70] | 336 | 40 | - | 68.3 | 57.1 | 84.0 | 76.8 | -
ViT-L/14 M3 [7] | 336 | 36 | 60.3 | - | - | 85.5 | - | 58.0
FastV [14] | 336 | 64 | 46.1 | 51.1 | 47.8 | 48.0 | 55.0 | 51.9
SparseVLM [97] | 336 | 64 | 52.7 | 62.2 | 51.8 | 75.1 | 68.2 | 51.1
VisionZip [87] | 336 | 64 | 55.1 | 69.0 | 55.5 | 77.0 | 62.9 | 52.2
VisionZip‡ [87] | 336 | 64 | 57.0 | 68.8 | 56.0 | 80.9 | 74.2 | 53.4
DynamicLLaVA-I [33] | 336 | 115 | 61.4 | 69.1 | 57.0 | 85.0 | 78.0 | -
DynamicLLaVA-I|T [33] | 336 | 115 | 61.3 | 68.6 | 56.5 | 85.9 | 77.9 | -
FastViTHD | 512 | 64 | 63.0 | 68.9 | 59.3 | 86.4 | 78.0 | 61.8
ViT-L/14 M3 [7] | 336 | 144 | 61.3 | - | - | 87.0 | - | 59.7
ViT-L/14 MQT [32] | 336 | 144 | 61.4 | 67.6 | - | 83.9 | 76.4 | -
FastV [14] | 336 | 192 | 52.7 | 67.3 | 52.5 | 64.8 | 67.1 | 57.1
SparseVLM [97] | 336 | 192 | 57.6 | 69.1 | 56.1 | 83.6 | 75.6 | 55.8
VisionZip [87] | 336 | 192 | 59.3 | 68.9 | 57.3 | 85.3 | 76.8 | 56.4
VisionZip‡ [87] | 336 | 192 | 60.1 | 68.2 | 57.8 | 84.9 | 77.4 | 57.1
FastViTHD | 768 | 144 | 62.4 | 67.6 | 62.9 | 87.7 | 78.9 | 62.5
ViT-L/14 MQT [32] | 336 | 256 | 61.6 | 67.5 | - | 84.4 | 76.8 | -
FastViTHD | 1024 | 256 | 63.1 | 67.4 | 64.4 | 88.1 | 79.2 | -

Table 5. FastViTHD more effectively reduces tokens compared with token pruning methods. The models are grouped based on the total number of visual tokens. "-" indicates that performance was not reported in the respective paper. All models presented in this table are trained using the LLaVA-1.5 setup with Vicuna 7B. ‡ indicates further finetuning as reported in [87]. I indicates vision-only sparsification and I|T indicates vision-language sparsification, as reported in [33].

3.2.3. Comparison with Token Pruning & Downsampling

We further compare the performance of FastViTHD operating at different resolutions to popular token pruning methods in the literature. From Tab. 5, we find that VLMs achieve a better accuracy-to-latency trade-off using a hierarchical backbone as opposed to using token pruning methods on isotropic architectures like ViT. By simply training the VLMs at lower input resolution, FastViTHD achieves visual token counts as low as 16, while improving over recent token pruning methods. Interestingly, even the most effective token pruning methods, such as those proposed by [7, 32, 33, 87], perform worse than FastViTHD trained at a lower input resolution of 256×256.

4. Experiments

Training Setup. For all the ablations presented in Sec. 3, we follow the 2-stage setup described in LLaVA-1.5 [53] with Vicuna-7B [98] as the LLM decoder, unless mentioned otherwise. During the first stage, only the projector is trained using the LLaVA-558K alignment dataset for one epoch, with a batch size of 256 and a learning rate of 10^-3. At this stage, the input image resolution matches the backbone pretraining resolution (e.g., 256 for FastViT and 224 for FastViTHD). In the second stage, we use the LLaVA-665K supervised finetuning dataset, training the models for one epoch and tuning all the modules, i.e., the vision encoder, projector and the LLM. At this stage, the input image resolution is set to the target resolution.

In Sec. 4, we present results with different LLM decoders, primarily with the Qwen2-0.5B/1.5B/7B model family [86] (chat variant) and the Vicuna-7B model [98]. We report results in two training setups; the first one is the 2-stage setup introduced in LLaVA-1.5. For the second training setup, we follow the current trend in the literature [43, 66] of training the VLMs in 3 stages, i.e. Stage 1 for training the connector, Stage 1.5 for resolution scaling and Stage 2 for visual instruction tuning. Information on the datasets used in these stages can be found in Sec. D. In this setup, the input image resolution is set to the backbone pretraining resolution for Stage 1 and adjusted to the target resolution for the following two stages. In both setups, the vision encoder and LLM are frozen only in Stage 1, while all modules are finetuned in the remaining stages. For the best setup, we further finetune the model from Stage 2 with a high-quality instruction tuning dataset with chain-of-thought reasoning from [30] and call this Stage 3. Details of this setup are elaborated in Sec. A and Sec. D. We publicly release the R4, R12, and R41 checkpoints from Stage 2 training, as well as the R5, R13, and R42 checkpoints from Stage 3 training, as part of our open-sourced codebase.
Row | Method | Vision Encoder | LLM | Data (M) (PT+IT) | Input Res. | #Visual Tokens | Vis. Enc. Size (M)↓ | TTFT (ms)↓ | GQA | SQA | TextVQA | POPE | LLaVA-BenchW | MMVet | VQAv2 | DocVQA | MMMU | SeedBench-I
0.5B Model Comparison
R1 nanoLLaVA ViT-SO400M Qw.1.5 - 384 729 430 535 54.8 59.0 46.7 84.1 - - 70.8 - 30.4 -
R2 LLaVAOV [45]∗ ViT-SO400M Qw.2 4.5+3.2 1152 7290 430 14124 - 67.2 - - - 29.1 - 70.0 31.4 65.5
R3 FastVLM (Ours) FastViTHD Qw.2 15+1.1 1024 256 125 166 61.6 61.4 57.4 87.4 56.0 31.8 77.0 61.0 30.9 65.6
R4 FastVLM (Ours) FastViTHD Qw.2 15+12.5 1024 256 125 166 63.1 81.5 62.9 86.6 66.7 29.8 78.8 70.4 32.9 69.2
R5 FastVLM (Ours) FastViTHD Qw.2 15+23.1 1024 256 125 166 62.7 82.0 65.8 86.2 - 37.5 78.6 79.1 33.6 69.3
1-2B Model Comparison
R6 MobileVLMv2 [19] ViT-L/14 ML. 1.2+3.6 336 144 304 458 59.3 66.7 52.1 84.3 - - - - - -
R7 FastVLM (Ours) FastViTHD Qw.2 15+1.1 768 144 125 152 63.9 75.8 64.4 87.2 65.2 35.4 79.4 61.3 34.9 71.7
R8 FastVLM (Ours) FastViTHD Qw.2 15+12.5 768 144 125 152 64.2 87.9 66.4 88.3 68.5 41.1 80.1 69.4 38.1 72.4
R9 DeepSeekVL [58] ViT-SO400M DS. - 384 576 430 - - - - 87.6 - 34.8 - - 32.2 66.7
R10 MM1 [66]∗ ViT-H - 3000+1.5 1344 720 632 - - 62.3 68.2 87.4 67.5 39.4 - 68.4 33.2 65.6
R11 FastVLM (Ours) FastViTHD Qw.2 15+1.1 1024 256 125 233 64.2 74.8 66.0 88.0 66.9 37.6 79.9 67.7 33.1 71.4
R12 FastVLM (Ours) FastViTHD Qw.2 15+12.5 1024 256 125 233 64.3 89.9 69.0 87.8 68.0 43.6 80.7 75.6 39.1 73.1
R13 FastVLM (Ours) FastViTHD Qw.2 15+23.1 1024 256 125 233 64.3 92.9 71.2 87.8 - 51.0 80.9 84.2 38.0 72.6
7B Model Comparison
R14 InstructBLIP [20] ViT-g/14 Vic. 129+1.2 224 32 1012 302 49.2 60.5 50.1 - 60.9 26.2 - - 30.6 -
R15 FastVLM (Ours) FastViTHD Vic. 0.5+0.6 256 16 125 150 60.6 69.2 53.1 82.3 60.4 27.5 74.7 17.4 36.2 63.7
R16 FastVLM (Ours) FastViTHD Vic. 15+1.1 256 16 125 150 62.1 75.7 57.2 83.9 64.0 31.5 77.3 29.8 37.6 68.8
R17 MobileVLMv2 [19] ViT-L/14 Vic. 1.2+3.6 336 144 304 460 62.6 74.8 62.3 85.3 - - - - - -
R18 ConvLLaVA [28] ConvNeXT-L Vic. 4.9+0.6 768 144 200 496 - - 59.1 87.3 - 44.8 - 44.8 36.3 68.8
R19 FastVLM (Ours) FastViTHD Vic. 0.5+0.6 768 144 125 387 62.4 67.6 62.9 87.7 63.8 31.5 78.9 32.9 34.9 68.2
R20 FastVLM (Ours) FastViTHD Vic. 0.5+1.1 768 144 125 387 63.2 73.5 67.5 86.3 63.9 33.0 79.1 57.3 36.9 69.9
R21 FastVLM (Ours) FastViTHD Vic. 15+1.1 768 144 125 387 65.0 78.7 69.4 87.5 67.0 42.2 81.3 65.5 37.0 73.7
R22 FastVLM (Ours) FastViTHD Qw.2 15+1.1 768 144 125 446 65.6 85.9 69.5 87.2 73.0 41.3 81.3 66.9 43.6 75.3
R23 FastVLM (Ours) FastViTHD Qw.2 15+12.5 768 144 125 446 64.7 96.0 71.9 87.4 73.6 49.4 81.1 75.3 44.6 74.3
R24 Qwen-VL [4] ViT-G/14 Qw. 1400+50 448 256 1844 - 59.3 67.1 63.8 - - - 79.5 65.1 - -
R25 Qwen-VL-Chat [4] ViT-G/14 Qw. 1400+50 448 256 1844 - 57.5 68.2 61.5 - - - 78.2 62.6 - -
R26 ConvLLaVA [28] ConvNeXT-L Vic. 4.9+0.6 1024 256 200 1157 - - 62.5 87.7 - 44.4 - 48.5 35.1 69.3
R27 FastVLM (Ours) FastViTHD Vic. 0.5+0.6 1024 256 125 577 63.1 67.4 64.4 88.1 64.8 31.7 79.2 35.6 35.1 68.5
R28 FastVLM (Ours) FastViTHD Vic. 0.5+1.1 1024 256 125 577 63.3 74.1 67.4 87.1 66.5 32.4 79.3 62.8 37.3 69.9
R29 FastVLM (Ours) FastViTHD Vic. 15+1.1 1024 256 125 577 65.2 80.3 70.6 87.2 71.5 40.1 81.6 72.4 36.7 73.5
R30 FastVLM (Ours) FastViTHD Qw.2 15+1.1 1024 256 125 641 65.8 84.9 72.1 87.8 75.8 44.1 81.7 73.3 46.2 75.1
R31 LLaVA-1.5 [53] ViT-L/14 Vic. 0.5+0.6 336 576 304 1297 62.0 70.4 58.2 85.9 59.6 31.1 76.6 28.1 35.3 66.1
R32 MobileVLMv2 [19] ViT-L/14 Vic. 1.2+3.6 336 576 304 1297 64.6 74.8 66.8 86.1 - - - - - -
R33 ShareGPT4V [13] ViT-L/14 Vic. 1.2+0.7 336 576 304 1297 63.3 68.4 60.4 85.7 72.6 37.6 80.6 - - 69.7
R34 ViTamin [12] ViTamin-L Vic. 0.5+0.6 384 576 333 1308 61.6 67.6 59.8 85.5 66.1 33.6 78.9 - - -
R35 ConvLLaVA [28] ConvNeXT-L Vic. 4.9+0.6 1536 576 200 2740 - - 65.8 87.3 - 45.9 - 59.0 35.8 70.2
R36 VILA [50] ViT-L/14 L-2 50+1 336 576 304 1297 62.3 68.2 64.4 85.5 69.7 34.9 79.9 - - 62.8
R37 LLaVA-FlexAttn [46] ViT-L/14 Vic. 0.5+0.6 1008 576 304 - 62.2 - 48.9 85.9 - 29.4 78.7 - - -
R38 MM1 [66]∗ ViT-H - 3000+1.5 1344 720 632 - - 72.6 72.8 86.6 81.5 42.1 82.8 76.8 37.0 69.9
R39 LLaVA-NeXT†∗ ViT-L/14 L-3 - 672 2880 304 20347 65.2 72.8 64.6 - 80.1 - - 78.2 41.7 72.7
R40 FastVLM (Ours) FastViTHD Qw.2 15+6.5 1024 256 125 641 66.0 87.4 73.1 87.3 72.4 47.6 82.3 78.7 42.8 75.9
R41 FastVLM (Ours) FastViTHD Qw.2 15+12.5 1024 256 125 641 65.2 95.7 73.4 86.9 71.1 48.4 81.6 82.7 47.3 74.1
R42 FastVLM (Ours) FastViTHD Qw.2 15+23.1 1024 256 125 641 64.1 96.8 76.1 87.2 - 56.3 81.9 88.0 44.0 74.5
VLMs with Multiple Vision Encoders and 8B LLM
R43 | MiniGemini-HD† | ConvNeXT-L (1536, 200M) + ViT-L/14 (672, 304M) | L-3 | 1.5+1.5 | 2880 | 21832 | 64.5 | 75.1 | 70.2 | - | - | - | - | 74.6 | 37.3 | 73.2
R44 | Cambrian-1 [78] | ViT-SO400M (384, 430M) + ConvNeXt-XXL (1024, 846M) + DINOv2-ViT-L/14 (518, 304M) + ViT-L/14 (336, 304M) | L-3 | 2.5+7 | 576 | 5085 | 64.6 | 80.4 | 71.7 | - | - | - | - | 77.8 | 42.7 | 74.7

Table 6. VLM evaluations and comparison with recent methods. The models are grouped based on total number of visual tokens.
“-” indicates that performance was not reported in the respective paper. For the dataset column, “-” indicates that the dataset size for
pretraining (“PT”) or instruction tuning (“IT”) is not explicitly mentioned in the respective paper. For methods that have more than 2
stages of training, we report the total samples used for all the pretraining stages as part of “PT”. “TTFT” means time to first token (the
sum of the vision encoder latency and the LLM prefilling time); we report latency only for models that are publicly available and in a
format favorable to MLX [31]. “Vic.” refers to Vicuna [98], “Qw.2” refers to Qwen2 [86] and “Qw.” refers to Qwen [3]. “L-2” refers to
LLaMA-2. “L-3” refers to LLaMA-3. “ML.” refers to MobileLLaMA [18, 19]. “DS.” refers to DeepSeek LLM [21]. ∗ For input resolution
and visual tokens, we report the highest supported resolution by the respective models as some models like LLaVA-OneVision [45] and
MM1 [66] use dynamic input resolution. FastVLM models using dynamic resolution employ a simple 2×2 grid, with tile size set to 1024.
†- performance numbers reported from [78]. For VLMs that use multiple vision encoders, the size of each encoder is listed independently,
for TTFT, the latency from each encoder is summed up.

All FastVLM models reported in the paper are trained on a single node with 8× NVIDIA H100-80GB GPUs. Stage 1 training of the VLM is quick, taking roughly 30 minutes to train with a Qwen2-7B decoder. Stage 1.5 and Stage 2 training runs are dependent on input resolution. For an input resolution of 1024×1024, Stage 1.5 takes 77 hours and Stage 2 takes 8 hours. The reported wall-clock times correspond to the following datasets used in these stages: 15 million samples in Stage 1.5 and 1.1 million samples in Stage 2.
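The staged recipe described above can be collected in a small illustrative summary (the structure and field names are ours; values follow the text, the Stage 1 hyperparameters are the ones given for the 2-stage setup, and the Stage 3 module list is an assumption since the text does not spell it out):

```python
# Illustrative summary of the FastVLM training stages described above (names are ours;
# dataset details are in Sec. D of the paper).
TRAINING_STAGES = [
    {"stage": "1",   "purpose": "connector alignment",
     "trainable": ["projector"], "resolution": "backbone pretraining res.",
     "epochs": 1, "batch_size": 256, "lr": 1e-3},
    {"stage": "1.5", "purpose": "resolution scaling (15M samples)",
     "trainable": ["vision encoder", "projector", "LLM"], "resolution": "target res."},
    {"stage": "2",   "purpose": "visual instruction tuning (1.1M-12.5M samples)",
     "trainable": ["vision encoder", "projector", "LLM"], "resolution": "target res."},
    {"stage": "3",   "purpose": "chain-of-thought instruction tuning from [30]",
     "trainable": ["vision encoder", "projector", "LLM"],  # assumed: same modules as Stage 2
     "resolution": "target res."},
]
```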
Evaluation. We evaluate the models on the mainstream benchmarks of GQA [34], ScienceQA [59], TextVQA [74], POPE [48], LLaVA-in-the-wild [54], VQAv2 [29], MMVet [91], MMMU [93], DocVQA [65] and SeedBench [42]. For the GQA, ScienceQA, TextVQA, POPE and LLaVA-in-the-wild benchmarks, we use the official evaluation from LLaVA [54]. For the remaining evaluations we use the lmms-eval [96] library, v0.2.2. We use the default settings for all the evaluations; lmms-eval defaults to the 0613 version of GPT for evaluations that rely on GPT as a judge.

For the ablations presented in Sec. 3, we report GQA, TextVQA, POPE, DocVQA and SeedBench. GQA and SeedBench are general-knowledge benchmarks, DocVQA and TextVQA represent text-rich evaluations, and POPE is a hallucination benchmark. Together these benchmarks provide diversity and are quick to evaluate for ablations. Most importantly, they exhibit lower variance to different initializations and under a probabilistic decoding setting. We report the variance of all the evals for different initializations in Sec. D.3. The standard deviation across the 5 selected metrics is less than 0.5. We call the average of these 5 benchmarks Avg-5, and use it as a reliable signal for our analysis. The empirical standard deviation estimate for Avg-5 is 0.1.
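For clarity, the aggregate used in the ablation tables is simply the unweighted mean of the five scores:

    Avg-5 = (GQA + TextVQA + POPE + DocVQA + SeedBench-I) / 5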
Benchmarking. We benchmark all the models on a MacBook Pro with the M1 Max chip and 32GB RAM. The image encoder is converted to a Core ML package file using coremltools v7.2 and benchmarked on the Neural Engine using Xcode 15.4 (15F31d). The LLM is benchmarked on the MacBook Pro GPU using MLX [31]. The model is first converted using the mlx_lm.convert tool, which converts the models on Hugging Face to the MLX format and casts the tensors to FP16. The prefilling latency is estimated using the mlx_lm.cache_prompt tool [31]. Time-To-First-Token (TTFT) is estimated by adding the image encoding latency at a specific resolution to the LLM prefilling latency for the associated visual tokens.
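The TTFT values reported in Tab. 6 therefore compose two separately measured quantities. A minimal sketch of the bookkeeping (illustrative only; the example numbers are taken from the paper's own tables, and the implied prefill latency is derived by subtraction, not measured here):

```python
def time_to_first_token(vision_encode_ms: float, llm_prefill_ms: float) -> float:
    """TTFT = image encoding latency (Core ML / ANE) + LLM prefill latency (MLX)."""
    return vision_encode_ms + llm_prefill_ms

# Example: FastViTHD at 1024 px encodes in ~235.6 ms (Tab. 4, 256 visual tokens).
# The reported TTFT with a Qwen2-7B decoder is ~641 ms (Tab. 6, R30), which implies
# an LLM prefill latency of roughly 641 - 235.6 ~= 405 ms for that configuration.
print(time_to_first_token(235.6, 405.4))  # ~641 ms
```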
4.1. Comparison with state-of-the-art

In Tab. 6, we compare FastVLM with recently published methods. The training setup can vary widely between works. For each, we report the LLM decoder and the sizes of the instruction tuning and pretraining datasets used to train the respective VLMs, to facilitate a fair comparison.

Hierarchical Backbones. When we compare FastVLM (R20) with ConvLLaVA [28] (R18), with the same LLM and similar training data size, our model obtains +8.4% better performance on TextVQA and +12.5% better performance on DocVQA while being 22% faster. The gap widens at higher resolution, where FastVLM (R28 and R29) achieves superior performance on a wide range of benchmarks while being 2× faster than ConvLLaVA (R26), with the same LLM decoder.

Dataset Scaling. When scaling the pretraining dataset by incorporating an intermediate pretraining stage for resolution scaling with 15M samples, FastVLM (R21) matches or surpasses MM1 [66] (R38) across a wide range of benchmarks. Remarkably, FastVLM achieves this performance while generating 5× fewer visual tokens. With an input resolution of 1024×1024 and a larger instruction tuning dataset of size 12.5M, FastVLM (R41) outperforms MM1 (R38) and LLaVA-NeXT (R39) across various benchmarks, including text-rich evaluations like TextVQA and DocVQA, which are sensitive to input resolution and the number of visual tokens. The gap widens when we further scale up the input resolution using AnyRes; more details are in Sec. C.1. We provide details of the dataset splits in Sec. D.

Multiple Vision Encoders. Recently, MiniGemini [49] and Cambrian-1 [78] introduced models that rely on multiple vision encoders. In Tab. 6, we compare FastVLM (R40), which uses a single vision encoder, with methods that use multiple encoders and are trained on a similarly scaled visual instruction tuning dataset. In Cambrian-1 [78] (R44), vision encoding contributes 3.2× more than LLM prefilling to the total time-to-first-token of approximately 5 seconds (a detailed breakdown is provided in Tab. 10). FastVLM (R40) outperforms Cambrian-1 (R44) when trained on a similar visual instruction tuning dataset, while being 7.9× faster. By scaling the instruction tuning dataset to 12.5M, FastVLM (R41) achieves superior performance over Cambrian-1 (R44) with 2.3× fewer visual tokens, even on text-rich evaluations (see Tab. 11) that are sensitive to the number of visual tokens.

Effect of Decoder. VLM performance also depends on the quality of the LLM, as demonstrated in prior studies like [44]. By switching from Vicuna-7B (R21, R29) to Qwen2 [77, 86] models (R22, R30), we see improvements in performance across all the benchmarks. The improvements are significant on the MMVet, LLaVA-in-the-wild and MMMU benchmarks. With Qwen2-0.5B as the LLM decoder, FastVLM (R4) outperforms LLaVA-OneVision [45] (R2) while being 85× faster. This result underscores the quality of our vision encoder, as both models use the same LLM decoder, while FastViTHD is 3.4× smaller compared to SigLIP-SO400M [94].
Row | Method | Vision Encoder | LLM | #Visual Tokens | Vis. Enc. Size (M)↓ | ChartQA | OCRBench | TextVQA | DocVQA | InfoVQA | MMMU | AI2D | SQA | SeedBench-I
0.5B Model Comparison
R1 SmolVLM2-0.5B [62]† ViT-B SmolLM2 1088∗ 93 62.8 61.0 60.2 70.5 25.5 33.7 59.2 80.0 62.2
R2 FastVLM-0.6B (Ours) FastViTHD Qw.2 256 125 71.4 55.8 65.8 79.1 43.3 33.6 66.0 82.0 69.3
1-3B Model Comparison
R3 SmolVLM2-2.2B [62]† ViT-SO400M SmolLM2 2106∗ 430 68.7 72.9 73.0 80.0 37.8 42.0 70.0 89.6 71.3
R4 MolmoE A1B-7B [22]† ViT-SO400M OLMoE. 1728 430 48.0 54.7 61.5 77.7 - 33.9 71.0 87.5 67.9
R5 MM1.5-1B [95] ViT-H - 1440 632 67.2 60.5 72.5 81.0 50.5 35.8 59.3 82.1 70.2
R6 FlorenceVL-3B [11] DaViT Phi-3 576 770 70.7 63.0 69.1 82.1 51.3 41.8 73.8 84.6 70.6
R7 FastVLM-1.7B (Ours) FastViTHD Qw.2 256 125 69.5 61.2 71.2 84.2 49.6 38.0 76.6 92.9 72.6

Table 7. Comparison with concurrent works. Models are ordered in the descending order of the number of visual tokens. The number
of visual tokens listed for each model is based on the highest supported input resolution for the respective models. ∗ - For the number of
visual tokens generated by SmolVLM2, we estimated it based on the model configs as there is no explicit mention of this information in
the paper. The reported visual tokens correspond to the highest resolution supported by the respective model. † - models are evaluated
using VLMEvalKit [25]. The rest of the models are evaluated using the lmms-eval [96] library. The FastVLM models used in this table are R5 and R13 from Tab. 6.

Concurrent works. In Tab. 7, we compare FastVLM with concurrent works like SmolVLM2 [62] and FlorenceVL [11]. FastVLM (R7) performs competitively against SmolVLM2 (R3), particularly on text-rich evaluations such as ChartQA and TextVQA, while surpassing it on the DocVQA and InfoVQA benchmarks. FastVLM (R7) accomplishes this using 8.2× fewer visual tokens than SmolVLM2 (R3). Additionally, FastVLM (R7) outperforms FlorenceVL (R6) on benchmarks such as TextVQA and DocVQA, while using 2.3× fewer visual tokens and a 6.2× smaller vision encoder. On knowledge-based evaluations like AI2D and ScienceQA, FastVLM (R7) outperforms all other competing models. At the smallest scale, FastVLM (R2) outperforms a similar-sized SmolVLM2 model (R1) on a wide range of benchmarks while using 4.3× fewer visual tokens.

5. Conclusion

In this work, we introduced FastVLM, which leverages the FastViTHD vision backbone for efficient encoding of high-resolution inputs. FastViTHD has a hybrid architecture, is pretrained on reinforced image-text data, and outputs a substantially reduced number of visual tokens with minimal accuracy sacrifice. FastVLM has competitive performance with prior works across a wide range of VLM benchmarks while improving efficiency in both time-to-first-token and the number of parameters in the vision backbone. Rigorous benchmarking on an M1 MacBook Pro demonstrates that FastVLM achieves a state-of-the-art resolution-latency-accuracy trade-off compared to existing works.

Acknowledgement

We thank David Koski, Christopher Webb, Matt Biddulph, Nick Henderson, Megan Maher Welsh, Jamie Cheng, and Jerremy Holland for their valuable contributions to prototyping the demo app. We are also grateful to Angelos Katharopoulos and Awni Hannun for their insightful feedback and suggestions on benchmarking. Additionally, we thank Jen-Hao Rick Chang and Vishwanath Sindagi for their helpful feedback and recommendations.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 2022. 2
[2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023. 2
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 7, 2
Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Dai. Internvl: Scaling up vision foundation models and
Zhou. Qwen-vl: A versatile vision-language model for un- aligning for generic visual-linguistic tasks. arXiv preprint
derstanding, localization, text reading, and beyond. arXiv arXiv:2312.14238, 2023. 2, 3, 4
preprint arXiv:2308.12966, 202k. 2, 7 [16] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang-
[5] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng
Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Luo, Zheng Ma, et al. How far are we to gpt-4v? closing
Introducing our multimodal models, 2023. 2 the gap to commercial multimodal models with open-source
[6] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexan- suites. arXiv preprint arXiv:2404.16821, 2024. 2
der Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, [17] Chenglin Yang et al. MOAT: Alternating mobile convolution
Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele and attention brings strong vision models. In ICLR, 2023. 4
Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda [18] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu,
Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang,
Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisensch- Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and
los, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi strong vision language assistant for mobile devices. arXiv
Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, preprint arXiv:2312.16886, 2023. 3, 7, 2
Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, [19] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu,
Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang
and Xiaohua Zhai. Paligemma: A versatile 3b vlm for trans- Lin, Bo Zhang, et al. Mobilevlm v2: Faster and
fer, 2024. 2 stronger baseline for vision language model. arXiv preprint
[7] Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae arXiv:2402.03766, 2024. 7, 2
Lee. Matryoshka multimodal models. arXiv preprint [20] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat
arXiv:2405.17430, 2024. 3, 6 Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale
[8] Jie ”Cao and Jing” Xiao. ”an augmented benchmark dataset Fung, and Steven Hoi. Instructblip: Towards general-
for geometric question answering through dual parallel text purpose vision-language models with instruction tuning,
encoding”. In ”Proceedings of the 29th International Con- 2023. 2, 3, 7
ference on Computational Linguistics”, ”2022”. 4 [21] DeepSeek-AI. Deepseek llm: Scaling open-source language
[9] Junbum Cha, Wooyoung Kang, Jonghwan Mun, and models with longtermism. arXiv preprint arXiv:2401.02954,
Byungseok Roh. Honeybee: Locality-enhanced projector for 2024. 7, 2
multimodal llm. In Proceedings of the IEEE/CVF Confer- [22] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri-
ence on Computer Vision and Pattern Recognition (CVPR), pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi,
2024. 3 Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu,
[10] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo,
Soricut. Conceptual 12m: Pushing web-scale image-text pre- YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-
training to recognize long-tail visual concepts. In Proceed- Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli
ings of the IEEE/CVF Conference on Computer Vision and VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda,
Pattern Recognition, pages 3558–3568, 2021. 3 Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron
[11] Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jian- Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper
feng Gao, Tianyi Zhou, and Bin Xiao. Florence-vl: Enhanc- Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt,
ing vision-language models with generative vision encoder Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Le-
and depth-breadth fusion. arXiv preprint arXiv:2412.04424, brecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel,
2024. 9 Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Ha-
[12] Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, and jishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kem-
Liang-Chieh Chen. Vitamin: Designing scalable vision bhavi. Molmo and pixmo: Open weights and open data
models in the vision-language era. In Proceedings of the for state-of-the-art multimodal models. arXiv preprint
IEEE/CVF Conference on Computer Vision and Pattern arXiv:2409.17146, 2024. 9
Recognition, 2024. 4, 5, 7, 1, 2 [23] Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang,
[13] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui Huchuan Lu, and Xinlong Wang. Unveiling encoder-free
He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: vision-language models. arXiv preprint arXiv:2406.11832,
Improving large multi-modal models with better captions. 2024. 2
arXiv preprint arXiv:2311.12793, 2023. 7, 2 [24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
[14] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
tokens after layer 2: Plug-and-play inference acceleration for vain Gelly, et al. An image is worth 16x16 words: Trans-
large vision-language models, 2024. 6 formers for image recognition at scale. arXiv preprint
[15] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, arXiv:2010.11929, 2020. 3, 4
Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, [25] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang,
Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang,

10
[26] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023. 3
[27] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023. 4, 5
[28] Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, and Bo Zheng. Convllava: Hierarchical backbones as visual encoder for large multimodal models, 2024. 2, 3, 7, 8
[29] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 8
[30] Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024. 6, 1, 4
[31] Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. MLX: Efficient and flexible machine learning on apple silicon, 2023. 7, 8, 3
[32] Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models, 2024. 3, 6
[33] Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, and Shaohui Lin. Dynamic-llava: Efficient multimodal large language models via dynamic vision-language context sparsification, 2024. 6
[34] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 8, 5
[35] Kushal Kafle, Scott Cohen, Brian Price, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, 2018. 4
[36] Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In International Conference on Machine Learning (ICML), 2024. 3
[37] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016. 4
[38] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision (ECCV), 2022. 4
[39] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv preprint arXiv:2304.02643, 2023. 4
[40] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017. 4
[41] Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon. Building and better understanding vision-language models: insights and future directions, 2024. 4
[42] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 8
[43] Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, 2024. 6, 1, 3
[44] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, 2024. 8
[45] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 2, 7, 8, 3, 4
[46] Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, and Chuang Gan. Flexattention for efficient high-resolution vision-language models. In European Conference on Computer Vision, pages 286–302. Springer, 2025. 7
[47] Kevin Y. Li, Sachin Goyal, Joao D. Semedo, and J. Zico Kolter. Inference optimal vlms need only one visual token but larger models, 2024. 4
[48] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 8
[49] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2023. 8
[50] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023. 2, 7
[51] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014. 4
[52] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 2, 6
[53] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 1, 2, 3, 5, 6, 7, 4
[54] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 2, 6, 8
[55] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 2
[56] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: On the hidden mystery of ocr in large multimodal models, 2024. 4
[57] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3
[58] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024. 7, 2
[59] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 8, 4
[60] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024. 2
[61] Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003, 2024. 2
[62] Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Smolvlm: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025. 9
[63] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, 2022. 4, 5, 6
[64] Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. Infographicvqa, 2021. 4
[65] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 8, 4, 6
[66] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang. Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024. 2, 6, 7, 8, 3
[67] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 4
[68] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2
[69] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 3
[70] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388, 2024. 3, 6
[71] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3
[72] Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In European Conference on Computer Vision (ECCV), 2024. 2
[73] Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. Eagle: Exploring the design space for multimodal llms with mixture of encoders. arXiv:2408.15998, 2024. 3
[74] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 8, 4
[75] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 3
[76] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2024. 2
[77] Qwen Team. Qwen2.5: A party of foundation models, 2024. 2, 8
[78] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024. 2, 3, 7, 8, 4
[79] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. 2
[80] Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Conference on Neural Information Processing Systems (NeurIPS), 2021. 2
[81] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. Mobileone: An improved one millisecond mobile backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 1
[82] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. Fastvit: A fast hybrid vision transformer using structural reparameterization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 2, 3, 4, 1
[83] Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel. Mobileclip: Fast image-text models through multi-modal reinforced training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3, 4, 1
[84] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 2
[85] Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S. Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, and Ran Xu. xgen-mm (BLIP-3): A family of open large multimodal models. CoRR, abs/2408.08872, 2024. 2
[86] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 1, 2, 5, 6, 7, 8, 3
[87] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models, 2024. 6
[88] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models, 2024. 2
[89] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chaoya Jiang, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023.
[90] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023. 2
[91] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 8
[92] Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, and Xinchao Wang. Metaformer baselines for vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 4
[93] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 8, 5
[94] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. International Conference on Computer Vision (ICCV), 2023. 2, 3, 8
[95] Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, and Yinfei Yang. Mm1.5: Methods, analysis & insights from multimodal llm fine-tuning, 2024. 9
[96] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024. 8, 9
[97] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 6
[98] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 1, 2, 3, 6, 7
[99] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 2
FastVLM: Efficient Vision Encoding for Vision Language Models
Supplementary Material
A. Training Setup

For the experiments presented in Tab. 1, Tab. 2, Tab. 4 and Tab. 5, we perform 2-stage training with the hyperparameters listed in Tab. 8. The model is trained for a single epoch in all stages. Note that in Tab. 5 we do not re-train other token pruning works; we simply report the performance of the respective methods, as they adhere to the 2-stage training setup described in Tab. 8, which was originally introduced in LLaVA-1.5 [53].

To showcase our model's performance in the presence of additional data, we scale both the pretraining and instruction tuning datasets in Sec. 4. For the results presented in R15, R19, R20, R27 and R28 in Tab. 6, we still perform the 2-stage training described in Tab. 8; for R20 and R28, we use an instruction tuning dataset of 1.1 million samples in Stage-2. For the results presented in R3, R4, R7, R8, R11, R12, R16, R21, R22, R23, R29, R30, R40 and R41, we scale up both the instruction tuning dataset and the pretraining dataset. We also introduce an additional stage of pretraining with the scaled-up dataset, as described in Tab. 9. Details of the 1.1 million, 6.5 million and 12.5 million instruction tuning datasets are presented in Sec. D. For the results presented in R5, R13 and R42, we introduce additional finetuning on a high-quality instruction tuning dataset from [30] in Stage-3. We publicly release the R4, R12, and R41 checkpoints from Stage-2 training, as well as the R5, R13, and R42 checkpoints from Stage-3 training, as part of our open-sourced codebase.

                    Stage-1           Stage-2
Data                LLaVA-1.5 558K    LLaVA-1.5 665K
Learning Rate       1e-3              2e-5
Batch size          256               128
LR. schedule        cosine decay      cosine decay
LR. warmup ratio    0.03              0.03
Optimizer           AdamW             AdamW
Trainable modules   Projector         Full Model

Table 8. 2-Stage training setup used in ablations for Sec. 3.

                    Stage-1           Stage-1.5                       Stage-2               Stage-3
Data                LLaVA-1.5 558K    Recap-CC3M + Recap-CC12M [43]   1.1M / 6.5M / 12.5M   10.6M
Learning Rate       1e-3              2e-5                            2e-5                  2e-5
Batch size          256               128                             128                   128
LR. schedule        cosine decay      cosine decay                    cosine decay          cosine decay
LR. warmup ratio    0.03              0.03                            0.03                  0.03
Optimizer           AdamW             AdamW                           AdamW                 AdamW
Trainable modules   Projector         Full Model                      Full Model            Full Model

Table 9. 4-Stage training setup. We report performance after Stage-2 and Stage-3 of training in Tab. 6. Please note that both Stage-2 and Stage-3 are visual instruction tuning stages.
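For convenience, the staged recipe in Tabs. 8 and 9 can be summarized programmatically; the sketch below is illustrative only, and its field names are ours rather than those of the released training code.

# Illustrative summary of the staged training recipe in Tabs. 8 and 9 (field names are ours).
STAGES = [
    {"name": "Stage-1",   "data": "LLaVA-1.5 558K",                        "lr": 1e-3, "batch": 256, "trainable": "projector"},
    {"name": "Stage-1.5", "data": "Recap-CC3M + Recap-CC12M",              "lr": 2e-5, "batch": 128, "trainable": "full model"},
    {"name": "Stage-2",   "data": "1.1M / 6.5M / 12.5M instruction tuning", "lr": 2e-5, "batch": 128, "trainable": "full model"},
    {"name": "Stage-3",   "data": "10.6M single-image MammothVL",          "lr": 2e-5, "batch": 128, "trainable": "full model"},
]

for s in STAGES:
    # Every stage runs for a single epoch with AdamW, cosine decay and a 0.03 warmup ratio.
    print(f"{s['name']}: train {s['trainable']} on {s['data']} "
          f"(lr={s['lr']}, batch={s['batch']}, 1 epoch, AdamW, cosine decay, warmup 0.03)")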
B. Architecture Details

The patch embedding layers shown in Fig. 2 consist of 7×7 depthwise convolutions with [81]-style train-time overparameterization, followed by a 1×1 pointwise convolution. The stride of the 7×7 depthwise convolution is set to 2 in order to downsample the input tensor. In [83], squeeze-excite layers were incorporated into this block; however, we found them to negatively impact inference latency, especially for high image resolutions, so we opted not to include them in our model. We use the same ConvFFN layer defined in [82], i.e., 7×7 depthwise convolutions preceding a typical FFN layer. The stem downsamples the input tensor by a factor of 4 on each side, and each patch embedding layer downsamples the input tensor by a factor of 2. Although recent architectures like ViTamin [12] recommend an overall downsampling factor of only 16, FastViTHD incorporates an additional patch embedding layer compared to FastViT, resulting in an overall downsampling factor of 64× for the input tensor. In each stage, we increase the number of channels by a factor of 2, as done in FastViT and other convolutional and hybrid transformer architectures. This results in a Stage-5 with the widest MLP layers in the architecture, performing self-attention on an input tensor that is downsampled by a factor of 64.
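As a quick check of the downsampling arithmetic above, the number of visual tokens for a square input follows directly from the overall 64× spatial reduction; this is a hedged sketch, not the released implementation.

def fastvithd_tokens(resolution: int, downsample: int = 64) -> int:
    # The stem (4x) followed by four 2x patch-embedding layers gives a 64x overall reduction,
    # so the token grid is (resolution / 64) on each side.
    side = resolution // downsample
    return side * side

for res in (256, 768, 1024):
    print(res, "->", fastvithd_tokens(res), "tokens")   # 16, 144 and 256, matching Tabs. 10 and 11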
B.1. Naive Scaling

In order to scale the model size of FastViT, we simply increased the embedding dimensions per stage to [128, 256, 512, 1024] and set the number of layers per stage to [2, 12, 16, 6]. Patch embedding layers in each stage use squeeze-excite layers, and the MLP expansion ratio is set to 3.0, following the design in [83].
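The naively scaled variant can be summarized as a small configuration; this is a sketch for illustration, and the key names are hypothetical rather than taken from the released code.

# Sketch of the naively scaled FastViT configuration described above (key names are hypothetical).
naive_scaled_fastvit = {
    "embed_dims": [128, 256, 512, 1024],  # channel width per stage
    "depths": [2, 12, 16, 6],             # number of layers per stage
    "mlp_ratio": 3.0,                     # MLP expansion ratio, following [83]
    "patch_embed_squeeze_excite": True,   # squeeze-excite in each patch-embedding layer
}
print(naive_scaled_fastvit)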
C. Additional Results

We present the performance of FastVLM on text-rich benchmarks under various training settings in Tab. 11. FastVLM surpasses MM1 and Cambrian-1 across a wide range of benchmarks by scaling up the pretraining and instruction tuning datasets. This result highlights the quality of the visual tokens produced by FastViTHD, as FastVLM achieves these improvements with 2.8× fewer visual tokens than MM1 and with a vision encoder that is 5.1× smaller.

Row Ann. | Method | Vision Encoder | LLM | Input Res. | #Visual Tokens | Vis. Enc. Size (M)↓ | Vision Enc. Latency (ms)↓ | LLM Prefilling (ms)↓
0.5B Model Comparison
R1 nanoLLaVA ViT-SO400M Qw.1.5 384 729 430 272.1 263.3
R2 LLaVAOV [45]∗ ViT-SO400M Qw.2 1152 7290 430 2721.4 11402.4
R3 FastVLM (Ours) FastViTHD Qw.2 1024 256 125 116.3 50.5
R3 FastVLM (Ours)∗ FastViTHD Qw.2 2048 1280 125 581.5 336.4
1-2B Model Comparison
R4 MobileVLMv2 [19] ViT-L/14 ML. 336 144 304 127.4 458
R5 FastVLM (Ours) FastViTHD Qw.2 768 144 125 54.8 97.1
R6 DeepSeekVL [58] ViT-SO400M DS. 384 576 430 272.1 -
R7 MM1 [66]∗ ViT-H - 1344 720 632 - -
R8 FastVLM (Ours) FastViTHD Qw.2 1024 256 125 116.3 116.1
R8 FastVLM (Ours)∗ FastViTHD Qw.2 2048 1280 125 581.5 681.7
7B Model Comparison
R9 InstructBLIP [20] ViT-g/14 Vic. 224 32 1012 149.5 152.1
R11 FastVLM (Ours) FastViTHD Vic. 256 16 125 6.8 143.4
R12 MobileVLMv2 [19] ViT-L/14 Vic. 336 144 304 127.4 332.1
R13 ConvLLaVA [28] ConvNeXT-L Vic. 768 144 200 164.3 332.1
R14 FastVLM (Ours) FastViTHD Vic. 768 144 125 54.8 332.1
R17 FastVLM (Ours) FastViTHD Qw.2 768 144 125 54.8 391.2
R20 ConvLLaVA [28] ConvNeXT-L Vic. 1024 256 200 696.1 461.1
R26 LLaVA-1.5 [53] 336 576 304 127.4 1170.0
R27 MobileVLMv2 [19] ViT-L/14 Vic. 336 576 304 127.4 1170.0
R28 ShareGPT4V [13] 336 576 304 127.4 1170.0
R29 ViTamin [12] ViTamin-L Vic. 384 576 333 137.6 1170.0
R30 ConvLLaVA [28] ConvNeXT-L Vic. 1536 576 200 1569.7 1170.0
R31 VILA [50] ViT-L/14 L-2 336 576 304 127.4 1169.5
R33 MM1 [66]∗ ViT-H - 1344 720 632 - -
R34 LLaVA-NeXT∗ ViT-L/14 L-3 672 2880 304 637.0 19709.7
R21 FastVLM (Ours) FastViTHD Vic. 1024 256 125 116.3 461.1
R36 FastVLM (Ours) FastViTHD Qw.2 1024 256 125 116.3 524.5
R36 FastVLM (Ours)∗ FastViTHD Qw.2 2048 1280 125 581.5 3139.5
VLMs with Multiple Vision Encoders and 8B LLM
R35 MiniGemini-HD | LLM: L-3 | #Visual Tokens: 2880 | LLM Prefilling: 19709.7 ms
    Encoders: ConvNeXT-L (res. 1536, 200M, 1569.7 ms); ViT-L/14 (res. 672, 304M, 552.6 ms)
R36 Cambrian-1 [78] | LLM: L-3 | #Visual Tokens: 576 | LLM Prefilling: 1223.6 ms
    Encoders: ViT-SO400M (res. 384, 430M, 272.1 ms); ConvNeXt-XXL (res. 1024, 846M, 2290.4 ms); DINOv2-ViT-L/14 (res. 518, 304M, 1171.5 ms); ViT-L/14 (res. 336, 304M, 127.4 ms)

Table 10. Breakdown of prefilling latencies for recent methods. The models are grouped based on the total number of visual tokens. For models that were difficult to export or unavailable, we mark the corresponding entries with “-” in the table. “Vic.” refers to Vicuna [98], “Qw.2” refers to Qwen2 [86] and “Qw.” refers to Qwen [3]. “L-2” refers to LLaMA-2. “L-3” refers to LLaMA-3. “ML.” refers to MobileLLaMA [18, 19]. “DS.” refers to DeepSeek LLM [21]. ∗ For input resolution and visual tokens, we report the highest resolution supported by the respective models, as some models like LLaVA-OneVision [45] and MM1 [66] use dynamic input resolution. FastVLM models using dynamic resolution employ a simple 2×2 grid with the tile size set to 1024. For VLMs that use multiple vision encoders, the size of each encoder is listed independently; for TTFT, the latency from each encoder is summed up.

C.1. Dynamic Resolution (AnyRes) Results

From Fig. 6, it is evident that VLMs prefer visual encoding with fewer semantic breaks: variants with more tiles typically underperform those with fewer tiles and a static resolution. To scale up the input resolution, we train variants of FastVLM with support for dynamic input resolution, where we use a tile size of 1024×1024 and a simple 2×2 grid. This enables the model to process a peak input resolution of 2048×2048 using only 4 tiles, unlike models such as InternVL2 [16], which uses roughly 36 tiles to process images of resolution 2688×2688. We report the performance of FastVLM (R4, R5, R9, R10, R27, R28) with dynamic resolution support on text-rich benchmarks in Tab. 11.
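A minimal sketch of the token budget implied by this 2×2 AnyRes setting is shown below; whether a downsized global tile is encoded in addition to the four grid tiles is our assumption, chosen because it is consistent with the 1280 tokens reported for the 2048×2048 setting in Tab. 10 (5 × 256).

TILE = 1024
GRID = (2, 2)
TOKENS_PER_TILE = (TILE // 64) ** 2   # 256 tokens per 1024x1024 tile for FastViTHD

def anyres_token_count(include_global_tile: bool = True) -> int:
    # 2x2 grid tiles, optionally plus one downsized global tile (our assumption).
    tiles = GRID[0] * GRID[1] + (1 if include_global_tile else 0)
    return tiles * TOKENS_PER_TILE

print(anyres_token_count())   # 1280, as reported for the 2048x2048 FastVLM variants in Tab. 10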
C.2. CVBench and MathVista Results

Results in Tab. 12 show that, in comparison to Cambrian-1 [78], FastVLM is significantly better on MathVista [60] and competitive on CVBench [78], even though it uses a single backbone and significantly fewer visual tokens. Results on both CVBench and MathVista further improve as we scale the SFT dataset by including LLaVA-OneVision [45].

D. Datasets

D.1. Pretraining Datasets

For Stage-1 training, we only use the LLaVA-1.5 558K [53] dataset. For Stage-1.5 training, we use the densely captioned versions of CC3M [71] and CC12M [10] introduced in [43]. The total size of this dataset is 15 million image-text pairs. We generated 300 generic questions, such as “What is in this photo?”. For each (image, dense-caption) pair, we randomly selected a generic question to form a triplet of (question, image, dense-caption). With a probability of 0.5, we placed the image's special token <image> either before or after the question. From recent works such as [43, 66, 78] and our results in Tab. 6, scaling the dataset in Stage-1.5 is beneficial for improving VLM performance across a wide range of evaluations. Even though FastViTHD is smaller than the ViT-L/14 and ViT-H encoders used in [43, 66] respectively, we see similar scaling trends.
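A minimal sketch of this sample-construction step is given below, assuming a placeholder list standing in for the 300 generic question templates.

import random

GENERIC_QUESTIONS = ["What is in this photo?", "Describe this image."]  # placeholder for the 300 templates

def make_stage15_sample(image_path: str, dense_caption: str) -> dict:
    # Pick one generic question at random for this (image, dense-caption) pair.
    question = random.choice(GENERIC_QUESTIONS)
    # With probability 0.5, the <image> token precedes the question; otherwise it follows it.
    if random.random() < 0.5:
        prompt = f"<image>\n{question}"
    else:
        prompt = f"{question}\n<image>"
    return {"image": image_path, "prompt": prompt, "response": dense_caption}

print(make_stage15_sample("example.jpg", "A densely captioned description of the scene."))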

Row Ann. | Method | Vision Encoder | LLM | Data (M) (PT+IT) | Input Res. | #Visual Tokens | Vis. Enc. Size (M)↓ | TTFT (ms)↓ | ChartQA | OCRBench | TextVQA | DocVQA | InfoVQA
0.5B Model Comparison
R1 LLaVAOV [45]∗ ViT-SO400M Qw.2 4.5+3.2 1152 7290 430 14124 61.4 - - 70.0 46.3
R2 FastVLM (Ours) FastViTHD Qw.2 15+12.5 1024 256 125 166 63.4 54.9 62.9 70.4 35.8
R3 FastVLM (Ours) FastViTHD Qw.2 15+23.1 1024 256 125 166 71.4 55.8 65.8 79.1 43.3
R4 FastVLM (Ours)∗ FastViTHD Qw.2 15+12.5 2048 1280 125 918 68.8 59.0 65.4 82.1 49.3
R5 FastVLM (Ours)∗ FastViTHD Qw.2 15+23.1 2048 1280 125 918 64.2 55.7 65.5 85.3 53.9
1-2B Model Comparison
R6 MM1 [66]∗ ViT-H - 3000+1.5 1344 720 632 - 61.8 56.6 68.2 68.4 38.5
R7 FastVLM (Ours) FastViTHD Qw.2 15+12.5 1024 256 125 233 69.6 62.9 69.0 75.6 41.7
R8 FastVLM (Ours) FastViTHD Qw.2 15+23.1 1024 256 125 233 69.5 61.2 71.2 84.2 49.6
R9 FastVLM (Ours)∗ FastViTHD Qw.2 15+12.5 2048 1280 125 1263 76.4 63.2 71.5 87.6 60.0
R10 FastVLM (Ours)∗ FastViTHD Qw.2 15+12.5 2048 1280 125 1263 71.2 62.4 72.9 90.2 62.5
7B Model Comparison
R11 MM1 [66]∗ ViT-H - 3000+1.5 1344 720 632 - 72.6 62.6 72.8 76.8 45.5
R12 LLaVA-NeXT†∗ ViT-L/14 L-3 - 672 2880 304 20347 69.5 49.0 64.6 72.6 -
R13 Cambrian-1 [78] | LLM: L-3 | Data: 2.5+7 | #Visual Tokens: 576 | TTFT: 5085 ms | ChartQA 73.3 | OCRBench 62.4 | TextVQA 71.7 | DocVQA 77.8 | InfoVQA -
    Encoders: ViT-L/14 (res. 336, 304M); ViT-SO400M (res. 384, 430M); ConvNeXt-XXL (res. 1024, 846M); DINOv2-ViT-L/14 (res. 518, 304M)
R14 FastVLM (Ours) FastViTHD Vic. 0.5+0.6 768 144 125 387 17.1 30.0 62.9 32.9 28.7
R15 FastVLM (Ours) FastViTHD Vic. 0.5+1.1 768 144 125 387 59.1 38.4 67.5 57.3 29.7
R16 FastVLM (Ours) FastViTHD Vic. 15+1.1 768 144 125 387 65.4 45.3 69.4 65.5 32.0
R17 FastVLM (Ours) FastViTHD Qw.2 15+1.1 768 144 125 446 69.3 45.9 69.5 66.9 34.3
R18 FastVLM (Ours) FastViTHD Qw.2 15+11.9 768 144 125 446 74.2 59.0 72.8 72.0 44.3
R19 FastVLM (Ours) FastViTHD Vic. 0.5+0.6 1024 256 125 577 19.2 29.3 64.4 35.6 28.9
R20 FastVLM (Ours) FastViTHD Vic. 0.5+1.1 1024 256 125 577 61.0 38.3 67.4 62.8 32.0
R21 FastVLM (Ours) FastViTHD Vic. 15+1.1 1024 256 125 577 66.9 47.1 70.6 72.4 34.7
R22 FastVLM (Ours) FastViTHD Qw.2 15+1.1 1024 256 125 641 71.0 49.7 72.1 73.3 37.5
R23 FastVLM (Ours) FastViTHD Qw.2 15+6.5 1024 256 125 641 76.6 52.9 73.1 78.7 44.2
R24 FastVLM (Ours) FastViTHD Qw.2 15+11.9 1024 256 125 641 77.0 63.3 74.8 78.9 49.7
R25 FastVLM (Ours) FastViTHD Qw.2 15+12.5 1024 256 125 641 77.5 65.7 73.4 82.7 51.2
R26 FastVLM (Ours) FastViTHD Qw.2 15+23.1 1024 256 125 641 78.1 67.3 76.1 88.0 55.1
R27 FastVLM (Ours)∗ FastViTHD Qw.2 15+12.5 2048 1280 125 3721 82.4 67.3 76.6 92.3 68.3
R28 FastVLM (Ours)∗ FastViTHD Qw.2 15+23.1 2048 1280 125 3721 74.6 65.7 77.1 94.5 71.1

Table 11. Comparison with recent methods on text-rich benchmarks. The models are grouped based on the total number of visual tokens. “-” indicates that performance was not reported in the respective paper. For the dataset column, “-” indicates that the dataset size for pretraining (“PT”) or instruction tuning (“IT”) is not explicitly mentioned in the respective paper. For methods that have more than 2 stages of training, we report the total samples used for all the pretraining stages as part of “PT”. “TTFT” means time to first token (the sum of the vision encoder latency and the LLM prefilling time); we report latency only for models that are publicly available and in a format favorable to MLX [31]. “Vic.” refers to Vicuna [98], “Qw.2” refers to Qwen2 [86]. “L-3” refers to LLaMA-3. * For input resolution and visual tokens, we report the highest resolution supported by the respective models that use dynamic input resolution. † Performance numbers reported from [78]. For VLMs that use multiple vision encoders, the size of each encoder is listed independently; for TTFT, the latency from each encoder is summed up.
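The TTFT values in Tabs. 10 and 11 decompose as described above; a minimal sketch:

def ttft_ms(encoder_latencies_ms, llm_prefill_ms):
    # Time to first token: vision-encoder latency (summed over encoders for
    # multi-encoder VLMs) plus the LLM prefilling time.
    return sum(encoder_latencies_ms) + llm_prefill_ms

# FastVLM with Qwen2-7B at 1024x1024 (Tab. 10): 116.3 ms encoding + 524.5 ms prefilling,
# consistent with the ~641 ms TTFT reported for the corresponding configuration in Tab. 11.
print(round(ttft_ms([116.3], 524.5), 1))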

Method | LLM Decoder | Data (M) (PT+IT) | Input Res. | #Visual Tokens | Latency Enc. (ms) | CVBench 2D | CVBench 3D | MathVista
Cambrian-1 | LLama3-8B | 2.5+7 | Mult. | 576 | 3861.4 | 72.3 | 72.0 | 49.0
FastVLM | Qwen2-7B | 15+6.5 | 1024 | 256 | 116.3 | 71.8 | 69.6 | 57.6
FastVLM | Qwen2-7B | 15+11.9 | 768 | 144 | 54.8 | 75.9 | 79.3 | 64.6
FastVLM | Qwen2-7B | 15+11.9 | 1024 | 256 | 116.3 | 76.7 | 80.9 | 64.8

Table 12. Evaluation on CVBench and MathVista. The “6.5M” instruction tuning dataset is from Cambrian-1. The “11.9M” instruction tuning dataset is the concatenation of the Cambrian-1 and LLaVA-OneVision datasets.
D.2. Visual Instruction Tuning Datasets

We use three different versions of instruction tuning datasets. The smallest scale is the LLaVA-1.5 665K dataset [53]. We further scale up this dataset by including the training splits of the following datasets: AI2D [37], ScienceQA [59], ChartQA [63], COCO [51], DocVQA [65], DVQA [35], GeoQA+ [8], OCRVQA [67], SegmentAnything [39], SynthDoG-EN [38], TextVQA [74] and Visual Genome [40]. The conversational data for the listed datasets is sourced from [15]. The total number of samples in this dataset is 1.1 million, and it is referred to as “1.1M” in all the tables. We further scale up the instruction tuning dataset using image-based conversational data from Cambrian-7M [78], which amounts to 5.4 million samples. The filtered Cambrian-7M [78] is merged with the “1.1M” dataset to obtain the “6.5M” instruction tuning dataset. We then append all available single-image instruction tuning data open-sourced by LLaVA-OneVision [45] to “6.5M” to obtain the “11.9M” instruction tuning dataset. We then include roughly 0.6M samples from the DocMatix [41] dataset to obtain the “12.5M” instruction tuning dataset. From Tab. 6, we see further improvements on VLM benchmarks when the instruction tuning dataset is scaled, following trends exhibited by image encoders much bigger than FastViTHD. The best-performing models are achieved when the Stage-2 models are further fine-tuned on the MammothVL instruction tuning dataset [30]. Specifically, we filter the corpus to include only single-image instruction tuning data, resulting in “10.6M” samples.
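The naming of these mixtures follows simple arithmetic over the constituent corpora; the sketch below reproduces the approximate sizes, where the LLaVA-OneVision single-image count is inferred from the reported totals rather than stated explicitly above.

# Approximate sizes (in millions of samples) of the instruction tuning mixtures described above.
llava_15     = 1.1                      # "1.1M": LLaVA-1.5 665K plus additional training splits
cambrian_it  = 5.4                      # filtered image-based conversations from Cambrian-7M
mix_6_5m     = llava_15 + cambrian_it   # "6.5M"
onevision_si = 5.4                      # single-image LLaVA-OneVision data (inferred from the totals)
mix_11_9m    = mix_6_5m + onevision_si  # "11.9M"
docmatix     = 0.6
mix_12_5m    = mix_11_9m + docmatix     # "12.5M"
print(round(mix_6_5m, 1), round(mix_11_9m, 1), round(mix_12_5m, 1))   # 6.5 11.9 12.5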

D.3. Evaluations

In addition to the evaluations listed in Sec. 4, we report the performance of FastVLM on ChartQA [63], OCRBench [56] and InfoVQA [64] to compare FastVLM against recent methods on text-rich benchmarks. In Tab. 13, we report the performance of the FastViT model (with architectural interventions) from multiple training runs and compute the standard deviation of the metrics reported in Tab. 6. As described in Sec. 4, for ablations we are interested in benchmarks that are quick to evaluate and exhibit lower variance across different initializations. From Tab. 13, GQA, TextVQA, POPE, DocVQA and SeedBench fit the criteria. While VQAv2 also exhibits low variance, it is substantially larger and takes a long time to evaluate. The standard deviation across the selected metrics is below 0.5, so we use the average of these metrics as a reliable indicator for our analysis in Sec. 3.

        GQA     SQA     TextVQA  POPE   LLaVA-BenchW  MMVet  VQAv2  DocVQA  SeedBench-I
Run 1   62.69   64.25   60.71    85.8   59.4          29.6   77.27  27.57   53.31
Run 2   62.68   64.95   60.61    86.1   60.1          31.6   77.39  28.37   53.55
Run 3   62.69   65.64   60.68    85.3   61.4          31.1   77.31  28.26   53.46
Std.    0.0047  0.57    0.041    0.33   0.83          0.85   0.049  0.35    0.099

Table 13. VLM benchmarks across three independent runs with a frozen FastViT image encoder. The training setup is LLaVA-1.5 with Vicuna-7B as the LLM. The standard deviation across runs is listed in the bottom row.
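The selection criterion can be reproduced directly from the numbers in Tab. 13; the sketch below uses the population standard deviation, which approximately reproduces the Std row.

import statistics

# Three runs per benchmark, copied from Tab. 13.
RUNS = {
    "GQA":          [62.69, 62.68, 62.69],
    "SQA":          [64.25, 64.95, 65.64],
    "TextVQA":      [60.71, 60.61, 60.68],
    "POPE":         [85.8, 86.1, 85.3],
    "LLaVA-BenchW": [59.4, 60.1, 61.4],
    "MMVet":        [29.6, 31.6, 31.1],
    "VQAv2":        [77.27, 77.39, 77.31],
    "DocVQA":       [27.57, 28.37, 28.26],
    "SeedBench-I":  [53.31, 53.55, 53.46],
}

# Keep benchmarks with std < 0.5; VQAv2 is dropped separately because it is slow to evaluate.
stable = [k for k, v in RUNS.items() if statistics.pstdev(v) < 0.5 and k != "VQAv2"]
avg5 = statistics.mean(statistics.mean(RUNS[k]) for k in stable)
print(stable)            # ['GQA', 'TextVQA', 'POPE', 'DocVQA', 'SeedBench-I']
print(round(avg5, 2))    # average of the five selected benchmarks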

E. Qualitative Analysis

We analyzed failures across benchmarks and found the following. In text-rich benchmarks (e.g., DocVQA, ChartQA), failures occur when the text is too small or precise alignment is needed (e.g., reading tables). In the majority of cases where the text is too small, simply using a higher input resolution can reduce errors, as shown in Tab. 15. Some cases require broader general knowledge to obtain a correct response, as seen in Tab. 14; in such cases, using a bigger LLM reduces failures. Some cases require reasoning about a higher-resolution image, in which case using a bigger LLM can decrease failures, as seen in Tab. 16. Some failures result from misjudgment, where correct responses are misclassified by the LLM judge, or from incorrect labels; we ignore these cases in our analysis.

When to Use a Larger LLM

Dataset: MMMU [93]
User: What is the common term for the yellow area surrounding the site of an infection? Options: ["I don't know and I don't want to guess", "Corona", "Border", "Halo", "Toxin zone"]
Ground Truth: D
FastVLM-0.5B @ 256: E ✗
FastVLM-0.5B @ 1024: E ✗
FastVLM-1.5B @ 256: D ✓
FastVLM-1.5B @ 1024: D ✓

Dataset: MMMU [93]
User: The sinoatrial (SA) node is indicated by _____. Options: ["A", "B", "C", "D", "E"]
Ground Truth: A
FastVLM-0.5B @ 256: E ✗
FastVLM-0.5B @ 1024: D ✗
FastVLM-1.5B @ 256: A ✓
FastVLM-1.5B @ 1024: A ✓

Table 14. Cases that require broader general knowledge, in which case a larger LLM is preferred.

When to Use Higher Resolution

Dataset: GQA [34]
User: What is sitting inside the bowls?
Ground Truth: squash
FastVLM-0.5B @ 256: Sculpture ✗
FastVLM-0.5B @ 1024: Squash ✓
FastVLM-1.5B @ 256: Potato ✗
FastVLM-1.5B @ 1024: Squash ✓

Dataset: ChartQA [63]
User: What was the highest expenditure on foreign military aid in 2009/10?
Ground Truth: 3781
FastVLM-0.5B @ 256: Germany ✗
FastVLM-0.5B @ 1024: 3781 ✓
FastVLM-1.5B @ 256: 275 ✗
FastVLM-1.5B @ 1024: 3781 ✓

Table 15. Cases where increasing the resolution is sufficient to obtain a better response.

When to Use Higher Resolution and a Larger LLM

Dataset: ChartQA [63]
User: What was the value of the commercial property market in 2016?
Ground Truth: 883
FastVLM-0.5B @ 256: 10000 ✗
FastVLM-0.5B @ 1024: 871 ✗
FastVLM-1.5B @ 256: 1100 ✗
FastVLM-1.5B @ 1024: 883 ✓

Dataset: DocVQA [65]
User: Under which department 'Stockroom' is organized?
Ground Truth: Research Service Department
FastVLM-0.5B @ 256: Department of Chemistry ✗
FastVLM-0.5B @ 1024: Library ✗
FastVLM-1.5B @ 256: Research Department ✗
FastVLM-1.5B @ 1024: Research Service Department ✓

Table 16. Cases where increasing the resolution alone is not sufficient to obtain a better response. A larger LLM in combination with higher resolution is required to obtain a correct response.
