
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
Han Zhao¹,² *, Min Zhang¹ *, Wei Zhao², Pengxiang Ding², Siteng Huang³, Donglin Wang²†
¹ Zhejiang University  ² Westlake University  ³ DAMO Academy, Alibaba Group

* These authors contributed equally.  † Corresponding author.
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
arXiv:2403.14520v4 [cs.CV] 8 Jan 2025

Abstract

In recent years, applying multi-modal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, MLLMs comprise the well-known Transformer network, which has a less efficient quadratic computation complexity. In this paper, we introduce Cobra, a multi-modal large-scale language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability concerning sequence length. Specifically, Cobra involves replacing Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning visual and textual modalities and integrating various pre-trained Mamba model variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3×∼4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2, and its performance is significantly enhanced thanks to the linear sequential modeling. (ii) Cobra fine-tunes a small number of parameters (∼48% of the model parameters), leading to a significant improvement in overall performance compared to LLaVA. The project page is available at: https://sites.google.com/view/cobravlm.

1 Introduction

Multi-modal large language models (MLLMs) have recently achieved impressive results across a variety of downstream tasks, including multi-modal content generation (Lu et al. 2022; Wu et al. 2024), vision-based question answering (OpenAI 2023; Gao et al. 2023; Liu et al. 2023b,c), and embodied intelligence (Brohan et al. 2023; Kim et al. 2024; Ding et al. 2024). By effectively aligning pre-trained large language models with visual modalities, MLLMs have demonstrated a strong ability to comprehend and navigate complex visual-language contexts. These advancements not only highlight the versatility of MLLMs but also pave the way for further research into more nuanced and sophisticated applications. For example, as MLLMs continue to evolve, there is potential for significant improvements in areas such as real-time interaction in dynamic environments, cross-modal retrieval tasks, and the seamless integration of language and visual processing in everyday technology.

MLLMs typically rely on the well-known Transformer network as a foundational model for many downstream tasks. However, due to their quadratic computational complexity, Transformer networks are often less efficient, making it challenging to meet the demands of application scenarios that require high real-time performance and are suitable for edge deployment. As shown in Figure 1 (d), the MoE-LLaVA (Lin et al. 2024) model generates only 20.33 tokens per second, indicating low processing efficiency. Despite these challenges, there remains a significant demand for MLLMs in such areas. Therefore, the ability to deploy MLLMs that support fast inference with low resource utilization is particularly crucial.

Figure 1: Overview of Cobra. (a) Our innovative integration of a vision encoder with the efficient Mamba language model significantly enhances the reasoning efficiency of MLLMs. (b) Cobra demonstrates competitive performance on general MLLM benchmark tests. (c) Cobra generates accurate textual descriptions (e.g., green text indicates a correct answer), outperforming current state-of-the-art MLLMs that produce inaccurate answers (shown in red text) while maintaining rapid reasoning speeds. (d) A comparison of our proposed Cobra and the baselines in terms of the number of tokens generated per second.

Traditional approaches have primarily focused on improving the efficiency of MLLMs by reducing model capacity or compressing the length of the visual context while generally maintaining the Transformer architecture within the language model (Zhu et al. 2024; Zhou et al. 2024; Zhang et al. 2024; Chu et al. 2023, 2024; Lin et al. 2024). For example, LLaVA-Phi (Zhu et al. 2024) builds a multi-modal base model using a small-scale Phi-2 as the core language model. MobileVLM (Chu et al. 2023, 2024) utilizes MobileLLaMA as its base model, training a series of smaller-scale language models based on the LLaMA architecture. These methods aim to enhance the inference speed of MLLMs by reducing the size of the language models. Although this approach improves efficiency, it often comes at the expense of significantly reduced model performance: in Figure 1 (b), MobileVLM v2-3B (Chu et al. 2024) has the worst performance on all MLLM benchmarks compared to the other models.

In this paper, our primary goal is to enhance the inference speed of multi-modal large language models (MLLMs) while ensuring their performance remains uncompromised. To achieve this, we propose Cobra, which integrates the Mamba large language model and exploits the linear scalability of state-space modeling (SSM). This approach effectively addresses the quadratic computational complexity inherent in traditional Transformer architectures. Specifically, Cobra consists of three key components: a vision encoder that concatenates DINOv2 (Oquab et al. 2024) and SigLIP (Zhai et al. 2023) features, a projector that maps visual features to the language embedding space, and the Mamba LLM backbone, as shown in Figure 1 (a) and Figure 2. In Figure 1 (c), Cobra performs 3×∼4× faster than MobileVLM v2-3B. Interestingly, Cobra can generate more accurate responses, as highlighted by the text in green, effectively mitigating the hallucination problem commonly seen in MLLMs. Even compared to the much larger LLaVA v1.5 model with 7 billion parameters, Cobra still performs comparably on several specific benchmarks with about 48% of the parameters.

Our main contributions are summarized as follows:
• We observe that existing MLLMs typically rely on Transformer networks, which exhibit quadratic computational complexity. To address this inefficiency, we present Cobra, a novel MLLM with linear computation.
• Our research investigates various multi-modal fusion strategies to enhance the integration of visual and language information within the Mamba LLM. Extensive experiments evaluate the effectiveness of different fusion approaches.
• Extensive experiments are conducted to evaluate the performance of Cobra in comparison to concurrent studies aimed at improving the computational efficiency of foundational MLLMs. Notably, Cobra-3.5B even achieves comparable performance to LLaVA with fewer parameters, underscoring its efficiency. Cobra-8B surpasses the LLaVA v1.5 model of similar size on all tested benchmarks, achieving an average accuracy improvement of approximately 6%, while remaining faster in inference than MobileVLM v2-3B.

2 Related Works

2.1 Multi-modal Large Language Models

Building on the success of large language models (LLMs), numerous extensions have been developed to apply LLMs to multi-modal tasks, integrating information from multiple sources such as text, images, and audio to enable comprehensive understanding and reasoning across different modalities (Chu et al. 2023; Liu et al. 2023b; Taori et al. 2023; Bai et al. 2023; Alayrac et al. 2022; Awadalla et al. 2023; Liu et al. 2023c; Chen et al. 2023b). These models leverage vast amounts of data and intricate architectures to achieve state-of-the-art performance in tasks such as image captioning (Ke et al. 2019), visual question answering (Antol et al. 2015) and cross-modal retrieval (Hendriksen et al. 2023). Recent advances have harnessed the formidable reasoning power of LLMs such as LLaMA (Touvron et al. 2023) and Vicuna (Chiang et al. 2023). However, a notable commonality among existing MLLMs is their reliance on the Transformer backbone to model dependencies among sequential tokens. Despite the Transformer network's exceptional capability in capturing relationships within data, its quadratic computational complexity presents a significant drawback, particularly when dealing with large-scale language models.

To mitigate this problem, several studies have been proposed to present more compact and efficient MLLMs (Zhu et al. 2024; Zhou et al. 2024; Zhang et al. 2024; Chu et al. 2023, 2024). For example, LLaVA-Phi (Zhu et al. 2024) builds a multi-modal foundation model taking the small-scale Phi-2 as the LLM. MobileVLM (Chu et al. 2023, 2024)
introduces MobileLLaMA as the base model, training a family of small-scale LLMs based on the LLaMA architecture. However, these methods obtain efficient MLLMs by using smaller-scale LLMs, which significantly reduces performance while increasing the speed of inference.

2.2 State Space Models

State space models (SSMs) have demonstrated highly promising performance across various tasks, including long-range sequence modeling (Smith, Warrington, and Linderman 2023; Hasani et al. 2022), image generation (Yan, Gu, and Rush 2023; Bellagente et al. 2024) and reinforcement learning (Bar-David et al. 2023; Lu et al. 2023). One of the key advantages of SSMs is their flexibility, as they can be formulated as recurrent neural networks (RNNs) for efficient inference or as models capable of processing entire input sequences in parallel, enabling more efficient training.

Recently, a new selective SSM structure called Mamba (Gu and Dao 2023) has been introduced, which is regarded as a strong competitor to the Transformer architecture. Compared to LLMs of similar capacity, Mamba-based language models (Dao and Gu 2024; Waleffe et al. 2024) demonstrate competitive performance with a distinct advantage: their inference speeds scale linearly with sequence length while maintaining constant memory usage. This efficiency allows Mamba to handle long contexts and perform inference more effectively. In contrast, Transformer-based models face challenges such as GPU memory consumption and computation time that grow quadratically with sequence length (Katharopoulos et al. 2020). In this paper, we conduct an in-depth exploration of extending Mamba-based LLMs into practical and efficient MLLMs. Through extensive experiments, we examine the distinctive characteristics of Mamba MLLMs and develop efficient training strategies to significantly enhance their performance.

3 Methodology

This section introduces the preliminary concepts of state space models (Section 3.1). Then we describe the details of our proposed Cobra (Section 3.2), which mainly includes the vision encoder, the projector, and the Mamba LLM.

3.1 Preliminaries

Traditional state space models (SSMs) (Gu, Goel, and Ré 2022; Smith, Warrington, and Linderman 2023) are characterized by the parameters (∆, A, B, C). Given a continuous-time scalar input signal x(t), the SSM can be described by the following ordinary differential equation:

  h'(t) = A h(t) + B x(t),    (1a)
  y(t) = C h(t),              (1b)

where the parameters of the SSM, A ∈ R^{N×N}, B ∈ R^{N×1} and C ∈ R^{1×N}, represent constant matrices. The variables h, x, and y are continuous-time functions of time t, representing the hidden state, input, and output.

In practice, the SSMs operate in a discretized form to handle input sequences, and we use a mixer layer that constructs an SSM for each input channel independently. ∆ is a time-scale parameter that helps transform A and B into discrete-time parameters Ā and B̄, respectively. The discretization rule for Ā and B̄ with the zero-order hold is as follows:

  Ā = exp(∆A),                               (2a)
  B̄ = (∆A)^{-1} (exp(∆A) − I) · ∆B.          (2b)

Thus, the structured SSM can be summarized as the following recurrence form in Equations (3a) and (3b):

  h_k = Ā h_{k−1} + B̄ x_k,    (3a)
  y_k = C h_k.                 (3b)

The model can also be written in the convolution form (4a), (4b) to process the sequence in parallel:

  K̄ = (C B̄, C Ā B̄, ..., C Ā^k B̄, ...),    (4a)
  y = x ∗ K̄.                                (4b)

Based on the structured SSM, the selective SSM (Gu and Dao 2023) is further introduced to endow the model with the ability to selectively propagate or forget information according to the sequential input tokens. Specifically, the selective SSM achieves these functions by making the parameters (A, B, C) depend on the input x, which significantly enhances the model's expressive capacity. Gu et al. (Gu and Dao 2023) proposed a hardware-aware algorithm called selective scan to allow efficient implementation of the model.
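To make the zero-order-hold discretization (Eqs. 2a-2b) and the recurrent form (Eqs. 3a-3b) concrete, the following is a minimal NumPy sketch of a single-channel structured SSM. The function names and parameter shapes are illustrative assumptions rather than the paper's implementation; in particular, Mamba's selective SSM additionally makes its parameters input-dependent and relies on a hardware-aware parallel scan rather than this sequential loop.

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, delta):
    """Zero-order-hold discretization (Eqs. 2a-2b).

    A: (N, N) state matrix, B: (N, 1) input matrix, delta: scalar time-scale.
    Returns the discrete-time matrices (A_bar, B_bar).
    """
    n = A.shape[0]
    A_bar = expm(delta * A)                                              # Eq. (2a)
    B_bar = np.linalg.solve(delta * A, A_bar - np.eye(n)) @ (delta * B)  # Eq. (2b)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Recurrent form (Eqs. 3a-3b): h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_k in x:                    # one step per token: linear in sequence length
        h = A_bar @ h + B_bar * x_k  # state update with a constant-size hidden state
        ys.append((C @ h).item())    # scalar read-out
    return np.array(ys)

# Toy usage: a 4-dimensional state and a length-8 scalar input sequence.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # a (roughly) stable state matrix
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
A_bar, B_bar = discretize(A, B, delta=0.1)
print(ssm_scan(A_bar, B_bar, C, rng.standard_normal(8)).shape)   # -> (8,)
```

The sequential loop is what gives constant memory per generated token at inference time, while the equivalent convolution form (Eqs. 4a-4b) is what allows the whole training sequence to be processed in parallel.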
Figure 2: Cobra model architecture. Given an image observation and a language instruction, the model generates the corresponding answer. The architecture consists of three key components: (1) a vision encoder that concatenates DINOv2 (Oquab et al. 2024) and SigLIP (Zhai et al. 2023) features, (2) a projector that maps visual features to the language embedding space, and (3) the Mamba LLM backbone, a Mamba 2.8B- or 7B-parameter large language model (Gu and Dao 2023; Mercat et al. 2024).

3.2 Cobra Model

To accomplish the purpose of building a multi-modal large language model (MLLM) that is capable of receiving visual information, we introduce Cobra as illustrated in Figure 2. Cobra consists of three key components: a vision encoder, a projector, and a Mamba LLM backbone. We present the implementation details for each component below.

• Vision encoder: We fuse DINOv2 (Oquab et al. 2024) and SigLIP (Zhai et al. 2023) as our vision backbone. The intuition is that combining the visual representations that capture low-level spatial properties from DINOv2 with the semantic properties provided by SigLIP further improves performance on downstream tasks (Tong et al. 2024; Karamcheti et al. 2024). Considering an image X_v ∈ R^{C×H×W} as input, the vision encoder splits the image into N_v = HW/P² same-size patches, where P² is the patch size. Both vision encoders take the patchified image as an input token sequence, and the channel-wise concatenation of the outputs of the two encoders is extracted as the compact visual representation R_v ∈ R^{N_v×(D_DINOv2+D_SigLIP)}:

  R_v = [φ_DINOv2(X_v); φ_SigLIP(X_v)],    (5)

for a subsequent task-specific head, where D_DINOv2 and D_SigLIP denote the dimensions of the tokens produced by the two encoders.

• Projector: The projector is a simple learnable module that aligns the vision and text features by transforming the dimension of the original visual representation to the dimension of the tokens in the Mamba language model:

  H_v = ϕ(R_v).    (6)

We introduce two implementations of the projector in Cobra to map visual tokens into the same latent space as the language tokens. A multi-layer perceptron (MLP) can be used to merge information from different modalities. In addition, the lightweight downsample projector suggested by (Chu et al. 2024) is also tested to achieve a greater reduction in computation cost.

• Mamba backbone: The Mamba backbone is a stack of multiple identical basic blocks, each consisting of a short convolution, an SSM module, a residual connection, and RMSNorm (Zhang and Sennrich 2019). The model receives the concatenation of the visual embeddings produced by the projection layer and the text embeddings, denoted as H ∈ R^{L_in×D}, and transforms this sequence into the target token sequence Y = {y_i}_{i=1}^{L} in an auto-regressive manner:

  p(Y | H_v, H_q) = ∏_{i=1}^{L} p(y_i | H_v, H_q, y_{<i}).    (7)

Lastly, the tokens are detokenized into the response answer in natural language.

Vision Encoder: DINOv2 + SigLIP ViT-SO
LLM init.: Mamba-2.8b-Zephyr / Mamba-7B
Projector init.: Random
Image resolution: 384 × 384
Image token num.: 729
Global batch size: 128
Training steps: 19K
Optimizer: AdamW
LR schedule: Cosine decay
Learning Rate: 2e-5
Weight decay: 0.1
Warm-up ratio: 0.03
Number of epochs: 2
Table 1: The configuration of models and hyperparameters.

3.3 Training Recipe

Recent research (Karamcheti et al. 2024) suggests that the pre-alignment phase may be unnecessary in the LLaVA-based training paradigm (Liu et al. 2023b; Chu et al. 2024) (i.e., training only the pre-alignment phase of the projection layer and then fine-tuning the large language model (LLM), for one epoch each); it has been observed that even after fine-tuning, the model remains underfitted. In light of this, we chose to eliminate the pre-alignment stage and instead directly fine-tune the entire LLM backbone and the projector. This fine-tuning process spans two epochs, with random sampling conducted on a combined dataset comprising:

1. The mixed dataset used in LLaVA v1.5, which contains a total of 655K visual multi-turn conversations including academic VQA (Goyal et al. 2017; Hudson and Manning 2019; Krishna et al. 2016; Singh et al. 2019) samples, as well as visual instruction tuning data in LLaVA-Instruct (Liu et al. 2023c) and pure text instruction tuning data in ShareGPT (ShareGPT 2023).

2. LVIS-Instruct-4V (Wang et al. 2023), which contains 220K images with visually aligned and context-aware instructions generated by GPT-4V.

3. LRV-Instruct (Liu et al. 2023a), a 400K visual instruction dataset that covers 16 vision-and-language tasks aimed at mitigating hallucination.

Overall, the entire dataset contains approximately 1.2 million images, corresponding multi-turn dialogue data, and pure text dialogue data.
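Before moving to the experiments, here is a minimal PyTorch-style sketch of how the three components described in Section 3.2 fit together at inference time (Eqs. 5-7). All module names (`dinov2`, `siglip`, `mamba_lm`) and the `inputs_embeds` keyword are placeholders for whatever pre-trained backbones and interfaces are actually loaded; this is a structural sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class CobraSketch(nn.Module):
    """Structural sketch: fused vision encoders -> projector -> Mamba LLM backbone."""

    def __init__(self, dinov2, siglip, mamba_lm, d_dino, d_siglip, d_model):
        super().__init__()
        self.dinov2, self.siglip, self.mamba_lm = dinov2, siglip, mamba_lm
        # Eq. (6): a simple MLP projector from the concatenated visual feature
        # dimension to the LLM token dimension.  (An LDPv2-style projector would
        # additionally pool the 729 patch tokens down to 196.)
        self.projector = nn.Sequential(
            nn.Linear(d_dino + d_siglip, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, image, text_embeddings):
        # Eq. (5): channel-wise concatenation of the two encoders' patch tokens,
        # giving R_v of shape (batch, N_v, D_DINOv2 + D_SigLIP).
        r_v = torch.cat([self.dinov2(image), self.siglip(image)], dim=-1)
        h_v = self.projector(r_v)                      # (batch, N_v, d_model)
        # Visual embeddings are prepended to the text embeddings; the combined
        # sequence H is decoded auto-regressively by the Mamba backbone (Eq. 7).
        h = torch.cat([h_v, text_embeddings], dim=1)   # (batch, L_in, d_model)
        return self.mamba_lm(inputs_embeds=h)          # logits over the vocabulary
```

At generation time the prompt (visual plus text embeddings) is consumed once in a parallel forward pass, after which each new token only updates the constant-size SSM state; this is the source of the linear scaling discussed in Section 2.2.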
Model LLM Backbone Res. VQA-v2 GQA VizWiz TextVQA VSR POPE
Large Scale MLLMs
OpenFlamingo (Awadalla et al. 2023) MPT-7B 336 52.7 - 27.5 33.6 - -
BLIP-2 (Li et al. 2023a) Vicuna-13B 224 - 41.0 19.6 42.5 50.9 -
MiniGPT-4 (Zhu et al. 2023) Vicuna-7B 224 32.2 - - - - -
InstructBLIP (Dai et al. 2023) Vicuna-7B 224 - 49.2 34.5 50.1 54.3 -
InstructBLIP (Dai et al. 2023) Vicuna-13B 224 - 49.5 33.4 50.7 52.1 -
Shikra (Chen et al. 2023a) Vicuna-13B 224 77.4 - - - - -
IDEFICS (Laurençon et al. 2023) LLaMA-7B 224 50.9 - 35.5 25.9 - -
IDEFICS (Laurençon et al. 2023) LLaMA-65B 224 60.0 - 36.0 30.9 - -
Qwen-VL (Bai et al. 2023) Qwen-7B 448 78.2 59.3 35.2 63.8 - -
LLaVA v1.5 (Liu et al. 2023b) Vicuna-7B 336 78.5 62.0 50.0 58.2 51.5 85.9
Cobra-8B (ours) Mamba-7B 384 79.2 63.9 56.2 59.5 62.9 87.6
Small Scale MLLMs
MoE-LLaVA (Lin et al. 2024) StableLM-1.6B 336 76.7 60.3 36.2 50.1 - 85.7
MoE-LLaVA (Lin et al. 2024) Phi2-2.7B 384 79.9 62.6 43.7 57.0 - 85.7
LLaVA-Phi (Zhu et al. 2024) Phi2-2.7B 336 71.4 - 35.9 48.6 - 85.0
MobileVLM v2 (Chu et al. 2024) MobileLLaMA-2.7B 336 - 61.1 - 57.5 - 84.7
Cobra-3.5B (ours) Mamba-2.8B 384 77.8 62.3 49.7 58.2 58.4 88.4
Table 2: Experiments of four open-ended benchmarks (blue) and two closed-set benchmarks (red). Res. represents the image
resolution used for the vision encoder input. The best performance is highlighted in bold and the second-best result is underlined.

4 Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed Cobra method, aiming to answer the following questions: RQ1: How does the performance of our proposed Cobra method compare with state-of-the-art MLLMs? (Section 4.2) RQ2: How does the inference speed of Cobra compare to three Transformer-based baselines? (Section 4.3) RQ3: How effective is the proposed Cobra in different settings (i.e., the ablation study)? (Section 4.4)

4.1 Experimental Setup

Datasets. We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016).

VQA-v2 (Goyal et al. 2017) evaluates models' general ability to understand and reason about images and questions. GQA (Hudson and Manning 2019) assesses spatial understanding and multi-step inference in real-world images. VizWiz (Gurari et al. 2018) is similar to VQA-v2 but includes a series of unanswerable questions. TextVQA (Singh et al. 2019) focuses on reasoning from text in images. VSR (Liu, Emerson, and Collier 2023) is composed of demanding True/False questions that probe individual spatial relationships within various scenes, which is challenging for MLLMs. POPE (Li et al. 2023b) is comprised of specific Yes/No questions designed to evaluate MLLMs' tendency to generate hallucinations. RefCOCO focuses on short descriptions with spatial anchors, RefCOCO+ relies on appearance-based descriptions, and RefCOCOg emphasizes long and rich descriptions (Kazemzadeh et al. 2014; Yu et al. 2016).

Baseline methods. We compare Cobra to a large number of algorithms that span different sizes, including (1) large-scale MLLMs: OpenFlamingo (Awadalla et al. 2023), BLIP-2 (Li et al. 2023a), MiniGPT-4 (Zhu et al. 2023), InstructBLIP (Dai et al. 2023), Shikra (Chen et al. 2023a), IDEFICS (Laurençon et al. 2023), Qwen-VL (Bai et al. 2023) and LLaVA v1.5 (Liu et al. 2023b); (2) small-scale MLLMs: MoE-LLaVA (Lin et al. 2024), LLaVA-Phi (Zhu et al. 2024) and MobileVLM v2 (Chu et al. 2024).

Implementation details. Our training process includes multi-modal instruction tuning, during which we fine-tune both the multi-modal projector and the Mamba LLM. The model is trained using 8 NVIDIA A100 80GB GPUs. We have selected various open-source model weights, including Mamba with 2.8 billion and 7 billion parameters, to serve as the LLM backbone of our model. The model configurations and hyperparameters are detailed in Table 1, with additional information provided in the supplementary material.

Prompt order. In our prompt template design, we were surprised to discover that the word order in the templates significantly impacts the model's performance, particularly on TextVQA. For example, Cobra, which follows the LLaVA and InstructBLIP evaluations, uses the tokens parsed by the OCR system as prompts, formatted as "Question\n Reference OCR token: ...". We found that this specific prompt structure reduced performance substantially, from 47.9% to 43.0%, compared to not providing any OCR tokens at all. Through extensive experimental exploration, we addressed this issue by adjusting the prompt order to "Reference OCR token: ...\n Question", which improved performance. This sensitivity to prompt order may be due to an inductive bias of the RNN model. We hope that our findings will encourage further research in the community on this problem.
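For concreteness, the two TextVQA prompt orderings compared above differ only in where the OCR string is placed; a small illustrative sketch (the function and variable names are hypothetical, not the evaluation code):

```python
def build_textvqa_prompt(question: str, ocr_tokens: str, ocr_first: bool = True) -> str:
    """Assemble a TextVQA prompt in either of the two orders compared in Section 4.1."""
    if ocr_first:
        # "OCR First": the ordering that improved Cobra's accuracy.
        return f"Reference OCR token: {ocr_tokens}\n{question}"
    # "OCR Last": the LLaVA/InstructBLIP-style ordering, which degraded Cobra's accuracy.
    return f"{question}\nReference OCR token: {ocr_tokens}"
```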
Model LLM Backbone Total Params Visual Tokens Eval_avg (tokens/s) T_total (s)
MoE-LLaVA Phi-2-2.7B 5.3B (3.6B Activated) 576 20.33 12.59
LLaVA-Phi Phi-2-2.7B 3.1B 576 40.89 6.26
MobileVLM v2 MobileLLaMA-2.7B 3.1B 144 49.50 5.17
Cobra-3.5B (ours) Mamba-2.8B 3.5B 729 166.47 1.54
Cobra-LDPv2-3.5B (ours) Mamba-2.8B 3.5B 196 166.85 1.53
Cobra-8B (ours) Mamba-7B 7.8B 729 79.92 3.20
Table 3: Latency comparison of small-scale MLLMs with ∼3 billion parameters.
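The throughput figures in Table 3 follow Eval_avg = 256 / T_total, where T_total spans everything from image encoding to the final generated token (see Section 4.3). Below is a hedged sketch of such a measurement, in which `generate_fn` stands in for whichever model's generation call is being timed:

```python
import time

def measure_throughput(generate_fn, image, prompt="Describe the image specifically",
                       max_new_tokens=256):
    """Return (tokens_per_second, total_seconds) for a single generation, as in Table 3."""
    start = time.perf_counter()
    generate_fn(image, prompt, max_new_tokens=max_new_tokens)  # image encoding + decoding
    t_total = time.perf_counter() - start
    return max_new_tokens / t_total, t_total
```

If the model runs on a GPU, the timed region should additionally be bracketed with torch.cuda.synchronize() so that asynchronous kernel launches do not skew T_total.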

Model        RefCOCO  RefCOCO+  RefCOCOg  Avg.
LLaVA v1.5   55.1     49.5      50.9      51.8
Cobra-3.5B   52.7     45.6      46.9      48.4
Cobra-8B     58.2     52.5      54.4      55.0
Table 4: Experiments on three visual grounding benchmarks. Avg. represents the average accuracy of the model on the three benchmarks. The best performance is highlighted in bold.

4.2 Overall Performance

In Table 2, we report the overall performance of Cobra and fourteen baselines, grouped into large-scale and small-scale MLLMs, on six datasets. According to Table 2, we have the following findings: (1) In the large-scale MLLM setting with more than 7 billion parameters, our proposed Cobra-8B achieves the best performance on all evaluated benchmarks. (2) In the small-scale MLLM setting with around 3 billion (total or activated) parameters, Cobra-3.5B achieves the best performance on all benchmarks except VQA-v2 and GQA, where it is only surpassed by MoE-LLaVA, a multi-modal mixture-of-experts language model expanded and fine-tuned from Phi-2-2.7B. Our model lags behind by 1%-2% in accuracy on these two metrics, while our inference speed is over 8 times faster than that model, as shown in Section 4.3.

It is noteworthy that Cobra-3.5B, with only 48% of the total parameters of LLaVA v1.5-7B, achieves comparable results on the open-ended VQA benchmarks and shows significant improvements on the challenging closed-set prediction tasks of VSR and POPE. On these two benchmarks, there are performance improvements of 6.9% and 2.5%, respectively.

As shown in Table 4, we also evaluated the localization capabilities of our two models alongside LLaVA v1.5-7B. The results indicate that Cobra-3.5B has accuracy rates that are 3%-4% lower than LLaVA v1.5-7B across all three benchmarks. In contrast, Cobra-8B exhibits the highest accuracy among the three models, outperforming the others by over 3% in accuracy on all benchmarks. Given that the training schemes for Cobra were identical, these results demonstrate that the grounding ability of the model is significantly influenced by the performance of the language model itself.

4.3 Inference Speed

We evaluated the generation speed of our model compared to three Transformer-based baseline models of similar activated parameter scales with different architectures: MoE-LLaVA, LLaVA-Phi, and MobileVLM v2.

In the evaluation, all models received the same example image. We used the same question "Describe the image specifically" as the textual prompt and set the number of output tokens to 256 for all models. The total time T_total from image encoding to finishing the generation of the complete answer is recorded, and we calculated the average number of tokens generated per second as Eval_avg = 256 / T_total. All the evaluations were done on hardware with a single NVIDIA A100 PCIe 80GB GPU. The results in Table 3 show that our model has a significant advantage in inference speed compared to Transformer-based models. Compared to MobileVLM v2, which has undergone several lightweight optimizations, Cobra only took about 30% of the time to complete inference, even though the number of visual tokens it processes is significantly larger.

We also evaluated Cobra-LDPv2, a variant of our model that replaces the projector with an LDPv2 block, which reduces the number of visual tokens per image to 196. The results showed no significant speed improvement under our evaluation method. Due to the nature of parallel RNN models, the number of prompt tokens only affects the speed of the model's first parallel forward pass. Given that LDP significantly compresses visual information through pooling, it can impact the performance of the MLLM to some extent (see our ablation studies for a performance comparison). We believe that for Cobra and other RNN-based MLLMs, adopting such a lightweight design for the projector may be unnecessary.

4.4 Ablation Studies

We conduct ablation studies to verify the network design of Cobra, mainly involving the choice of projectors, vision encoders, language models, and training strategies.

Vision encoders. Recent works find that although CLIP-like language-image models offer rich semantics, they tend to lose detailed information about the images themselves. Therefore, we adopt DINOv2 as a supplementary encoder and concatenate the visual representations from the two encoders for the subsequent LLM. As shown in Table 5, the fusion of DINOv2 and SigLIP features leads to better performance compared with SigLIP-only on all benchmarks except TextVQA. In particular, we found that the fused architecture significantly improves accuracy by 5%-6% on VSR and the localization benchmarks. This result implies that there is a meaningful principle when selecting the vision encoder for downstream tasks.
Model VQA-v2 GQA VizWiz TextVQA VSR POPE RefCOCO RefCOCO+ RefCOCOg
Cobra-3.5B 77.8 62.3 49.7 58.2 58.4 88.4 52.7 45.6 48.9
w/ SigLIP 77.5 (0.3 ↓) 61.8 (0.5 ↓) 48.3 (1.4 ↓) 58.8 (0.6 ↑) 53.2 (5.2 ↓) 88.2 (0.2 ↓) 46.7 (6.0 ↓) 40.1 (5.5 ↓) 43.8 (5.1 ↓)
w/ LDPv2 76.2 (1.6 ↓) 61.9 (0.4 ↓) 50.2 (0.5 ↑) 54.7 (3.5 ↓) 56.1 (2.3 ↓) 87.7 (0.7 ↓) 50.3 (2.4 ↓) 42.9 (2.7 ↓) 46.9 (2.0 ↓)
w/ Base 77.8 (0.0 ↕) 62.7 (0.4 ↑) 47.2 (2.5 ↓) 57.9 (0.3 ↓) 54.4 (4.0 ↓) 89.0 (0.6 ↑) 52.2 (0.5 ↓) 45.6 (0.0 ↕) 48.6 (0.3 ↓)
w/ 1 Ep FT 76.5 (1.3 ↓) 60.9 (1.4 ↓) 48.5 (1.2 ↓) 57.5 (0.7 ↓) 53.8 (4.6 ↓) 88.1 (0.3 ↓) 42.5 (10.2 ↓) 34.3 (11.3 ↓) 39.0 (9.9 ↓)
w/ PT+FT 75.7 (2.1 ↓) 60.4 (1.9 ↓) 44.2 (5.5 ↓) 58.0 (0.2 ↓) 51.6 (6.8 ↓) 86.9 (1.5 ↓) 37.3 (15.4 ↓) 29.7 (15.9 ↓) 34.3 (14.6 ↓)

Table 5: Ablation studies of Cobra-3.5B on vision encoders, projectors, language models and training strategies
Visual input examples

Example 1 (Spatial Reasoning)
User: Is the bicycle parked to the right of the dog?
Cobra-3.5B (ours): The bicycle is actually parked on the left side of the dog, not the right.
LLaVA v1.5-7B: Yes, the bicycle is parked on the right side of the dog.

Example 2 (Scene Description)
User: What's going on in this image?
Cobra-3.5B (ours): A monkey is holding two knives in its hands, while a man in the background is wearing a white shirt with a floral pattern.
LLaVA v1.5-7B: In this image, a monkey is holding two knives in its hands, seemingly posing for a picture.

Table 6: Visualization of the MLLM examples. Cobra generates accurate and more detailed textual descriptions compared with the baseline, where green indicates a correct answer, red indicates an inaccurate answer, and blue marks a more detailed description.

Projectors. Besides, different choices of projection layer are used in the experiments. We investigate a lightweight downsample projector (LDPv2) to see if we can further speed up the inference process without obvious deterioration in performance. Applying LDPv2 to Cobra harms the performance on all benchmarks except VizWiz. Unfortunately, we observed that the models using LDPv2 show a significant decrease in accuracy on TextVQA, VSR, and the localization benchmarks, which require precise visual information.

Base or instruct-tuned LLMs. We also explored the application of different Mamba LLMs. Specifically, we chose a base model that had not been fine-tuned on any chat datasets. As indicated in Table 5, the fine-tuned model achieved notably higher accuracy on the VizWiz and VSR benchmarks compared to the pre-trained model that did not utilize chat corpora, with accuracy improvements of 2.5% and 4%, respectively. On the other benchmarks, the differences were not significant, with accuracy gaps within 1%. The chat model exhibits a slight disadvantage compared to the base model only on the GQA and POPE benchmarks.

Training strategies. Different training strategies were investigated. The results show that fine-tuning the language model for two epochs yields strictly better performance on all evaluated benchmarks compared with the model fine-tuned for only one epoch. This suggests that the model may be underfitted with only one epoch of training.

Additionally, we discovered that initializing a pre-aligned projector during the fine-tuning stage actually harms the model's performance, resulting in consistently lower accuracy across all benchmarks (except TextVQA) compared to a model with a randomly initialized projector, when both models are fine-tuned for one epoch. This conclusion differs from several approaches that treat pre-alignment as the first stage of training (Lin et al. 2024; Chu et al. 2024).

Visualization of the MLLM example. We visualize some examples to demonstrate the performance. In Table 6, Cobra outperforms LLaVA v1.5 in the first example, which involves the judgment of spatial relationships. Cobra correctly identified that the dog was located to the right of the bicycle, whereas LLaVA provided the opposite, incorrect answer. In the second example, Cobra offered a more detailed description of the background information compared with LLaVA. More examples are shown in the supplementary material.

5 Conclusion

In this study, we propose Cobra, which addresses the efficiency bottleneck of existing multi-modal large language models (MLLMs) that rely on Transformer networks with quadratic computational complexity. We explore combining language models that have linear computational complexity with multi-modal inputs. In terms of fusing visual and linguistic information, we have optimized the internal information integration of the Mamba language model through in-depth research on different modality fusion schemes, achieving more effective multi-modal representations. Experiments demonstrate that Cobra not only significantly improves computational efficiency, but also performs comparably to advanced models like LLaVA, especially excelling in overcoming visual hallucination and in spatial relationship judgment, while significantly reducing the number of parameters. This opens up new possibilities for deploying high-performance AI models in environments that require high-frequency processing of visual information, such as vision-based robotic feedback control, in the future.
6 Acknowledgments

This work was supported by the National Science and Technology Innovation 2030 - Major Project (Grant No. 2022ZD0208800), and NSFC General Program (Grant No. 62176215).

References

Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; et al. 2022. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198.
Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, 2425–2433.
Awadalla, A.; Gao, I.; Gardner, J.; Hessel, J.; Hanafy, Y.; et al. 2023. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv:2308.01390.
Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; et al. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966.
Bar-David, S.; Zimerman, I.; Nachmani, E.; and Wolf, L. 2023. Decision S4: Efficient Sequence-Based RL via State Spaces Layers. arXiv:2306.05167.
Bellagente, M.; Tow, J.; Mahan, D.; Phung, D.; Zhuravinskyi, M.; Adithyan, R.; Baicoianu, J.; Brooks, B.; Cooper, N.; Datta, A.; Lee, M.; Mostaque, E.; Pieler, M.; Pinnaparju, N.; Rocha, P.; Saini, H.; Teufel, H.; Zanichelli, N.; and Riquelme, C. 2024. Stable LM 2 1.6B Technical Report. arXiv:2402.17834.
Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X.; et al. 2023. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818.
Chen, K.; Zhang, Z.; Zeng, W.; Zhang, R.; Zhu, F.; et al. 2023a. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv:2306.15195.
Chen, L.; Li, J.; Dong, X.; Zhang, P.; He, C.; Wang, J.; et al. 2023b. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. arXiv:2311.12793.
Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
Chu, X.; Qiao, L.; Lin, X.; Xu, S.; Yang, Y.; Hu, Y.; et al. 2023. MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices. arXiv:2312.16886.
Chu, X.; Qiao, L.; Zhang, X.; Xu, S.; Wei, F.; Yang, Y.; et al. 2024. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv:2402.03766.
Cui, G.; Yuan, L.; Ding, N.; Yao, G.; Zhu, W.; Ni, Y.; et al. 2023. UltraFeedback: Boosting Language Models with High-quality Feedback. arXiv:2310.01377.
Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; et al. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500.
Dao, T.; and Gu, A. 2024. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv:2405.21060.
Ding, N.; Chen, Y.; Xu, B.; Qin, Y.; Zheng, Z.; Hu, S.; et al. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv:2305.14233.
Ding, P.; Zhao, H.; Song, W.; Zhang, W.; Zhang, M.; et al. 2024. QUAR-VLA: Vision-Language-Action Model for Quadruped Robots. arXiv:2312.14457.
Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; et al. 2023. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv:2304.15010.
Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. arXiv:1612.00837.
Gu, A.; and Dao, T. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
Gu, A.; Goel, K.; and Ré, C. 2022. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv:2111.00396.
Gurari, D.; Li, Q.; Stangl, A. J.; Guo, A.; Lin, C.; et al. 2018. VizWiz Grand Challenge: Answering Visual Questions from Blind People. arXiv:1802.08218.
Hasani, R.; Lechner, M.; Wang, T.-H.; Chahine, M.; Amini, A.; and Rus, D. 2022. Liquid Structural State-Space Models. arXiv:2209.12951.
Hendriksen, M.; Vakulenko, S.; Kuiper, E.; and de Rijke, M. 2023. Scene-centric vs. object-centric image-text cross-modal retrieval: a reproducibility study. In European Conference on Information Retrieval, 68–85. Springer.
Hudson, D. A.; and Manning, C. D. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506.
Karamcheti, S.; Nair, S.; Balakrishna, A.; Liang, P.; Kollar, T.; et al. 2024. Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models. arXiv:2402.07865.
Katharopoulos, A.; Vyas, A.; Pappas, N.; and Fleuret, F. 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236.
Kazemzadeh, S.; Ordonez, V.; Matten, M.; and Berg, T. L. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In Conference on Empirical Methods in Natural Language Processing.
Ke, L.; Pei, W.; Li, R.; Shen, X.; and Tai, Y.-W. 2019. Reflective decoding network for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8888–8897.
Kim, M. J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; et al. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; et al. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv:1602.07332.
Laurençon, H.; Saulnier, L.; Tronchon, L.; Bekman, S.; Singh, A.; et al. 2023. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. arXiv:2306.16527.
Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023a. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597.
Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, W. X.; et al. 2023b. Evaluating Object Hallucination in Large Vision-Language Models. arXiv:2305.10355.
Lin, B.; Tang, Z.; Ye, Y.; Cui, J.; Zhu, B.; Jin, P.; et al. 2024. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. arXiv:2401.15947.
Liu, F.; Emerson, G.; and Collier, N. 2023. Visual Spatial Reasoning. arXiv:2205.00363.
Liu, F.; Lin, K.; Li, L.; Wang, J.; Yacoob, Y.; et al. 2023a. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning. arXiv:2306.14565.
Liu, H.; Li, C.; Li, Y.; and Lee, Y. J. 2023b. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744.
Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023c. Visual Instruction Tuning. arXiv:2304.08485.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101.
Lu, C.; Schroecker, Y.; Gu, A.; Parisotto, E.; Foerster, J.; et al. 2023. Structured State Space Models for In-Context Reinforcement Learning. arXiv:2303.03982.
Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; and Kembhavi, A. 2022. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. arXiv:2206.08916.
Mercat, J.; Vasiljevic, I.; Keh, S.; Arora, K.; Dave, A.; Gaidon, A.; and Kollar, T. 2024. Linearizing Large Language Models. arXiv:2405.06640.
OpenAI. 2023. GPT-4V(ision) System Card.
Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; et al. 2024. DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193.
Penedo, G.; Malartic, Q.; Hesslow, D.; Cojocaru, R.; Cappelli, A.; et al. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv:2306.01116.
Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C. D.; and Finn, C. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
ShareGPT. 2023. https://sharegpt.com/.
Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; et al. 2019. Towards VQA Models That Can Read. arXiv:1904.08920.
Smith, J. T. H.; Warrington, A.; and Linderman, S. W. 2023. Simplified State Space Layers for Sequence Modeling. arXiv:2208.04933.
Soboleva, D.; Al-Khateeb, F.; Myers, R.; Steeves, J. R.; Hestness, J.; and Dey, N. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama.
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Tong, S.; Liu, Z.; Zhai, Y.; Ma, Y.; LeCun, Y.; and Xie, S. 2024. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. arXiv:2401.06209.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
Waleffe, R.; Byeon, W.; Riach, D.; Norick, B.; Korthikanti, V.; et al. 2024. An Empirical Study of Mamba-based Language Models. arXiv:2406.07887.
Wang, J.; Meng, L.; Weng, Z.; He, B.; Wu, Z.; et al. 2023. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning. arXiv:2311.07574.
Wightman, R. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models.
Wu, S.; Fei, H.; Qu, L.; Ji, W.; and Chua, T.-S. 2024. NExT-GPT: Any-to-Any Multimodal LLM. arXiv:2309.05519.
Yan, J. N.; Gu, J.; and Rush, A. M. 2023. Diffusion Models Without Attention. arXiv:2311.18257.
Yu, L.; Poirson, P.; Yang, S.; Berg, A. C.; and Berg, T. L. 2016. Modeling Context in Referring Expressions. arXiv:1608.00272.
Zhai, X.; Mustafa, B.; Kolesnikov, A.; and Beyer, L. 2023. Sigmoid Loss for Language Image Pre-Training. arXiv:2303.15343.
Zhang, B.; and Sennrich, R. 2019. Root Mean Square Layer Normalization. arXiv:1910.07467.
Zhang, P.; Zeng, G.; Wang, T.; and Lu, W. 2024. TinyLlama: An Open-Source Small Language Model. arXiv:2401.02385.
Zhao, Y.; Gu, A.; Varma, R.; Luo, L.; Huang, C.-C.; et al. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277.
Zhou, B.; Hu, Y.; Weng, X.; Jia, J.; Luo, J.; Liu, X.; et al. 2024. TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv:2402.14289.
Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592.
Zhu, Y.; Zhu, M.; Liu, N.; Ou, Z.; Mou, X.; and Tang, J. 2024. LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv:2401.02330.

7 Appendix

The appendix is organized as follows:
• We provide more detailed implementation details, including modality processing, model architecture, inference setup, and hardware systems, in Section 7.1.
• We demonstrate the significant impact of prompt order on Cobra, as mentioned in Section 4.1 of the main document, and present additional experiments in Section 7.2.
• We present more examples of Cobra in terms of generation quality and its ability to overcome visual hallucinations in Section 7.3.

7.1 Implementation Details

Modality processing. We utilize the default image transformations provided by torchvision and TIMM (Wightman 2019) to implement all image processing operations. We naively resize all images to a resolution of 384 × 384 and normalize pixel values according to the defaults defined by each pre-trained backbone, which often adhere to the traditional ImageNet defaults. We extract patch features from the penultimate layer, as done in other MLLM methods (Liu et al. 2023c).

Large language model (LLM). The LLM backbone is initialized with the pre-trained weights of the Mamba chat model. We have chosen various open-source model weights, including Mamba models with 2.8 billion and 7 billion parameters, as the LLM backbone of our proposed model. The Mamba-2.8B model [1] was pre-trained on the SlimPajama dataset (Soboleva et al. 2023) consisting of 627 billion tokens; we also evaluate a model [2] that underwent supervised fine-tuning on the UltraChat-200k dataset (Ding et al. 2023), as well as direct preference optimization (Rafailov et al. 2023) on the UltraFeedback dataset (Cui et al. 2023). The Mamba-7B model [3] is a base model, which was pre-trained on the RefinedWeb (Penedo et al. 2023) dataset with 1.2T tokens and was not fine-tuned on any chat dataset.

Prompt template. To maintain consistency with the instruction template of the pre-trained Mamba chat model, our prompt format follows the subsequent format:

<|user|>
X_instruct^1 <|endoftext|>
<|assistant|>
X_answer^1 <|endoftext|>
<|user|>
X_instruct^2 <|endoftext|>
<|assistant|>

For other base models that were not fine-tuned on a chat dataset, we use the following prompt template:

In: X_instruct^1
Out: X_answer^1 <|endoftext|>
In: X_instruct^2
Out:

The text form of the prompt is processed by the same tokenizer that GPT-NeoX uses to obtain tokens, which are then passed through an embedding layer to obtain continuous embeddings. The embeddings obtained from passing the image through the encoder are directly concatenated to the beginning of the embedding sequence. This is then input into the Mamba model to start generating answers.

Evaluation. We fork the vlm-evaluation [4] tool as our evaluation tool on the benchmarks.

Hardware. For experiments with models at the 2.8-billion-parameter scale, the whole training process of a single model takes about 26.5 hours on 8 NVIDIA A100 80GB GPUs. During the training process, we use the PyTorch Fully Sharded Data Parallel (Zhao et al. 2023) framework and enable automatic mixed precision with FP32 and BF16 for distributed training. The batch size is set to 128. We employ the AdamW (Loshchilov and Hutter 2019) optimizer with a cosine-decay learning rate to update the network parameters and set the learning rate to 2 × 10^-5, with a decay factor of 0.1 and a warm-up ratio of 0.03. The model is trained for 2 epochs via supervised fine-tuning.

[1] https://huggingface.co/state-spaces/mamba-2.8b-slimpj
[2] https://huggingface.co/xiuyul/mamba-2.8b-zephyr
[3] https://huggingface.co/TRI-ML/mamba-7b-rw
[4] https://github.com/TRI-ML/vlm-evaluation

Model        OCR First  OCR Last  w/o OCR tokens
LLaVA v1.5   -          58.2      46.1
Cobra-3.5B   58.2       43.0      47.9
w/ SigLIP    58.8       47.3      49.3
w/ LDPv2     54.7       44.7      40.3
w/ Base      57.9       47.6      47.9
w/ 1 Ep FT   57.5       45.4      46.4
w/ PT+FT     58.0       47.4      46.6
Cobra-8B     59.5       43.0      50.7
Table 7: Additional Results of TextVQA.

7.2 Additional Evaluation on TextVQA

Here, we present the experimental results of all our models on the TextVQA dataset, constructed according to the description of prompt order in Section 4.1. As shown in Table 7, "OCR First" represents placing the Reference OCR tokens before the question, while "OCR Last" involves presenting the question first and then the Reference OCR tokens, which is also the default prompt format for LLaVA. It can be observed that all models perform significantly better when the OCR tokens are placed before the question, resulting in an accuracy improvement of over 10%. In the "OCR Last" mode, the performance of most models is even lower than that without OCR tokens as prompts (except "w/ LDPv2" and "w/ PT+FT"). We attribute this phenomenon to the inherent inductive bias of the RNN models.

7.3 More Examples

Here, we provide more examples generated by Cobra-3.5B and LLaVA v1.5-7B, which are shown in Tables 8–10.
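As a small illustration of the chat-style template shown in Section 7.1, the following sketch assembles a multi-turn prompt string before the visual embeddings are prepended; build_chat_prompt is a hypothetical helper for illustration, not the released preprocessing code.

```python
def build_chat_prompt(turns):
    """turns: list of (instruction, answer) pairs; the final answer may be None.

    Mirrors the <|user|> / <|assistant|> template of the pre-trained Mamba chat model;
    the image embeddings are concatenated in front of the resulting token embeddings.
    """
    parts = []
    for instruction, answer in turns:
        parts.append(f"<|user|>\n{instruction} <|endoftext|>\n<|assistant|>\n")
        if answer is not None:
            parts.append(f"{answer} <|endoftext|>\n")
    return "".join(parts)

# Example: a single open question, leaving the assistant turn for the model to complete.
print(build_chat_prompt([("What is unusual about this image?", None)]))
```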
User What is unusual about this image?
Cobra (ours) The unusual aspect of this image is that a man is ironing clothes while standing on the back of a yellow
SUV parked on a city street. This is not a typical scene, as people usually iron clothes indoors, in a
designated ironing area, or at home. The man’s choice to iron clothes on the back of a vehicle in a busy
city street adds an element of surprise and humor to the image.
LLaVA v1.5 The unusual aspect of this image is that a man is ironing clothes while standing on the back of a moving
car. This is not a typical scene, as ironing clothes is usually done indoors, in a stationary position, and
not while the car is in motion. The man’s ability to perform this task while standing on the back of a car
is both unconventional and potentially dangerous, as it could lead to accidents or injuries.

Table 8: Additional Example 1. Cobra generates more detailed textual descriptions compared with the baseline, where blue
represents a more detailed description.

User What’s going on in this image?


Cobra (ours) The image showcases a serene landscape featuring a calm body of water surrounded by lush greenery.
There are trees on the left and right sides of the water, and a few rocks are scattered near the water’s
edge. The sky above is clear, and the overall ambiance of the painting is tranquil and picturesque.
LLaVA v1.5 The image features a serene scene of a river surrounded by a lush forest. The river is calm, and the
water appears to be green, giving it a tranquil atmosphere. The forest is filled with trees of various sizes,
creating a picturesque landscape.
In the foreground, there is a bench situated near the water, inviting visitors to sit and enjoy the view. A
bird can be seen perched on a branch, adding to the natural beauty of the scene. The overall setting is
perfect for relaxation and appreciation of nature.

Table 9: Additional Example 2. Cobra produced more accurate responses compared to the baseline, which suffered from severe
hallucinations in describing the image, where green indicates a correct answer, red produces inaccurate answers.
User What are the things I should be cautious about when I visit here?
Cobra (ours) When visiting this serene and picturesque location with a wooden dock extending over a calm lake,
surrounded by a forest, and mountains in the background, you should be cautious about the following:
1. Weather conditions: The image shows a cloudy day, which might indicate potential changes in
weather, such as rain or storms. It’s essential to be prepared for any sudden changes in weather con-
ditions, such as bringing rain gear or checking the weather forecast before visiting.
2. Water safety: The calmness of the lake might make it seem safe for swimming or other water activities,
but it’s important to be aware of the water currents, potential underwater hazards, and the presence of
any wildlife.
3. Wildlife: The forest surrounding the lake is home to various animals, such as birds, fish, and mammals.
Be cautious not to disturb their natural habitat and respect their space.
4. Navigation: If you plan to use the dock or any nearby boats, make sure you have the necessary skills
and experience to safely operate them.
5. Accessibility: The dock is located in a remote area, so it might be difficult to access in case of
emergencies or if you need assistance. Make sure you have a reliable means of transportation to and
from the location.
By being cautious and prepared, you can enjoy a peaceful and safe visit to this beautiful location.
LLaVA v1.5 When visiting the pier over the lake, there are a few things you should be cautious about. First, ensure
that you have appropriate footwear with good traction, as the pier may be wet or slippery, especially if
it has been raining or if there is algae on the surface. Second, be mindful of the weather conditions, as
the image shows a cloudy day, which could indicate potential rain or storms. It is essential to stay safe
and avoid being caught in a sudden downpour or storm. Lastly, be aware of your surroundings and the
potential for wildlife, such as birds or fish, that might be present in the area. This will help you to enjoy
your visit while minimizing any risks or disturbances to the local ecosystem.

Table 10: Additional Example 3. Compared with the baseline, Cobra produced responses that were clearer, more organized,
and significantly more detailed.
