TouchStone: Evaluating Vision-Language Models by Language Models

Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang,
Junyang Lin, Xinggang Wang, Chang Zhou†, Jingren Zhou

Alibaba Group, Huazhong University of Science and Technology, Tsinghua University
arXiv:2308.16890v1 [cs.CV] 31 Aug 2023

September 1, 2023

Abstract
Large vision-language models (LVLMs) have recently witnessed rapid advancements,
exhibiting a remarkable capacity for perceiving, understanding, and processing visual
information by connecting visual receptors with large language models (LLMs). However,
current assessments mainly focus on recognizing and reasoning abilities, lacking direct
evaluation of conversational skills and neglecting visual storytelling abilities. In this paper,
we propose an evaluation method that uses strong LLMs as judges to comprehensively
evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual
dialogue dataset TouchStone, consisting of open-world images and questions, covering
five major categories of abilities and 27 subtasks. This dataset not only covers funda-
mental recognition and comprehension but also extends to literary creation. Secondly,
by integrating detailed image annotations, we effectively transform the multimodal in-
put content into a form understandable by LLMs. This enables us to employ advanced
LLMs for directly evaluating the quality of the multimodal dialogue without requiring
human intervention. Through validation, we demonstrate that powerful LLMs, such
as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities
alone, aligning with human preferences. We hope our work can serve as a touchstone for
LVLMs’ evaluation and pave the way for building stronger LVLMs. The evaluation code
is available at https://github.com/OFA-Sys/TouchStone.

1 Introduction

The utilization of large language models (LLMs) (Zhang et al., 2022; Gao et al., 2023b; Brown et al., 2020;
OpenAI, 2023; Anil et al., 2023) in the domain of chatbots (Ouyang et al., 2022; Chiang et al., 2023) has
exhibited remarkable prowess in various aspects such as language comprehension, generation, and interaction.
The extension of GPT-4 (OpenAI, 2023) to encompass visual inputs has further facilitated the rapid development of
large vision-language models (LVLMs). Recently, several LVLMs (Dai et al., 2023; Li et al., 2023a; Zhu et al.,
2023; Su et al., 2023; Li et al., 2023b; Liu et al., 2023a; Ye et al., 2023; Gao et al., 2023a) have been proposed
with the objective of extending the capabilities of pure-text chatbots to incorporate multimodal chatbots. This
is achieved through the alignment of visual encoders with LLMs and the application of visual instruction
tuning techniques. However, it is noteworthy that the evaluation of these recent LVLMs has predominantly
focused on the human evaluation of generation quality within a limited subset of questions, thus lacking a
comprehensive quantitative evaluation.
Recent developments in LLM evaluation methodologies (Zheng et al., 2023), utilizing automated model
assessment, have shown encouraging potential in terms of efficiency and cost-effectiveness when compared
† Corresponding author

Figure 1: Overview of the dataset distribution and some examples. TouchStone encompasses five major
categories and 27 subcategories of questions originating from open scenes, covering the spectrum from
recognition and description to understanding and generation.

to manual evaluation. Nevertheless, despite these significant advancements in text-based capabilities, the
incorporation of multimodal inputs into LLMs remains constrained and underexplored.
Currently, the evaluation methods for LVLMs primarily involve comparing different models based on a
small set of questions or assessing their performance on traditional multimodal tasks such as VQA (Goyal
et al., 2017; Sidorov et al., 2020), image captioning (Agrawal et al., 2019; Chen et al., 2015a), and image
classification (Deng et al., 2009). However, traditional task metrics and annotations often have specific stylistic
preferences for the sake of evaluation and comparison. These stylistic preferences (Agrawal et al., 2019; Chen
et al., 2015a; Goyal et al., 2017), as in VQA and image captioning, do not necessarily align with human preferences.
Besides, obtaining human annotators' ratings or comparisons of different models' outputs is expensive and
difficult to scale up further. Additionally, hallucination remains a crucial obstacle to the broader application of
current LLMs and LVLMs, yet how to evaluate the degree of hallucination in LVLMs is largely overlooked in
current LVLM evaluations and remains to be explored. Therefore, there is an urgent need for automated
evaluation techniques that can provide objective and efficient assessments of LVLM performance in open-
ended dialogues. Recently, MME (Fu et al., 2023) has been proposed to transform questions into binary
judgment statements for large model evaluation. MMBench (Liu et al., 2023b) evaluates models based on
their accuracy in choosing answers. However, the binary judgment and the ability to choose an answer may
not fully capture the complexity in open-ended real-world dialogues, thereby limiting their suitability as a
comprehensive evaluation method.
To tackle these challenges, we propose an automated evaluation method termed TouchStone, which provides
a comprehensive assessment of the capabilities of multimodal language models. The principles of our design
are twofold:
Firstly, in order to evaluate the overall abilities of the models, we construct a comprehensive visual dialogue
dataset, encompassing five major categories of abilities and 27 subtasks. These categories include basic
descriptive ability, visual recognition ability, visual comprehension ability, visual storytelling ability, and
multi-image analysis ability. This not only tests the model’s recognition and comprehension abilities but also
tests its literary creation and analysis abilities. The images and questions in our dataset are curated in an
open-world setting and have been manually annotated and verified.
Secondly, TouchStone involves converting information from other modalities, such as images, into textual
forms by utilizing detailed image annotations and descriptions. This enables the employment of advanced
LLMs to directly assess the quality of dialogues without the requirements for human evaluators or visual-
augmented LLMs. To reflect the models' performance in real-world scenarios, we conduct a direct
evaluation of dialogue quality in terms of correctness, relevance, and usefulness. For scoring,
we utilize a leading language model as a judge, comparing the responses of different LVLMs with the
answers generated by GPT-4 in pairwise comparisons. The responses generated by GPT-4 are obtained
by providing it with fine-grained image annotations and the questions, and are referred to as GPT4-HA (Human
Assisted). To address positional bias, we incorporate position balancing into our scoring mechanism. Through
the comparison with human evaluations, we find that powerful LLMs like GPT-4 (OpenAI, 2023) can
effectively score dialogue quality based solely on their text-based capabilities, while also being able to discern
hallucination issues.
Our contributions can be summarized as follows:

• We curate a diverse visual dialogue dataset that covers five categories of abilities and 27 subtasks,
encompassing not only basic recognition and comprehension but also extending to literary creation.
• TouchStone converts information from other modalities, such as images, into textual forms using detailed
annotations. This allows advanced language models to directly assess dialogue quality without manual
intervention.
• We show that GPT-4 can serve as a reasonable evaluator of the quality of LVLMs' responses.
Specifically, in our experiments, we find that GPT-4's judgments are highly consistent with
human preferences.

2 Related Work

2.1 Large Language Models

Language pretrained models such as GPT (Radford et al., 2019; Brown et al., 2020), BERT (Devlin et al., 2018),
and T5 (Raffel et al., 2020) have demonstrated exceptional performance in a multitude of natural language
processing (NLP) tasks, thanks to their extensive pre-training on copious amounts of data. Notably, the
GPT-3 (Brown et al., 2020) model, with its decoder-only architecture, has exhibited impressive zero-shot

[Detailed Image Description]
This image shows us the proper way to slice beef. By marking "Grain" on the right side of the dark red beef with a small amount of white
tendon and fat on two pieces of cowhide paper placed diagonally on a wooden table, and indicating "Slice this way" on the left side
perpendicular to the grain, viewers can quickly understand the correct method of cutting beef. First, observe the direction of the beef's grain, and
then cut the knife perpendicular to the grain. Dry rosemary, coarse salt, and black pepper are placed around the beef, which not only adds to the
visual appeal of the image, but also highlights some commonly used beef marinades.
[Question]
How should the beef in this picture be cut?
[Assistant 1]
The beef in this picture should be cut perpendicular to the grain.
[Assistant 2]
The beef in the picture should be cut into smaller, more manageable pieces to be easier to cook, serve, or store. The beef, which appears to be a
steak, has a ruler drawn on it, indicating the appropriate cutting lines for trimming the meat into more appropriate sizes, such as 1-inch steaks or
3-4 ounce servings. Following these lines would help in proper portioning and preparation of the beef before cooking, ensuring even cooking
and a nice presentation.
[GPT-4 Comments based on detailed description] (Assistant 1: 8, Assistant 2: 5)
Assistant 1's answer is more helpful, relevant, and accurate as it directly addresses the question and refers to the image description provided.
The answer clearly states that the beef should be cut perpendicular to the grain, which is the main point of the image description.
Assistant 2's answer, while providing some useful information about portioning and preparation, does not directly address the question or the
image description. It mentions a ruler, which is not present in the image description, and does not mention cutting perpendicular to the grain.
Therefore, Assistant 1 receives an 8, and Assistant 2 receives a 5.

Figure 2: Visual dialogue and human annotation example. Fine-grained descriptions, along with two
dialogues, are fed into GPT-4 for scoring and explanation. The highlighted text in red demonstrates the
model’s ability to discern hallucination situations in this context.

capabilities as the model size and training data have increased. Furthermore, the field has witnessed the
emergence of increasingly sophisticated large-scale models like OPT (Zhang et al., 2022), LLaMA (Touvron
et al., 2023), and PaLM (Anil et al., 2023), which have been meticulously constructed to tackle complex
NLP challenges. InstructGPT (Ouyang et al., 2022) incorporates instruction fine-tuning and reinforcement
learning, enabling the model to align with human preferences and effectively execute specified instructions.
Moreover, ChatGPT surpasses its predecessors by engaging in conversational interactions and successfully
carrying out diverse user commands. It demonstrates the potential to address a wide range of real-world
tasks and requirements.

2.2 Vision-Language Models

Extensive research has been undertaken on cross-modal Vision-Language Models (VLMs), utilizing different
pre-training tasks like mask prediction (He et al., 2022; Bao et al., 2021), next-token prediction (Chen et al.,
2020), and contrastive learning (Radford et al., 2021). BLIP has employed a flexible model combination to
achieve multiple tasks, while OFA (Wang et al., 2022) has integrated various textual and visual tasks into a
unified framework. PaLI (Chen et al., 2022) has provided empirical evidence supporting the effectiveness of
larger-scale visual encoders in VLMs. OFASys (Bai et al., 2022) attempts to construct a unified multitask
framework from a system perspective. Flamingo (Alayrac et al., 2022) has leveraged gated cross-attention to
establish connections between pre-trained visual encoders and Large Language Models (LLMs) for impressive
few-shot capabilities. Additionally, Kosmos (Huang et al., 2023) has demonstrated exceptional zero-shot
OCR recognition and inference abilities by training LLMs using input visual features. GPT-4 (OpenAI, 2023)
has recently introduced visual inputs, allowing LVLMs to encompass a broader range of functionalities.
Many recent approaches (Li et al., 2023a; Zhu et al., 2023; Su et al., 2023; Ye et al., 2023; Gao et al., 2023a) have
attempted to integrate pre-trained visual encoders with LLMs. BLIP-2 (Li et al., 2023b) has developed the
Q-Former model to align the visual encoder and LLM, while LLaVA (Liu et al., 2023a) has constructed visual
instruction data to fine-tune LLMs for visual capabilities. InstructBLIP (Dai et al., 2023) has introduced text
instructions into Q-Former to further enhance the performance. mPLUG-Owl (Ye et al., 2023) has attempted
to train a visual encoder to improve the alignment. Kosmos-2 (Peng et al., 2023) and Shikra (Chen et al., 2023)
explore the visual localization ability of LVLMs. The recently proposed Qwen-VL (Bai et al., 2023) achieves
preeminent performance on a wide range of vision-centric tasks such as image captioning, visual question
answering, and text-oriented visual tasks. However, less effort has been devoted to evaluating how these
LVLMs perform under real-world user behavior. In this work, we make an attempt toward tackling this problem.

2.3 Vision-Language Model Evaluation

Early LVLMs (Bao et al., 2021; Wang et al., 2022; Chen et al., 2022; He et al., 2022) primarily focused on
assessing their performance across various subtasks (Agrawal et al., 2019; Deng et al., 2009; Lin et al., 2014),
often through fine-tuning or zero-shot evaluation on different cross-modal tasks. However, the introduction
of more versatile LVLMs, such as GPT-4, has expanded the scope of capabilities to encompass language-
based interactions for understanding visual inputs and executing instructions. These models (OpenAI,
2023) demonstrate the potential for achieving general artificial intelligence, surpassing the limitations
of conventional evaluation methods. The annotations in these tasks (Agrawal et al., 2019; Chen et al.,
2015a) tend to emphasize specific formats and styles, which may not fully capture human preferences.
Additionally, the diverse evaluation metrics (Chen et al., 2015b; Lin, 2004; Vedantam et al., 2015) and
methodologies employed across different tasks make it challenging to establish a unified and comprehensive
benchmark. Furthermore, despite the impressive generalization abilities exhibited by current LVLMs, they
are susceptible to noticeable hallucination problems, which calls for careful and dedicated evaluation of this
aspect. Addressing these limitations, our research aims to develop a novel evaluation methodology that
directly compares conversations to assess the performance of vision-language models. By utilizing detailed
human image annotations and descriptions, advanced LLMs can understand the image contents. This enables
them to directly assess the quality and hallucination of dialogues without the need for manual intervention.

3 Approach

3.1 Data Collection and Statistics

To evaluate the abilities of LVLMs, we construct a diverse and comprehensive dataset that covers five
key dimensions: basic descriptive ability, visual recognition ability, visual comprehension ability, visual
storytelling ability, and multi-image analysis ability.
Basic Descriptive Ability. Image description involves the ability of a model to describe the information
contained in an image, including simple and detailed descriptions. Simple descriptions are typically short
phrases that describe the main subject and action of the image, while detailed descriptions provide more
in-depth information about the scene, the objects it contains, their attributes, and their relationships.
Visual Recognition Ability. Image recognition is the task of recognizing objects or scenes within an image
and inferring relevant information. This area can be further divided into several sub-tasks, including attribute
QA, movie/TV recognition, art recognition, landmark recognition, celebrity recognition, emotion recognition,
text recognition, object recognition, and structure content recognition. These sub-tasks require different
techniques and approaches, such as identifying the number, size, color, height, and other attributes of objects
in the image, recognizing famous landmarks, mountains, and rivers, or understanding the emotions of people
in the image.
Visual Comprehension Ability. Image understanding involves the ability of a model to understand the
meaning of an image and associated tasks. This area encompasses several sub-tasks, such as style appreciation,
abstract image understanding, meme understanding, image analysis, chart analysis, general problem-solving,
and reasoning QA. These tasks require models to analyze the content of complicated charts, PPTs, or
flowcharts, understand the metaphor and analogy in the picture, or analyze the content of instruction
manuals, maps, and math problems.
Visual Storytelling Ability. The visual storytelling ability is the process of literary creation based on visual
content, including writing emails, poetry, stories, ads/commodity recommendations, and brainstorming.
These tasks require models to generate creative and original content based on the image.
Multi-Image Analysis Ability. Multi-image analysis is the task of analyzing and comparing multiple images.
This area includes tasks such as comparing two/multiple images, summarizing multiple image information,


Figure 3: The evaluation pipeline of TouchStone. Firstly, fine-grained descriptions of images are obtained
through manual annotation and inspection. These descriptions, along with questions, are fed into GPT-4
(text-only) to generate reference answers. On the other hand, different LVLMs directly take visual signals
and questions as input to generate answers. The generated answers, reference answers, questions, and
fine-grained descriptions are all scored by GPT-4. The final scores are averaged and used to rank the models,
representing their comprehensive performance.

comparing commodities, and step-by-step analysis of images. These tasks require models to analyze the
content of multiple images and summarize the information.
Overall, the five categories of questions comprehensively assess the model’s capabilities. As shown in Fig. 1,
examples of 27 subtasks are presented. From perception to cognition, and then to creativity, as the difficulty
increases, the demands on the model also become higher. Currently, LVLMs’ abilities are still in the early
stages. Our dataset currently places more emphasis on assessing basic abilities, where the highest proportion
of questions pertains to recognition, accounting for about 44.1%, followed by comprehension questions
at 29.6%. The proportions of the other categories are 15.3% for basic descriptive ability, 7.4% for visual
storytelling ability, and 3.6% for multi-image analysis ability. There are a total of 908 questions.

3.2 Evaluation

Automated and accurate evaluation of LVLMs in the context of open-world multimodal dialogues poses a
significant challenge. Referencing the work of Chiang et al. (2023) and Zheng et al. (2023), we apply a powerful LLM
as a judge to enable automated evaluation. To let the judge effectively comprehend the contents of an image, we
manually substitute the actual image input with fine-grained textual annotations. By inputting these annotations and
corresponding questions to a powerful LLM like GPT-4, we obtain reference answers.
For the evaluation of the LVLMs, we provide actual images and questions as input and obtain their respective
answers. Finally, we employ GPT-4 to score the answers generated by the LVLMs based on fine-grained
annotations and questions. The scoring instructions require the model to assess the usefulness, relevance,
and accuracy of the answers, considering the annotations as the content of the images. To ensure fairness in
the evaluation, each model’s answer is compared against a consistent reference answer from GPT-4. The
average score of the model over all questions is taken as the final score.
To eliminate the influence of answer position, we perform a second scoring round by swapping the positions
of the answers and then compute the average of the two scores obtained. This approach aims to mitigate any
bias introduced by the placement of the answers.
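
To make the scoring procedure concrete, below is a minimal sketch of position-balanced pairwise scoring in
Python. The `judge` callable stands in for any strong LLM judge (e.g., a wrapper around a GPT-4 API call), and
the prompt template, field names, and reply format are illustrative assumptions rather than the exact
implementation used in TouchStone.

```python
from statistics import mean
from typing import Callable, Dict, List

# Illustrative judging prompt; the actual TouchStone instructions may differ.
JUDGE_TEMPLATE = (
    "[Image annotation]\n{annotation}\n\n"
    "[Question]\n{question}\n\n"
    "[Assistant 1]\n{answer_a}\n\n"
    "[Assistant 2]\n{answer_b}\n\n"
    "Treating the annotation as the image content, rate the helpfulness, "
    "relevance, and accuracy of each assistant on a 1-10 scale. "
    "Reply with two numbers: score_1 score_2."
)

def score_candidate(judge: Callable[[str], str], annotation: str, question: str,
                    reference: str, candidate: str) -> float:
    """Score `candidate` against the GPT4-HA `reference`, averaging two judging
    rounds with the answer positions swapped to mitigate position bias."""
    scores: List[float] = []
    for answer_a, answer_b in ((reference, candidate), (candidate, reference)):
        reply = judge(JUDGE_TEMPLATE.format(annotation=annotation,
                                            question=question,
                                            answer_a=answer_a,
                                            answer_b=answer_b))
        s1, s2 = (float(x) for x in reply.split()[:2])
        # Keep the score matching the candidate's position in this round.
        scores.append(s2 if answer_a is reference else s1)
    return mean(scores)

def benchmark(judge: Callable[[str], str], items: List[Dict[str, str]]) -> float:
    """Average the per-question candidate scores over the whole dataset."""
    return mean(score_candidate(judge, it["annotation"], it["question"],
                                it["reference"], it["candidate"])
                for it in items)
```

Averaging the two swapped rounds is what cancels the judge's tendency to favor whichever answer appears first.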
Additionally, in the experimental section, we compare the consistency of the results obtained through

Methods                          Image-Text Data   Instruction Data   Visual Adapter      Updated LLM   Updated Visual Encoder
InstructBLIP (Dai et al., 2023)  -                 16M                Q-Former            ✗             ✗
MiniGPT-4 (Zhu et al., 2023)     5M                3.5K               Q-Former+FC         ✗             ✗
LLaVA (Liu et al., 2023a)        595K              158K               FC                  ✓             ✗
LA-V2 (Gao et al., 2023a)        567K              52K                B-Tuning            ✓             ✗
mPLUG-Owl (Ye et al., 2023)      204M              158K               Visual Abstractor   ✓             ✓
PandaGPT (Su et al., 2023)       -                 160K               FC                  ✓             ✗
Qwen-VL (Bai et al., 2023)       1.5B              350K               Attention pooling   ✓             ✓

Table 1: Comparison of different LVLMs.

Figure 4: Comparison of consistency between model judgment and human judgment.

our proposed method with the results assigned by human evaluators. This comparison demonstrates the
feasibility of using fine-grained human annotations to represent other modalities’ content. It enables the
LLM to serve as a judge for evaluating multimodal content as well. The evaluation of LVLMs in open-world
multimodal dialogues remains a challenging task without a definitive solution. However, the introduction of
a powerful LLM as a judge, coupled with the substitution of images with fine-grained annotations, allows for
more efficient evaluation.

4 Results and Analysis

In this section, we present our experimental setup used to evaluate the performance of the LVLMs. We
validate the efficacy of our evaluation approach through human consistency assessment. Moreover, we
compare the performance across different tasks and also conduct an analysis of the model hallucination
problem. Additionally, we discuss the limitations of our approach and potential areas for improvement.

4.1 Consistency evaluation

In order to evaluate the consistency between model evaluation and human judgments for GPT-4, we compare
the results of both methods. We sample 200 questions based on their distribution and select three models
- InstructBLIP (Dai et al., 2023), LLaVA (Liu et al., 2023a), and Qwen-VL - with different levels of performance.
A total of 600 question-answer pairs are evaluated, with three individuals providing their ratings,
resulting in 1.8k votes. The majority vote of the three individuals is used as the ground-truth result, and
a fourth individual is introduced in cases where there is disagreement. We then calculate the consistency
between the model's predicted results and the human-predicted results. The consistency is measured by the
ratio of consistent scores to the total number of scores. The model-to-human consistency score is 72.2%, while

[Figure content: radar chart over the five ability categories (Basic Description, Visual Recognition, Visual Comprehension, Visual Storytelling, Multi-Image Analysis) together with the overall TouchStone scores:]

Methods            TouchStone Score
Qwen-VL            645.2
mPLUG-Owl          605.4
LLaVA              602.7
LLaMA-AdapterV2    590.1
InstructBLIP       552.4
MiniGPT4           531.7
PandaGPT           488.5

Figure 5: Category-wise comparison and average scoring results for different LVLMs, where GPT4-HA
represents GPT-4’s responses with human annotations rather than visual inputs.

the human-to-human consistency is 78.4%, indicating that the gap between the model's votes and the
human votes is only 6.2%, which is relatively small.
As shown in Fig. 4, consistency varies across different abilities, with higher consistency observed in basic
recognition. As the difficulty of the tasks increases, human consistency gradually decreases. Comparing
different models, we find that models with lower scores have higher consistency, whereas models with higher
scores have lower consistency. This indicates that as the model’s ability improves, a more powerful scoring
model is needed for evaluation.
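
As a rough illustration of the consistency metric above, the sketch below derives the majority-vote ground
truth from three raters (with a fourth rater breaking ties) and computes the agreement ratio; the vote encoding
and the tie-breaking interface are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional

def majority_vote(votes: List[str], tie_breaker: Optional[str] = None) -> str:
    """Return the most common of three raters' votes; if all raters disagree,
    fall back to a fourth rater's tie-breaking vote."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        if tie_breaker is None:
            raise ValueError("a fourth rater is needed to resolve the tie")
        return tie_breaker
    return counts[0][0]

def consistency(predictions: List[str], ground_truth: List[str]) -> float:
    """Ratio of questions on which the predictions match the majority vote."""
    agree = sum(p == g for p, g in zip(predictions, ground_truth))
    return agree / len(ground_truth)
```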

4.2 Performance Comparison

Observing the performance of various models in Fig. 5 and Fig. 6, the models currently show obvious differences
in literary creation performance, and there is still room for improvement in recognition, description, and
understanding and analysis.
Visual storytelling ability. There is a noticeable difference between different models, especially MiniGPT-
4 (Zhu et al., 2023), InstructBLIP (Dai et al., 2023), and PandaGPT (Su et al., 2023), which perform slightly
worse in this aspect. When faced with instructions such as writing poetry or stories, these models tend to
provide simple descriptions rather than literary creations. Overall, models such as LLaVA (Liu et al., 2023a)
and mPLUG-Owl (Ye et al., 2023) that excel in this aspect typically undergo an SFT (Supervised Fine-Tuning)
stage in which the LLM itself participates in training. In contrast, the other models are trained
with parameter-efficient methods such as LoRA (Hu et al., 2021) and bias tuning (Gao et al.,
2023a), or with the LLM parameters frozen. This suggests that training the LLM to learn visual content may
be more useful for tasks that require combining visual content with literary creation abilities.
Visual recognition ability. For models that freeze the visual encoder during pre-training, the recognition
ability does not show a strong correlation with the amount of pre-training data. This suggests that aligning
the pre-trained visual encoder with the LLM does not benefit significantly from a larger dataset. However,
models like mPLUG-Owl and Qwen-VL that unfreeze the visual encoder perform better and are
trained with larger datasets. Differences between models in attribute recognition and emotion recognition
are relatively small, but for general recognition tasks such as celebrities, species, and film and television

Methods       mPLUG-Owl   LA-V2   PandaGPT   LLaVA   MiniGPT-4   InstructBLIP   Qwen-VL
H-score (↓)   762.5       717.5   835.5      664.0   649.0       519.0          504.5

Table 2: Comparison of hallucination scores. The LLM takes fine-grained human annotations and model
predictions as inputs and predicts the degree of hallucination, where a higher score indicates a more serious
hallucination.

works, there are more differences among models, although accuracy and credibility are still far from ideal.
This may be related to the pre-training corpus. Currently, most models have some text recognition ability,
but the accuracy is still relatively low, especially for small characters, numbers, and handwriting. Qwen-VL
has a clear advantage in text recognition, which suggests that training a model solely by aligning
images and texts is not enough for it to master the recognition of densely packed text.
Visual comprehension ability. Particularly, a significant disparity between the models is observed in image-
based math problem-solving and chart analysis tasks. Even when the math question descriptions are provided in
natural language to the corresponding LLM, similar performance gaps persist,
indicating shortcomings in the LLMs' ability to effectively solve mathematical problems. Moreover, models often
struggle to precisely identify chart elements and to establish the correct relationships among them, which
impedes accurate recognition and interpretation of charts and leads to incorrect answers. Qwen-VL exhibits
a clear advantage in chart analysis, as it benefits from higher-resolution inputs and additional multi-task
learning stages that encompass the task of dense text recognition.
Multi-image analysis. To accommodate the input formats of the various models, multiple images are concatenated
into one image and fed to the model. The models have weak capabilities in judging image differences
and summarizing continuous content. On the one hand, multi-image inputs reduce recognition accuracy; on the
other hand, the models fall short in understanding the relationships between multiple pieces of content, especially
in the case of PandaGPT (Su et al., 2023), whose recognition ability decreases significantly when multiple
images are provided.
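
For reference, a minimal sketch of such horizontal image concatenation with PIL is shown below; resizing to a
common height and the white padding are illustrative assumptions rather than the exact preprocessing used in
our evaluation.

```python
from typing import List
from PIL import Image

def concat_horizontally(images: List[Image.Image], pad: int = 8) -> Image.Image:
    """Resize the images to a common height and paste them side by side on a
    white canvas, so that a single-image LVLM can see all of them at once."""
    height = min(im.height for im in images)
    resized = [im.resize((max(1, round(im.width * height / im.height)), height))
               for im in images]
    width = sum(im.width for im in resized) + pad * (len(resized) - 1)
    canvas = Image.new("RGB", (width, height), color="white")
    x = 0
    for im in resized:
        canvas.paste(im, (x, 0))
        x += im.width + pad
    return canvas
```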
Basic description. Inaccuracies in the attributes of the described content are one contributing factor to the low scores.
Moreover, the existing models exhibit significant hallucination, leading to poor overall scores in
the most crucial evaluation of descriptive capabilities. We will provide a detailed comparison of the models'
hallucination tendencies in Section 4.3.

4.3 Analysis of Model Hallucinations

Most existing LVLMs exhibit hallucination issues, such as predicting objects or content that do not exist in
the input visual signals. As illustrated in Fig. 2, through comparative analysis with GPT-4, we discover that
GPT-4 can detect hallucinations in a model's responses and penalize these issues. In order to
compare the hallucinations of different LVLMs, we utilize various prompts to request each model to describe
the images. We then input the model descriptions and fine-grained human annotations into GPT-4 to evaluate
the model’s degree of hallucination.
As illustrated in Table 2, current LVLMs exhibit a high degree of hallucination in the description task. Among
them, PandaGPT (Su et al., 2023) has the highest degree of hallucination, possibly due to the insufficient
visual input provided by ImageBind (Girdhar et al., 2023), which only passes the [CLS] embedding to the LLM. In
contrast, InstructBLIP (Dai et al., 2023) and Qwen-VL (Bai et al., 2023) achieve the lowest hallucination scores
by favoring shorter answers, which reduces the chances of hallucinations.
prompts may be a strategy to prevent hallucinations.
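
A hedged sketch of this hallucination check is given below; the prompt wording, the 0-10 rating scale, and the
summation into a single H-score are illustrative assumptions rather than the exact TouchStone protocol.

```python
from typing import Callable, Dict, List

# Illustrative prompt; the exact wording and scale used in TouchStone may differ.
HALLUCINATION_TEMPLATE = (
    "[Human annotation of the image]\n{annotation}\n\n"
    "[Model description]\n{description}\n\n"
    "Treat the annotation as the ground-truth image content. Rate from 0 (no "
    "hallucination) to 10 (severe hallucination) how much of the model "
    "description mentions objects or details not supported by the annotation. "
    "Reply with a single number."
)

def hallucination_score(judge: Callable[[str], str],
                        samples: List[Dict[str, str]]) -> float:
    """Sum per-image hallucination ratings over the description task, so that a
    higher total indicates more severe hallucination (cf. Table 2)."""
    total = 0.0
    for s in samples:
        reply = judge(HALLUCINATION_TEMPLATE.format(
            annotation=s["annotation"], description=s["description"]))
        total += float(reply.strip().split()[0])
    return total
```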

4.4 Limitations and Potential Areas for Improvement

There is still a lot of room for improvement in the LVLMs based on evaluations and comparisons. In this
section, we propose several potential directions for enhancement in light of the current limitations.
Spatial understanding. These models perform poorly in understanding complex positional and structural
relationships. One reason is that LLMs themselves do not directly learn spatial concepts, and the representa-
tion and description of complex relationships in the data are also limited. Some methods (Peng et al., 2023;
Chen et al., 2023; Bai et al., 2023) have attempted to incorporate certain localization tasks into LVLMs,
which has allowed the models to acquire additional localization capabilities. Adding more data containing
location information, such as detection, segmentation, and scene graphs, may help models establish some
spatial relationship concepts. This broader understanding of spatial relationships can contribute to improved
performance in tasks like layout understanding and spatial planning.
Multi-image pre-training. While single-image pre-training is effective for LVLM recognition, it has limited
utility in comparing and summarizing multiple images. For this reason, it is necessary to introduce more
interleaved image-text data for learning, such as webpages, articles, and news.
Enhancing LLM through Multimodal Content. While aligning vision encoders to LLMs quickly constructs
LVLMs, the models' abilities remain limited in some tasks, such as spatial understanding, dense text recognition,
and mathematical reasoning. Further exploration of how to improve LLMs' abilities through multimodal content
is worthwhile.
Hallucination problem. Addressing the issue of visual hallucinations, where models generate content that
does not exist in the input image, is a crucial aspect to consider. Insufficient visual input can easily lead to
hallucinations. On the one hand, techniques that strengthen the model's ability to judge whether content is
absent can be explored. On the other hand, paying more attention to how the model's answers relate to the
visual content and reinforcing the consistency between answers and visual content may help reduce visual hallucinations.
Higher resolution. Most LVLMs take input images at 224 × 224 resolution, but increasing the resolution
of input images could improve models’ ability to recognize small objects, dense text, and fine-grained details,
leading to more accurate outputs.

5 Conclusion

In conclusion, we propose an evaluation method for large vision-language models (LVLMs) that uses strong
LLMs as judges to comprehensively evaluate their various abilities. Our TouchStone dataset encompasses
five major categories of abilities and 27 subtasks, which not only cover fundamental recognition and com-
prehension but also extend to literary creation. It integrates detailed image annotations and descriptions to
transform the multimodal input content into a form understandable by language models. Through validation,
we demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging
their textual capabilities alone, aligning with human preferences. Our results indicate that there is still ample
room for improvement in current LVLMs, and identify potential areas for further development. Our method
provides a valuable tool for evaluating LVLMs and advancing their capabilities, ultimately promoting the
development of more effective and comprehensive vision-language models.

References

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi
Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE
International Conference on Computer Vision, pages 8948–8957, 2019.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur

Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot
learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak
Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint
arXiv:2305.10403, 2023.

Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang, Xiaohuan Zhou, Peng Wang, Sinan
Tan, An Yang, et al. Ofasys: A multi-modal multi-task learning system for building generalist models.
arXiv preprint arXiv:2212.04408, 2022.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and
Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint
arXiv:2308.12966, 2023.
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv
preprint arXiv:2106.08254, 2021.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
Advances in neural information processing systems, 33:1877–1901, 2020.
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing
multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative
pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman,
Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model.
arXiv preprint arXiv:2209.06794, 2022.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence
Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325,
2015a.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence
Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325,
2015b.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang,
Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impress-
ing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang
Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with
instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui
Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language
models. arXiv preprint arXiv:2306.13394, 2023.

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He,
Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model.
arXiv preprint arXiv:2304.15010, 2023a.
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui
He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint
arXiv:2304.15010, 2023b.
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and
Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter:
Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 6904–6913, 2017.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are
scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 16000–16009, 2022.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu
Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei
Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with
language models. arXiv preprint arXiv:2302.14045, 2023.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal
model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training
with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out,
pages 74–81, 2004.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th
European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer,
2014.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint
arXiv:2304.08485, 2023a.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang,
Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint
arXiv:2307.06281, 2023b.
OpenAI. Gpt-4 technical report, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2:
Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models
are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural
language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The
Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image
captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-
follow them all. arXiv preprint arXiv:2305.16355, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation
language models. arXiv preprint arXiv:2302.13971, 2023.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description
evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575,
2015.
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou,
and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-
sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR,
2022.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng
Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality.
arXiv preprint arXiv:2304.14178, 2023.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan,
Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv
preprint arXiv:2205.01068, 2022.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv
preprint arXiv:2306.05685, 2023.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-
language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

[Figure content: radar charts across Basic Description Ability, Visual Recognition Ability, Visual Comprehension Ability, Visual Storytelling Ability, and Multi-Image Analysis Ability, with one curve per model: Qwen-VL, mPLUG-Owl, LLaVA, LLaMA-AdapterV2, InstructBLIP, MiniGPT4, and PandaGPT.]

Figure 6: Comparison of different models across five major categories and 27 subtasks. Each model is
represented by a different color.

[Figure content: three qualitative examples comparing LVLM answers with GPT4-HA references: (1) identifying the city shown in a night cityscape of Singapore, where most models answer Singapore and PandaGPT answers Sydney; (2) computing the hypotenuse of a right triangle with legs 6 and 8 (GPT4-HA answer: 10), where several models miscalculate; (3) describing what a seal lying on a rock is doing.]

Figure 7: Examples of answering results.


[Figure content: two qualitative examples with model answers and GPT4-HA references: (1) explaining why a picture of an orange cat in a kung fu posture captioned "Everybody was kung fu fighting" is funny; (2) writing a poem based on a picture, where the model outputs describe blossoming flowers.]

Figure 8: Examples of answering results.


[Figure content: an example asking models to compare several pairs of casual pants shown in four scenes and describe their unique features, with model answers contrasted against the GPT4-HA reference.]

Figure 9: Examples of answering results.
