
Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning

Ayush Singh, Mansi Gupta, Shivank Garg, Abhinav Kumar, Vansh Agrawal
Vision and Language Group
Indian Institute of Technology, Roorkee
{ayush_s@mt,m_gupta@ma,shivank_g@mfs,abhinav_k@ma,vansh_a@ph}.iitr.ac.in
arXiv:2410.05928v1 [cs.CV] 8 Oct 2024

Abstract
Vision-Language Models (VLMs) have transformed tasks requiring visual and
reasoning abilities, such as image retrieval and Visual Question Answering (VQA).
Despite their success, VLMs face significant challenges with tasks involving
geometric reasoning, algebraic problem-solving, and counting. These limitations
stem from difficulties in effectively integrating multiple modalities and accurately
interpreting geometry-related tasks [1]. Various works claim that introducing a
captioning pipeline before VQA tasks enhances performance [2]. We incorporated
this pipeline for tasks involving geometry, algebra, and counting and found that captioning gains are
not generalizable: in particular, larger VLMs primarily trained on downstream QnA tasks show random
performance on math-related challenges. We therefore present a promising alternative: task-based
prompting, which enriches the prompt with task-specific guidance. This approach proves more
effective than direct captioning methods for math-heavy problems.

1 Introduction
With the rise of Large Language Models, which demonstrate the ability to understand and generate
text for tasks beyond their explicit training, Vision-Language Models have extended these capabilities
to multimodal tasks involving images and text [3]. These models excel in tasks like Visual Question
Answering (VQA), image captioning, and object segmentation [4]. However, recent studies reveal
that VLMs struggle with simple, low-level visual tasks that humans easily solve, highlighting a need
to enhance their visual reasoning and understanding [5].
Some works try to improve VLMs’ reasoning abilities by fine-tuning them [6], [7], [8]. Additionally,
some research on VLMs focuses on improving their question-answering abilities through
a two-step process: captioning followed by question-answering. This method takes advantage of
VLMs’ pre-training in text generation, as many tasks require generating text descriptions. The main
challenge lies in the model’s ability to effectively combine and interpret multimodal information,
understanding visual and textual inputs while capturing their interactions. One area where VLMs
consistently underperform is counting [9], primarily due to the scarcity of training data that accurately
labels object counts, especially as the number of objects increases. While captioning has improved
performance in some tasks [2], we hypothesize that these improvements are not generalizable and
depend on various factors, which we aim to explore through our experiments. Further, captioning
fails to capture all the attributes of the image, which is especially crucial in mathematical tasks.
To address these limitations, we introduce prompting techniques designed to enhance the models’
reasoning capabilities. We specifically constructed prompts based solely on the question, excluding
any direct information about the answer. These approach-based prompts were tested in both direct

Preprint. Under review.


QnA tasks and as guides for captioning, with the expectation of improving performance. Additionally,
we assessed robustness by using adversarial prompts, which suggest incorrect but problem-relevant
solving strategies, and random prompts, which introduce irrelevant text, and evaluated the models’
responses to these perturbations.

1.1 Background and Related work

Several studies have examined the reasoning and comprehension abilities of Vision-Language Models
in various tasks requiring spatial understanding and reasoning capabilities. These studies have
demonstrated that multimodal language models rely less on visual information and perform better
when they are given adequate textual cues [10], [11]. While methods like few-shot prompting [12]
have been shown to improve the performance of VLMs [13], these models continue to struggle with
mathematical tasks, particularly counting, leading some to describe them as “blind” to numbers [5].
Recent research suggests that much of the reasoning performed by VLMs may stem more from the
phrasing of the questions than from the images themselves. This is evident in tasks that heavily rely
on visual information, such as counting nested squares or identifying line intersections, where VLMs
consistently underperform. Datasets like Math Vision [14] and CountBench [9] have been developed
specifically to test these visual reasoning abilities.
To enhance Visual Question Answering (VQA) performance, various techniques have been proposed,
including the use of question-driven image captions, which are subsequently fed into language models.
These approaches have shown potential to improve outcomes in specific tasks, such as direct
image-based QnA [2]. However, whether such captioning-based techniques can reliably enhance
VLM performance on math-related tasks remains open; our work explores this question.

2 Method
We assessed Vision-Language Models on a range of geometry-related tasks. To ensure robustness and
generalization, we take tasks from four datasets covering geometry, counting, algebra, and
mathematical reasoning. We use a diverse set of VLMs to assess the generalizability of our approach:
one closed-source large model, Gemini-1.5-Flash, and three open-source smaller models, LLaVa,
Florence-2, and Phi 3.5 Vision Instruct. This range ensures variation in size, from smaller models
with fewer parameters to larger, more complex ones. Each model was tested across eight distinct
prompting settings, divided into two main categories:
1. Question-Answering (QnA): Using a classical zero-shot approach, each model was directly
queried with questions related to images from the datasets.
2. Captioning: We generated captions for each image using the base model and then fed the
captions to an LLM, which performed QnA on the generated caption alone (both settings
are sketched below).
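The sketch below illustrates the two evaluation settings. The callables vlm_answer, vlm_caption, and llm_answer are hypothetical wrappers around whichever model API is under test (Gemini, LLaVa, Florence-2, or Phi-3.5); they are not part of any released codebase.

```python
from typing import Callable


def direct_qna(vlm_answer: Callable[[str, bytes], str],
               image: bytes, question: str) -> str:
    """Setting 1: classical zero-shot QnA -- the VLM sees the image and question directly."""
    return vlm_answer(question, image)


def caption_then_qna(vlm_caption: Callable[[str, bytes], str],
                     llm_answer: Callable[[str], str],
                     image: bytes, question: str, keywords: str) -> str:
    """Setting 2: the VLM first captions the image (guided by task keywords),
    then a text-only LLM answers the question from that caption alone."""
    caption = vlm_caption(f"Describe this image, focusing on: {keywords}", image)
    prompt = f"Image description: {caption}\n\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)
```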

Figure 1: Example of our QnA approach, showing the approach-based, adversarial, and random prompt variants.

We further tested the impact of incorporating additional information and context into the prompts.
Specifically, we provided explicit guidance for solving the problem, generated using an LLM (Gemini).
This method aimed to determine how effectively the models could leverage explicit procedural
guidance to improve their QnA performance. Additionally, we tested two other variants, random
prompts and adversarial prompts (Figure 2). Exact details are given in Appendix A.
In our captioning experiments, we first generated image captions by providing task-specific keywords
derived from the Llama 3.1-Instruct model [15]. These keywords were extracted by prompting the
model to produce concise, 1-2-word summaries that capture the essence of each question. After
generating the captions, we fed them to an LLM and asked the corresponding question for each task.
Similar to direct QnA, we also try approach-based captioning, where the approach is passed to the
model along with the image and keywords to generate the caption, which is then used to extract the
final answer (a sketch follows).
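A minimal sketch of these two steps, assuming generic text-generation and captioning callables for Llama 3.1-Instruct and the VLM under test; the prompt wording is illustrative rather than the exact prompts used in the paper.

```python
from typing import Callable


def extract_keywords(llama_generate: Callable[[str], str], question: str) -> str:
    """Prompt Llama 3.1-Instruct for a concise 1-2 word summary of the task in the question."""
    prompt = ("Summarise the mathematical task in the following question in 1-2 keywords:\n"
              f"{question}")
    return llama_generate(prompt).strip()


def approach_based_caption(vlm_caption: Callable[[str, bytes], str],
                           image: bytes, keywords: str, approach: str) -> str:
    """Generate a caption conditioned on the task keywords and the LLM-generated approach."""
    prompt = (f"Describe the image with attention to: {keywords}.\n"
              f"Keep the following solution approach in mind:\n{approach}")
    return vlm_caption(prompt, image)
```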

Figure 2: Example of our caption-based approach. Keywords (e.g., "intersection, circle") are extracted from the question using Llama; the VLM produces an image caption (e.g., "A blue circle separated by a pink circle"), optionally conditioned on an approach; the caption and the question ("Are the two circles overlapping? Answer with Yes/No.") are then passed to the answering LLM.

3 Experiments and Results


We chose diverse models and techniques to test our hypothesis. We use four datasets, Geo170k,
CountBench, Blind, and MathVision, containing various tasks related to geometry, reasoning, algebra,
and counting, and we split the MathVision dataset into three subparts: mainly vision-based,
geometry-based, and mathematics-based (exact dataset details in Appendix C). Approach-based
prompting improves overall results for both direct VQA and caption-based QA (Figure 3). Further, we
observe a drop in performance when prompting with the adversarial approach and an overall increase
over the baseline when prompting with the random approach (Table 1). Model performance also varies
across datasets (Table 2): models perform best on CountBench, which focuses on counting, and worst
on MathVision due to the complexity of its tasks. Within MathVision, performance is better on
vision-based tasks than on mathematics-related tasks, and performance on geometry-related tasks is
relatively poor (Appendix C).

Model             Base   Approach  Random  Adv.   Caption  Capt+App  Capt+Adv  Capt+Rand
Gemini-1.5-Flash  42.73  44.32     41.30   43.86  35.18    42.53     37.70     38.62
LLaVa             09.70  12.54     09.89   10.61  24.06    25.54     23.15     25.07
Florence-2        07.12  07.49     04.89   02.89  14.03    16.86     14.56     16.35
Phi-3.5-Vision    28.09  32.49     30.58   28.92  31.44    34.76     29.27     28.71

Table 1: Model-wise comparison of accuracy (%) of the different approaches

Figure 3: Accuracy of each model (Gemini, LLaVa, Florence-2, Phi-3.5) under the base and approach-based prompts. Left: comparison of caption-based methods; right: comparison of direct (base) QnA methods.

Dataset      Base   Approach  Random  Adv.   Caption  Capt+App  Capt+Adv  Capt+Rand
Math-Vision  10.13  14.06     11.92   13.31  19.28    22.31     20.43     19.61
CountBench   32.58  33.33     31.67   34.50  27.16    34.16     28.00     31.00
Geo          16.22  20.46     19.11   14.67  24.00    28.56     26.00     23.00
Blind        31.52  32.12     27.08   25.98  31.94    32.81     28.60     31.90

Table 2: Dataset-wise comparison of accuracy (%) of the different approaches

4 Conclusion
The results of our study align with our initial hypothesis: VLMs exhibit significant limitations on
mathematical tasks, particularly those involving numbers and counting. While captioning using
task-specific keywords is a first step toward improving performance in some contexts, our findings
suggest that its effectiveness is inconsistent, varying greatly with the dataset, task complexity, and
model size. Larger models, often pre-trained on QnA tasks, inherently perform better on QnA-related
tasks, a trend not observed in smaller models; this highlights the influence of pre-training, as observed
with Gemini (see Appendix B).
Our experiments demonstrate the potential for improving VLMs through techniques that enhance
their reasoning capabilities and consistently improve performance. Additionally, when assessing the
models’ robustness and generalization using adversarial prompts, we observed a drop in performance,
supporting our claim that VLMs incorporate the additional information into their reasoning; meanwhile,
the performance stability under random prompts underscores the robustness of their reasoning abilities
(Appendix D). In conclusion, improving the reasoning capabilities of VLMs presents a promising path
forward, especially given their inherent blindness in perceiving numbers and handling mathematical
tasks. By leveraging structured prompts constructed solely from information in the question, together
with approaches that guide reasoning, we can mitigate some of these weaknesses and move closer to
better performance in more complex problem-solving scenarios.

5 Limitations and Future Work

Testing the reasoning capabilities of multimodal models is a broad area of research, and we propose
ways to improve the generalizability of these models in our research. Due to resource constraints,
we couldn’t experiment with many mainstream and large-scale models, especially state-of-the-art
Vision-Language Models (VLMs). With more computational resources and funding, future work
could focus on scalability and robustness across a wider range of model architectures, such as Claude
Sonnet [16] and GPT-4 [17].
Additionally, advanced methods, such as sophisticated prompt engineering techniques or the
incorporation of domain-specific knowledge, could enhance the models’ ability to generate more accurate and
contextually relevant captions. Moreover, replacing the QA model with interpretable alternatives
could offer greater transparency and insights into the decision-making process, thus shedding light on
the model’s reasoning and performance in a more understandable way.

References
[1] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision
tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[2] Övgü Özdemir and Erdem Akagündüz. Enhancing visual question answering through question-
driven image captions as prompts. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1562–1571, 2024.
[3] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision
tasks: A survey, 2024.
[4] Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi
Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the
capabilities of vlms. arXiv preprint arXiv:2406.14544, 2024.
[5] Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen.
Vision language models are blind. arXiv preprint arXiv:2407.06581, 2024.
[6] Denisa Roberts and Lucas Roberts. Smart vision-language reasoners. In AI for Math Workshop
@ ICML 2024, 2024.
[7] Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan
Jiang, Bin Xu, Yuxiao Dong, and Jie Tang. Mathglm-vision: Solving mathematical problems
with multi-modal large language model. arXiv preprint arXiv:2409.13729, 2024.
[8] Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, and Wei
Wang. Enhancing large vision language models with self-training on image comprehension.
arXiv preprint arXiv:2405.19716, 2024.
[9] Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel.
Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 3170–3180, 2023.
[10] Gengyuan Zhang, Yurui Zhang, Kerui Zhang, and Volker Tresp. Can vision-language models
be a good guesser? exploring vlms for times and location reasoning. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, pages 636–645, 2024.
[11] Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, and Neel Joshi. Is a picture
worth a thousand words? delving into spatial reasoning for vision language models. arXiv
preprint arXiv:2406.14852, 2024.
[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
[13] Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Measur-
ing and improving chain-of-thought reasoning in vision-language models. arXiv preprint
arXiv:2309.04461, 2023.
[14] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring
multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804,
2024.
[15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd
of models. arXiv preprint arXiv:2407.21783, 2024.
[16] Anthropic. Introducing the next generation of Claude, 2024.

[17] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.

A Prompting
To enrich the prompts used in our experiments, we employed Gemini 1.5 Flash. The question was
presented to the model without an accompanying image. The model was tasked with generating
responses under three different conditions (a code sketch follows the list):

1. Approach-Based: The model was asked to provide an approach for solving the question.
2. Adversarial: The model was prompted to generate a misleading or incorrect approach to
solving the question. Although inaccurate, the response needed to be plausible.
3. Random: The model was asked to generate a random string.
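A minimal sketch of how these three hint types can be generated, using the google-generativeai client as one possible way to query Gemini 1.5 Flash; the templates are illustrative paraphrases of the conditions above, not the paper's verbatim prompts.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # environment-specific; key handling is assumed
gemini = genai.GenerativeModel("gemini-1.5-flash")

TEMPLATES = {
    # Approach-Based: correct, question-only guidance (no image, no final answer).
    "approach": "Give a step-by-step approach for solving this question without "
                "revealing the final answer:\n{q}",
    # Adversarial: plausible-sounding but deliberately incorrect guidance.
    "adversarial": "Give a plausible but incorrect approach for solving this question:\n{q}",
    # Random: irrelevant filler text, used to probe robustness.
    "random": "Generate a short random string.",
}


def make_hint(condition: str, question: str) -> str:
    """Return the extra text appended to the QnA or captioning prompt."""
    return gemini.generate_content(TEMPLATES[condition].format(q=question)).text
```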

In the Captioning-based experiments, we utilized the LLaMA 3.1 Instruct 8B model
(https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to generate keywords based solely on the
question. These keywords provided initial guidance for the captioning task. For zero-shot captioning,
we instructed the respective model under test to generate a caption for the image using the provided
keywords. Similarly, for the remaining captioning tasks, we asked the QnA base model to generate
image captions using both the keywords and additional hints generated as described above
(Approach-Based, Adversarial, Random), after which the answer was produced by passing the caption
to another LLM (Llama 3.1 Instruct).
Note: For Florence-2, captions were generated using the <DETAILED CAPTION> task prompt, and the
approach was appended to the resulting detailed caption. For the other models, the approach was passed
during the caption-generation stage. Additionally, the base Florence-2 checkpoint was unable to perform
QnA-related tasks, so we used the Florence-2 DocVQA checkpoint for QnA.

B Experiment Details
Certain experimental details worth mentioning for the respective models used are:

• LLaVa: accessed through the GrokAPI (https://developers.x.ai/python-sdk/grok/).
• Gemini-1.5-Flash: accessed through Google AI Studio (https://aistudio.google.com/).
• Florence-2: we used the open-source checkpoints available on Hugging Face
  (https://huggingface.co/microsoft/Florence-2-large and https://huggingface.co/HuggingFaceM4/Florence-2-DocVQA).
• Phi 3.5 Vision Instruct: we used the open-source model available on Hugging Face
  (https://huggingface.co/microsoft/Phi-3.5-vision-instruct); a loading sketch follows.
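As one possible way to load the open-weight checkpoints, the sketch below uses the standard transformers Auto* interfaces described on the model cards; it is an illustration, not the authors' exact setup, and generation arguments follow each model card.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"  # analogous for microsoft/Florence-2-large
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # these repositories ship custom modeling code
    device_map="auto",       # requires accelerate; places weights on available devices
)
# Inputs are built with `processor(...)` and answers decoded from `model.generate(...)`,
# following the prompt format documented on each model card.
```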

C Datasets
The following datasets were used for our experiments:

• Math Vision (https://huggingface.co/datasets/MathLLMs/MathVision): The Math Vision dataset is a
curated collection of 3,040 high-quality mathematical problems with visual contexts drawn from real
math competitions. For our experiments, we broadly divided the dataset into three categories:
– Visual Based: This was originally split into Area, Angles, and Length-related tasks.
– Geometry Based: This was originally split into categories: Analytical Geometry,
Combinatorial Geometry, Transformation Geometry, Descriptive Geometry and Solid
Geometry.
– General Mathematics: This was originally split into categories: Graph Theory, Logic,
Algebra, Combinatorics, Statistics, and Arithmetic.
• Blind (https://huggingface.co/datasets/XAI/vlmsareblind): The Blind dataset consists of images and
question-answer pairs about low-level visual tasks. We used a subset of 150 images per task. The tasks
include counting the number of intersections of two circles or lines, checking whether two lines
intersect, counting the number of rows and columns in a grid, finding the number of overlapping
circles in an image, and finding the number of paths between two points in a subway-connection image.
• CountBench (https://huggingface.co/datasets/nielsr/countbench): The CountBench dataset contains a
total of 540 images, each with between two and ten instances of a particular object, whose captions
reflect this count. It is a benchmark dataset for counting-related tasks.
• Geo170k (https://huggingface.co/datasets/Luckyjhg/Geo170K): The Geo dataset contains more than
170K geometric image-caption and question-answer pairs. We used a subset of 500 images to conduct
our experiments (a loading sketch follows the list).
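For reference, a loading-and-subsampling sketch using the Hugging Face datasets library is shown below; the repository IDs come from the URLs above, but split and config names are assumptions and may need adjusting to each dataset card.

```python
from datasets import load_dataset

# Illustrative only: split/config names are assumed, not taken from the paper.
countbench = load_dataset("nielsr/countbench")   # DatasetDict of the available splits
split = next(iter(countbench.values()))          # use whichever split the repo provides

# The paper evaluates on subsets (e.g., 500 Geo170k examples, 150 Blind images per task);
# a fixed-seed shuffle-and-select gives a comparable subset.
subset = split.shuffle(seed=0).select(range(min(150, len(split))))
```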

D The Random Approach

When using the random approach, we observed that performance often surpassed the baseline, with
significant improvements in some cases. We hypothesize that the model tends to disregard random
information, but in doing so, it becomes more cautious and focused on providing a correct response.
This contrasts with the baseline, where such behavior is less apparent. While these findings are
promising, there is still potential for further research to understand and refine this approach fully.

E Task Wise Keywords

Dataset      Task in dataset          Keywords for task
MathVision   Analytic Geometry        analytic geometry
             Algebra                  algebra, mathematics, logic
             Transformation Geo       transformation, geometry
             Statistics               statistics, graph
             Angle                    metric geometry, angles, mathematics, logic
             Combinatorics            combinatorics, logic
             Descriptive Geo          descriptive geometry, mathematics
             Logic                    logics, reasonings
             Length                   lengths, geometry
             Arithmetic               arithmetic, logics, mathematics
             Area                     area, geometry
             Combinatorial Geo        combinatorial, geometry
             Solid Geometry           solids, geometry
             Graph Theory             logics, connections, graphs
Countbench   Counting objects         counting, objects
GEO170K      Geometry problems        geometry, mathematics
BLIND        Line intersection count  number, intersections
             Two line intersection    lines, intersecting
             Interior pentagon        count, number, pentagons
             Subway                   subway lines, count, paths
             Rows/columns             count, rows, columns
             Two circles              circles, touching

Table 3: Task-wise keywords used for caption generation in each dataset

F Results Tables

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 36 36 24 38 36 30 34 32
Area 27 28 24 28 32 34 28 22
Length 30 32 24 36 30 26 32 36
Descriptive Geo 34 26 28 22 14 20 24 20
Analytic Geo 16 18 22 14 20 22 20 12
Combinatorial Geo 22 26 14 26 20 24 10 6
Transformation Geo 18 24 20 22 20 14 28 22
Solid Geo 20 32 20 24 16 28 16 14
Graph Theory 28 24 22 26 24 26 22 18
Arithmetic 26 26 28 20 22 36 26 18
Logic 32 32 36 14 18 36 22 30
Combinatorics 20 20 12 22 22 28 14 12
Algebra 26 22 30 22 18 32 28 10
Statistics 26 24 24 24 22 24 22 12
Dataset Average 26.44 27.29 23.38 25.64 23.89 27.31 24.42 20.49
GEO170K
Geometry problems 30 33.33 32 30 30 36 34 28
CountBench
Counting objects 62 68 64.67 68 43 62.66 52 64
Blind
Line Intersection 50 58 44 59 43 43 35 38
Two line intersection 72 70 54 72 73 74 70 71
pentagon 53 42 24 51 45 41 43.34 42
Subway 23 23 26 16 0 0 0 0
Rows/columns 28 32 31 31 19 24 11 18
Two circles 89 67 92 82 83 83 83 83
Dataset Average 52.50 48.67 45.16 51.84 43.87 44.16 40.39 42.00
Table 4: Gemini Model Performance Across Various Tasks

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 6 14 8 14 38 34 28 26
Area 2 4 2 8 24 26 26 28
Length 6 14 14 14 20 26 28 26
Descriptive Geo 20 12 16 16 20 22 24 26
Analytic Geo 4 12 8 10 22 12 26 20
Combinatorial Geo 12 14 12 14 16 12 8 12
Transformation Geo 18 16 14 14 24 24 28 42
Solid Geo 4 10 6 6 22 26 20 16
Graph Theory 12 6 6 4 18 24 28 12
Arithmetic 2 14 10 6 16 22 14 22
Logic 10 16 8 8 16 24 18 26
Combinatorics 4 2 8 4 16 14 20 20
Algebra 0 4 4 4 26 28 16 20
Statistics 6 12 6 12 20 12 14 20
Dataset Average 7.31 10.82 8.73 10.11 22.27 22.84 22.29 23.29
GEO170K
Geometry problems 5 8 6.67 6 33.33 36 32.66 36
CountBench
Counting objects 12 14 10 12 14.66 16 14 15
Blind
Line Intersection 4 1 3 8 6 5 6 8
Two line intersection 32 42 30 23 64 68 55 60
pentagon 2 3 2 5 11 10 9 3
Subway 6 8 7 8 28 25 26 26
Rows/columns 5 9 7 9 6 10 6 8
Two circles 38 41 36 33 41 46 40 51
Dataset Average 14.50 17.33 14.16 14.33 26 27.33 23.67 26
Table 5: LLaVa Model Performance Across Various Tasks

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 5 2 0 3 22 24 17 26
Area 4 3 0 4 22 26 24 28
Length 6 4 0 2 16 26 25 26
Descriptive Geo 3 3 1 4 16 24 0 27
Analytic Geo 2 5 2 6 17 16 20 21
Combinatorial Geo 3 2 1 3 19 15 8 10
Transformation Geo 1 2 1 3 20 23 18 22
Solid Geo 4 3 0 2 19 21 19 18
Graph Theory 1 2 0 2 17 23 28 12
Arithmetic 2 3 1 3 15 23 18 21
Logic 2 2 1 3 15 24 20 25
Combinatorics 2 4 1 2 15 15 17 19
Algebra 2 3 0 2 24 25 18 19
Statistics 1 1 1 2 18 16 16 19
Dataset Average 3.09 2.83 0.56 2.98 18.51 22.04 19.25 21.81
GEO170K
Geometry problems 0 1.34 0 0 4 9 6 5
CountBench
Counting objects 3 4 2 4 6 8 8 9
Blind
Line Intersection 15 10 31 0 37 38 30 39
Two line intersection 74 76 31 0 69 70 69 73
pentagon 0 0 0 0 0 0 0 0
Rows/columns 0 0 0 0 0 0 0 0
Two circles 23 23 23 23 32 34 26 36
Dataset Average 22.4 21.8 17 4.6 27.6 28.4 25 29.6
Table 6: Florence-2 Model Performance Across Various Tasks

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 0 12 18 14 10 20 18 14
Area 2 18 16 14 14 20 18 16
Length 6 22 14 14 14 24 18 10
Descriptive Geo 6 16 20 16 16 22 22 16
Analytic Geo 2 10 14 12 14 14 26 14
Combinatorial Geo 4 18 10 18 12 12 8 12
Transformation Geo 4 18 24 16 14 10 8 12
Solid Geo 6 14 12 14 4 6 4 4
Graph Theory 0 16 14 14 10 16 14 8
Arithmetic 8 20 16 18 20 16 18 20
Logic 6 12 14 14 10 22 18 14
Combinatorics 6 14 12 14 14 14 14 12
Algebra 2 12 18 14 4 10 12 8
Statistics 2 6 4 12 18 24 18 20
Dataset Average 3.68 15.28 15 14.51 12.44 17.04 15.75 12.87
GEO170K
Geometry problems 18.67 26.67 25.33 14 38 40.67 38 36
CountBench
Counting objects 53.33 47.33 50 54 45 50 38 36
Blind
Line Intersection 35 39 32 37 33 33 31 40
Two line intersection 59 62 55 47 70 69 54 65
pentagon 12 13 2 13 6 9 4 2
Subway 22 44 17 18 31 24 18 29
Rows/columns 6 9 9 7 4 2 0 7
Two circles 86 77 77 77 38 51 45 37
Dataset Average 36.66 40.67 32 33.16 30.33 31.33 25.33 30
Table 7: Phi-3.5-Vision Model Performance Across Various Tasks
