
Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning

Ayush Singh, Mansi Gupta, Shivank Garg, Abhinav Kumar, Vansh Agrawal
Vision and Language Group
Indian Institute of Technology, Roorkee
{ayush_s@mt,m_gupta@ma,shivank_g@mfs,abhinav_k@ma,vansh_a@ph}.iitr.ac.in
arXiv:2410.05928v1 [cs.CV] 8 Oct 2024

Abstract
Vision-Language Models (VLMs) have transformed tasks requiring visual and
reasoning abilities, such as image retrieval and Visual Question Answering (VQA).
Despite their success, VLMs face significant challenges with tasks involving
geometric reasoning, algebraic problem-solving, and counting. These limitations
stem from difficulties in effectively integrating multiple modalities and accurately
interpreting geometry-related tasks [1]. Various works claim that introducing a
captioning pipeline before VQA tasks enhances performance [2]. We incorporated
this pipeline for tasks involving geometry, algebra, and counting and found that captioning gains are
not generalizable: in particular, larger VLMs primarily trained on downstream QnA tasks show random
performance on math-related challenges. We therefore present a promising alternative: task-based
prompting, which enriches the prompt with task-specific guidance. This approach proves more
effective than direct captioning methods for math-heavy problems.

1 Introduction
With the rise of Large Language Models, which demonstrate the ability to understand and generate
text for tasks beyond their explicit training, Vision-Language Models have extended these capabilities
to multimodal tasks involving images and text [3]. These models excel in tasks like Visual Question
Answering (VQA), image captioning, and object segmentation [4]. However, recent studies reveal
that VLMs struggle with simple, low-level visual tasks that humans easily solve, highlighting a need
to enhance their visual reasoning and understanding [5].
Some works try to improve VLMs’ reasoning abilities by fine-tuning them [6], [7], [8]. Additionally,
some research on VLMs focuses on improving their question-answering abilities through
a two-step process: captioning followed by question-answering. This method takes advantage of
VLMs’ pre-training in text generation, as many tasks require generating text descriptions. The main
challenge lies in the model’s ability to effectively combine and interpret multimodal information,
understanding visual and textual inputs while capturing their interactions. One area where VLMs
consistently underperform is counting [9], primarily due to the scarcity of training data that accurately
labels object counts, especially as the number of objects increases. While captioning has improved
performance in some tasks [2], we hypothesize that these improvements are not generalizable and
depend on various factors, which we aim to explore through our experiments. Further, captioning
fails to capture all the attributes of the image, which is especially crucial in mathematical tasks.
To address these limitations, we introduce prompting techniques designed to enhance the models’
reasoning capabilities. We specifically constructed prompts based solely on the question, excluding
any direct information about the answer. These approach-based prompts were tested in both direct

Preprint. Under review.


QnA tasks and as guides for captioning, with the expectation of improving performance. Additionally,
we assessed robustness by using adversarial prompts, which suggest incorrect but problem-relevant
solving strategies, and random prompts, which introduce irrelevant text, and evaluated the models’
responses to these perturbations.

1.1 Background and Related work

Several studies have examined the reasoning and comprehension abilities of Vision-Language Models
in various tasks requiring spatial understanding and reasoning capabilities. These studies have
demonstrated that multimodal language models rely less on visual information and perform better
when they are given adequate textual cues [10], [11]. While methods like few-shot prompting [12]
have been shown to improve the performance of VLMs [13], these models continue to struggle with
mathematical tasks, particularly counting, leading some to describe them as “blind” to numbers [5].
Recent research suggests that much of the reasoning performed by VLMs may stem more from the
phrasing of the questions than from the images themselves. This is evident in tasks that heavily rely
on visual information, such as counting nested squares or identifying line intersections, where VLMs
consistently underperform. Datasets like Math Vision [14] and CountBench [9] have been developed
specifically to test these visual reasoning abilities.
To enhance Visual Question Answering (VQA) performance, various techniques have been proposed,
including the use of question-driven image captions, which are subsequently fed into language models.
These approaches have shown potential to improve outcomes in specific tasks, such as direct
image-based QnA [2]. However, whether such captioning-based techniques can reliably enhance
VLM performance on math-related tasks remains open; our work explores this question.

2 Method
We assessed Vision-Language Models on a range of geometry-related tasks. To ensure robustness and
generalization, we take tasks from four datasets covering geometry, counting, algebra, and
mathematical reasoning. We use a diverse set of VLMs to assess the generalizability of our approach:
one closed-source large model, Gemini-1.5-Flash, and three open-source smaller models, LLaVa,
Florence-2, and Phi 3.5 Vision Instruct. This range ensures variation in size, from smaller models
with fewer parameters to larger, more complex ones. Each model was tested across eight distinct
prompting settings, divided into two main categories:
1. Question-Answering (QnA): Using a classical zero-shot approach, each model was directly
queried with questions related to images from the datasets.
2. Captioning: We generated captions for each image using the base model and then fed the
captions to an LLM, which performed QnA on the generated caption alone (both settings
are sketched below).
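The sketch below illustrates the two evaluation settings. The callables vlm_answer, vlm_caption, and llm_answer are hypothetical wrappers around whichever model API is under test (Gemini, LLaVa, Florence-2, or Phi-3.5); they are not part of any released codebase.

```python
from typing import Callable


def direct_qna(vlm_answer: Callable[[str, bytes], str],
               image: bytes, question: str) -> str:
    """Setting 1: classical zero-shot QnA -- the VLM sees the image and question directly."""
    return vlm_answer(question, image)


def caption_then_qna(vlm_caption: Callable[[str, bytes], str],
                     llm_answer: Callable[[str], str],
                     image: bytes, question: str, keywords: str) -> str:
    """Setting 2: the VLM first captions the image (guided by task keywords),
    then a text-only LLM answers the question from that caption alone."""
    caption = vlm_caption(f"Describe this image, focusing on: {keywords}", image)
    prompt = f"Image description: {caption}\n\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)
```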

Figure 1: Example of our QnA approach, showing the approach-based, adversarial, and random prompt variants.

We further tested the impact of incorporating additional information and context into the prompts.
Specifically, we provided explicit guidance for solving the problem, generated using an LLM (Gemini).
This method aimed to determine how effectively the models could leverage explicit procedural
guidance to improve their QnA performance. Additionally, we tested two other variants, random
prompts and adversarial prompts (Figure 2). Exact details are given in Appendix A.
In our captioning experiments, we first generated image captions by providing task-specific keywords
derived from the Llama 3.1-Instruct model [15]. These keywords were extracted by prompting the
model to produce concise, 1-2-word summaries that capture the essence of each question. After
generating the captions, we fed them to an LLM and asked the corresponding question for each task.
Similar to direct QnA, we also try approach-based captioning, where the approach is passed to the
model along with the image and keywords to generate the caption, which is then used to extract the
final answer (a sketch follows).
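A minimal sketch of these two steps, assuming generic text-generation and captioning callables for Llama 3.1-Instruct and the VLM under test; the prompt wording is illustrative rather than the exact prompts used in the paper.

```python
from typing import Callable


def extract_keywords(llama_generate: Callable[[str], str], question: str) -> str:
    """Prompt Llama 3.1-Instruct for a concise 1-2 word summary of the task in the question."""
    prompt = ("Summarise the mathematical task in the following question in 1-2 keywords:\n"
              f"{question}")
    return llama_generate(prompt).strip()


def approach_based_caption(vlm_caption: Callable[[str, bytes], str],
                           image: bytes, keywords: str, approach: str) -> str:
    """Generate a caption conditioned on the task keywords and the LLM-generated approach."""
    prompt = (f"Describe the image with attention to: {keywords}.\n"
              f"Keep the following solution approach in mind:\n{approach}")
    return vlm_caption(prompt, image)
```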

Figure 2: Example of our caption-based approach. Keywords (e.g., "intersection, circle") are extracted from the question using Llama; the VLM produces an image caption (e.g., "A blue circle separated by a pink circle"), optionally conditioned on an approach; the caption and the question ("Are the two circles overlapping? Answer with Yes/No.") are then passed to the answering LLM.

3 Experiments and Results


We chose diverse models and techniques to test our hypothesis. We use four datasets, Geo170k,
CountBench, Blind, and MathVision, containing various tasks related to geometry, reasoning, algebra,
and counting, and we split the MathVision dataset into three subparts: mainly vision-based,
geometry-based, and mathematics-based (exact dataset details in Appendix C). Approach-based
prompting improves overall results for both direct VQA and caption-based QA (Figure 3). Further, we
observe a drop in performance when prompting with the adversarial approach and an overall increase
over the baseline when prompting with the random approach (Table 1). Model performance also varies
across datasets (Table 2): models perform best on CountBench, which focuses on counting, and worst
on MathVision due to the complexity of its tasks. Within MathVision, performance is better on
vision-based tasks than on mathematics-related tasks, and performance on geometry-related tasks is
relatively poor (Appendix C).

Model             Base   Approach  Random  Adv.   Caption  Capt+App  Capt+Adv  Capt+Rand
Gemini-1.5-Flash  42.73  44.32     41.30   43.86  35.18    42.53     37.70     38.62
LLaVa             09.70  12.54     09.89   10.61  24.06    25.54     23.15     25.07
Florence-2        07.12  07.49     04.89   02.89  14.03    16.86     14.56     16.35
Phi-3.5-Vision    28.09  32.49     30.58   28.92  31.44    34.76     29.27     28.71

Table 1: Model-wise comparison of accuracy (%) of the different approaches

Figure 3: Accuracy of each model (Gemini, LLaVa, Florence-2, Phi-3.5) under the base and approach-based prompts. Left: comparison of caption-based methods; right: comparison of direct (base) QnA methods.

Dataset      Base   Approach  Random  Adv.   Caption  Capt+App  Capt+Adv  Capt+Rand
Math-Vision  10.13  14.06     11.92   13.31  19.28    22.31     20.43     19.61
CountBench   32.58  33.33     31.67   34.50  27.16    34.16     28.00     31.00
Geo          16.22  20.46     19.11   14.67  24.00    28.56     26.00     23.00
Blind        31.52  32.12     27.08   25.98  31.94    32.81     28.60     31.90

Table 2: Dataset-wise comparison of accuracy (%) of the different approaches

4 Conclusion
The results of our study align with our initial hypothesis: VLMs exhibit significant limitations on
mathematical tasks, particularly those involving numbers and counting. While captioning using
task-specific keywords is a first step toward improving performance in some contexts, our findings
suggest that its effectiveness is inconsistent, varying greatly with the dataset, task complexity, and
model size. Larger models, often pre-trained on QnA tasks, inherently perform better on QnA-related
tasks, a trend not observed in smaller models; this highlights the influence of pre-training, as observed
with Gemini (see Appendix B).
Our experiments demonstrate the potential for improving VLMs through techniques that enhance
their reasoning capabilities and consistently improve performance. Additionally, when assessing the
models’ robustness and generalization using adversarial prompts, we observed a drop in performance,
supporting our claim that VLMs incorporate the additional information into their reasoning; meanwhile,
the performance stability under random prompts underscores the robustness of their reasoning abilities
(Appendix D). In conclusion, improving the reasoning capabilities of VLMs presents a promising path
forward, especially given their inherent blindness in perceiving numbers and handling mathematical
tasks. By leveraging structured prompts constructed solely from information in the question, together
with approaches that guide reasoning, we can mitigate some of these weaknesses and move closer to
better performance in more complex problem-solving scenarios.

5 Limitations and Future Work

Testing the reasoning capabilities of multimodal models is a broad area of research, and we propose
ways to improve the generalizability of these models in our research. Due to resource constraints,
we couldn’t experiment with many mainstream and large-scale models, especially state-of-the-art
Vision-Language Models (VLMs). With more computational resources and funding, future work
could focus on scalability and robustness across a wider range of model architectures, such as Claude
Sonnet [16] and GPT-4 [17].
Additionally, advanced methods, such as sophisticated prompt engineering techniques or the
incorporation of domain-specific knowledge, could enhance the models’ ability to generate more accurate and
contextually relevant captions. Moreover, replacing the QA model with interpretable alternatives
could offer greater transparency and insights into the decision-making process, thus shedding light on
the model’s reasoning and performance in a more understandable way.

References
[1] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision
tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[2] Övgü Özdemir and Erdem Akagündüz. Enhancing visual question answering through question-
driven image captions as prompts. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1562–1571, 2024.
[3] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision
tasks: A survey, 2024.
[4] Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi
Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the
capabilities of vlms. arXiv preprint arXiv:2406.14544, 2024.
[5] Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen.
Vision language models are blind. arXiv preprint arXiv:2407.06581, 2024.
[6] Denisa Roberts and Lucas Roberts. Smart vision-language reasoners. In AI for Math Workshop
@ ICML 2024, 2024.
[7] Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan
Jiang, Bin Xu, Yuxiao Dong, and Jie Tang. Mathglm-vision: Solving mathematical problems
with multi-modal large language model. arXiv preprint arXiv:2409.13729, 2024.
[8] Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, and Wei
Wang. Enhancing large vision language models with self-training on image comprehension.
arXiv preprint arXiv:2405.19716, 2024.
[9] Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel.
Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 3170–3180, 2023.
[10] Gengyuan Zhang, Yurui Zhang, Kerui Zhang, and Volker Tresp. Can vision-language models
be a good guesser? exploring vlms for times and location reasoning. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, pages 636–645, 2024.
[11] Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, and Neel Joshi. Is a picture
worth a thousand words? delving into spatial reasoning for vision language models. arXiv
preprint arXiv:2406.14852, 2024.
[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
[13] Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Measur-
ing and improving chain-of-thought reasoning in vision-language models. arXiv preprint
arXiv:2309.04461, 2023.
[14] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring
multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804,
2024.
[15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd
of models. arXiv preprint arXiv:2407.21783, 2024.
[16] Anthropic. Introducing the next generation of Claude, 2024.

[17] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.

A Prompting
To enrich the prompts used in our experiments, we employed Gemini 1.5 Flash. The question was
presented to the model without an accompanying image. The model was tasked with generating
responses under three different conditions (a code sketch follows the list):

1. Approach-Based: The model was asked to provide an approach for solving the question.
2. Adversarial: The model was prompted to generate a misleading or incorrect approach to
solving the question. Although inaccurate, the response needed to be plausible.
3. Random: The model was asked to generate a random string.
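A minimal sketch of how these three hint types can be generated, using the google-generativeai client as one possible way to query Gemini 1.5 Flash; the templates are illustrative paraphrases of the conditions above, not the paper's verbatim prompts.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # environment-specific; key handling is assumed
gemini = genai.GenerativeModel("gemini-1.5-flash")

TEMPLATES = {
    # Approach-Based: correct, question-only guidance (no image, no final answer).
    "approach": "Give a step-by-step approach for solving this question without "
                "revealing the final answer:\n{q}",
    # Adversarial: plausible-sounding but deliberately incorrect guidance.
    "adversarial": "Give a plausible but incorrect approach for solving this question:\n{q}",
    # Random: irrelevant filler text, used to probe robustness.
    "random": "Generate a short random string.",
}


def make_hint(condition: str, question: str) -> str:
    """Return the extra text appended to the QnA or captioning prompt."""
    return gemini.generate_content(TEMPLATES[condition].format(q=question)).text
```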

In the Captioning-based experiments, we utilized the LLaMA 3.1 Instruct 8B model
(https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to generate keywords based solely on the
question. These keywords provided initial guidance for the captioning task. For zero-shot captioning,
we instructed the respective model under test to generate a caption for the image using the provided
keywords. Similarly, for the remaining captioning tasks, we asked the QnA base model to generate
image captions using both the keywords and additional hints generated as described above
(Approach-Based, Adversarial, Random), after which the answer was produced by passing the caption
to another LLM (Llama 3.1 Instruct).
Note: For Florence-2, captions were generated using the <DETAILED CAPTION> task prompt, and the
approach was appended to the resulting detailed caption. For the other models, the approach was passed
during the caption-generation stage. Additionally, the base Florence-2 checkpoint was unable to perform
QnA-related tasks, so we used the Florence-2 DocVQA checkpoint for QnA.

B Experiment Details
Certain experimental details worth mentioning for the respective models used are:

• LLaVa: accessed through the GrokAPI (https://developers.x.ai/python-sdk/grok/).
• Gemini-1.5-Flash: accessed through Google AI Studio (https://aistudio.google.com/).
• Florence-2: we used the open-source checkpoints available on Hugging Face
  (https://huggingface.co/microsoft/Florence-2-large and https://huggingface.co/HuggingFaceM4/Florence-2-DocVQA).
• Phi 3.5 Vision Instruct: we used the open-source model available on Hugging Face
  (https://huggingface.co/microsoft/Phi-3.5-vision-instruct); a loading sketch follows.
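As one possible way to load the open-weight checkpoints, the sketch below uses the standard transformers Auto* interfaces described on the model cards; it is an illustration, not the authors' exact setup, and generation arguments follow each model card.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"  # analogous for microsoft/Florence-2-large
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # these repositories ship custom modeling code
    device_map="auto",       # requires accelerate; places weights on available devices
)
# Inputs are built with `processor(...)` and answers decoded from `model.generate(...)`,
# following the prompt format documented on each model card.
```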

C Datasets
The following datasets were used for our experiments:

• Math Vision (https://huggingface.co/datasets/MathLLMs/MathVision): The Math Vision dataset is a
curated collection of 3,040 high-quality mathematical problems with visual contexts drawn from real
math competitions. For our experiments, we broadly divided the dataset into three categories:
– Visual Based: This was originally split into Area, Angles, and Length-related tasks.
– Geometry Based: This was originally split into categories: Analytical Geometry,
Combinatorial Geometry, Transformation Geometry, Descriptive Geometry and Solid
Geometry.
– General Mathematics: This was originally split into categories: Graph Theory, Logic,
Algebra, Combinatorics, Statistics, and Arithmetic.
• Blind (https://huggingface.co/datasets/XAI/vlmsareblind): The Blind dataset consists of images and
question-answer pairs about low-level visual tasks. We used a subset of 150 images per task. The tasks
include counting the number of intersections of two circles or lines, checking whether two lines
intersect, counting the number of rows and columns in a grid, finding the number of overlapping
circles in an image, and finding the number of paths between two points in a subway-connection image.
• CountBench (https://huggingface.co/datasets/nielsr/countbench): The CountBench dataset contains a
total of 540 images, each with between two and ten instances of a particular object, whose captions
reflect this count. It is a benchmark dataset for counting-related tasks.
• Geo170k (https://huggingface.co/datasets/Luckyjhg/Geo170K): The Geo dataset contains more than
170K geometric image-caption and question-answer pairs. We used a subset of 500 images to conduct
our experiments (a loading sketch follows the list).
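For reference, a loading-and-subsampling sketch using the Hugging Face datasets library is shown below; the repository IDs come from the URLs above, but split and config names are assumptions and may need adjusting to each dataset card.

```python
from datasets import load_dataset

# Illustrative only: split/config names are assumed, not taken from the paper.
countbench = load_dataset("nielsr/countbench")   # DatasetDict of the available splits
split = next(iter(countbench.values()))          # use whichever split the repo provides

# The paper evaluates on subsets (e.g., 500 Geo170k examples, 150 Blind images per task);
# a fixed-seed shuffle-and-select gives a comparable subset.
subset = split.shuffle(seed=0).select(range(min(150, len(split))))
```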

D The Random Approach

When using the random approach, we observed that performance often surpassed the baseline, with
significant improvements in some cases. We hypothesize that the model tends to disregard random
information, but in doing so, it becomes more cautious and focused on providing a correct response.
This contrasts with the baseline, where such behavior is less apparent. While these findings are
promising, there is still potential for further research to understand and refine this approach fully.

E Task Wise Keywords

Dataset      Task in dataset          Keywords for task
MathVision   Analytic Geometry        analytic geometry
             Algebra                  algebra, mathematics, logic
             Transformation Geo       transformation, geometry
             Statistics               statistics, graph
             Angle                    metric geometry, angles, mathematics, logic
             Combinatorics            combinatorics, logic
             Descriptive Geo          descriptive geometry, mathematics
             Logic                    logics, reasonings
             Length                   lengths, geometry
             Arithmetic               arithmetic, logics, mathematics
             Area                     area, geometry
             Combinatorial Geo        combinatorial, geometry
             Solid Geometry           solids, geometry
             Graph Theory             logics, connections, graphs
Countbench   Counting objects         counting, objects
GEO170K      Geometry problems        geometry, mathematics
BLIND        Line intersection count  number, intersections
             Two line intersection    lines, intersecting
             Interior pentagon        count, number, pentagons
             Subway                   subway lines, count, paths
             Rows/columns             count, rows, columns
             Two circles              circles, touching

Table 3: Task-wise keywords used for caption generation in each dataset

F Results Tables

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 36 36 24 38 36 30 34 32
Area 27 28 24 28 32 34 28 22
Length 30 32 24 36 30 26 32 36
Descriptive Geo 34 26 28 22 14 20 24 20
Analytic Geo 16 18 22 14 20 22 20 12
Combinatorial Geo 22 26 14 26 20 24 10 6
Transformation Geo 18 24 20 22 20 14 28 22
Solid Geo 20 32 20 24 16 28 16 14
Graph Theory 28 24 22 26 24 26 22 18
Arithmetic 26 26 28 20 22 36 26 18
Logic 32 32 36 14 18 36 22 30
Combinatorics 20 20 12 22 22 28 14 12
Algebra 26 22 30 22 18 32 28 10
Statistics 26 24 24 24 22 24 22 12
Dataset Average 26.44 27.29 23.38 25.64 23.89 27.31 24.42 20.49
GEO170K
Geometry problems 30 33.33 32 30 30 36 34 28
CountBench
Counting objects 62 68 64.67 68 43 62.66 52 64
Blind
Line Intersection 50 58 44 59 43 43 35 38
Two line intersection 72 70 54 72 73 74 70 71
pentagon 53 42 24 51 45 41 43.34 42
Subway 23 23 26 16 0 0 0 0
Rows/columns 28 32 31 31 19 24 11 18
Two circles 89 67 92 82 83 83 83 83
Dataset Average 52.50 48.67 45.16 51.84 43.87 44.16 40.39 42.00
Table 4: Gemini Model Performance Across Various Tasks

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 6 14 8 14 38 34 28 26
Area 2 4 2 8 24 26 26 28
Length 6 14 14 14 20 26 28 26
Descriptive Geo 20 12 16 16 20 22 24 26
Analytic Geo 4 12 8 10 22 12 26 20
Combinatorial Geo 12 14 12 14 16 12 8 12
Transformation Geo 18 16 14 14 24 24 28 42
Solid Geo 4 10 6 6 22 26 20 16
Graph Theory 12 6 6 4 18 24 28 12
Arithmetic 2 14 10 6 16 22 14 22
Logic 10 16 8 8 16 24 18 26
Combinatorics 4 2 8 4 16 14 20 20
Algebra 0 4 4 4 26 28 16 20
Statistics 6 12 6 12 20 12 14 20
Dataset Average 7.31 10.82 8.73 10.11 22.27 22.84 22.29 23.29
GEO170K
Geometry problems 5 8 6.67 6 33.33 36 32.66 36
CountBench
Counting objects 12 14 10 12 14.66 16 14 15
Blind
Line Intersection 4 1 3 8 6 5 6 8
Two line intersection 32 42 30 23 64 68 55 60
pentagon 2 3 2 5 11 10 9 3
Subway 6 8 7 8 28 25 26 26
Rows/columns 5 9 7 9 6 10 6 8
Two circles 38 41 36 33 41 46 40 51
Dataset Average 14.50 17.33 14.16 14.33 26 27.33 23.67 26
Table 5: LLaVa Model Performance Across Various Tasks

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 5 2 0 3 22 24 17 26
Area 4 3 0 4 22 26 24 28
Length 6 4 0 2 16 26 25 26
Descriptive Geo 3 3 1 4 16 24 0 27
Analytic Geo 2 5 2 6 17 16 20 21
Combinatorial Geo 3 2 1 3 19 15 8 10
Transformation Geo 1 2 1 3 20 23 18 22
Solid Geo 4 3 0 2 19 21 19 18
Graph Theory 1 2 0 2 17 23 28 12
Arithmetic 2 3 1 3 15 23 18 21
Logic 2 2 1 3 15 24 20 25
Combinatorics 2 4 1 2 15 15 17 19
Algebra 2 3 0 2 24 25 18 19
Statistics 1 1 1 2 18 16 16 19
Dataset Average 3.09 2.83 0.56 2.98 18.51 22.04 19.25 21.81
GEO170K
Geometry problems 0 1.34 0 0 4 9 6 5
CountBench
Counting objects 3 4 2 4 6 8 8 9
Blind
Line Intersection 15 10 31 0 37 38 30 39
Two line intersection 74 76 31 0 69 70 69 73
pentagon 0 0 0 0 0 0 0 0
Rows/columns 0 0 0 0 0 0 0 0
Two circles 23 23 23 23 32 34 26 36
Dataset Average 22.4 21.8 17 4.6 27.6 28.4 25 29.6
Table 6: Florence-2 Model Performance Across Various Tasks

Task                0-Shot  Approach  Random  Adv.  Caption  Capt+App  Capt+Adv  Capt+Rand
MathVision
Angles 0 12 18 14 10 20 18 14
Area 2 18 16 14 14 20 18 16
Length 6 22 14 14 14 24 18 10
Descriptive Geo 6 16 20 16 16 22 22 16
Analytic Geo 2 10 14 12 14 14 26 14
Combinatorial Geo 4 18 10 18 12 12 8 12
Transformation Geo 4 18 24 16 14 10 8 12
Solid Geo 6 14 12 14 4 6 4 4
Graph Theory 0 16 14 14 10 16 14 8
Arithmetic 8 20 16 18 20 16 18 20
Logic 6 12 14 14 10 22 18 14
Combinatorics 6 14 12 14 14 14 14 12
Algebra 2 12 18 14 4 10 12 8
Statistics 2 6 4 12 18 24 18 20
Dataset Average 3.68 15.28 15 14.51 12.44 17.04 15.75 12.87
GEO170K
Geometry problems 18.67 26.67 25.33 14 38 40.67 38 36
CountBench
Counting objects 53.33 47.33 50 54 45 50 38 36
Blind
Line Intersection 35 39 32 37 33 33 31 40
Two line intersection 59 62 55 47 70 69 54 65
pentagon 12 13 2 13 6 9 4 2
Subway 22 44 17 18 31 24 18 29
Rows/columns 6 9 9 7 4 2 0 7
Two circles 86 77 77 77 38 51 45 37
Dataset Average 36.66 40.67 32 33.16 30.33 31.33 25.33 30
Table 7: Phi-3.5-Vision Model Performance Across Various Tasks
