
Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving

Evgenii Evstafev A

A University Information Services (UIS), University of Cambridge, Roger Needham Building, 7 JJ Thomson Ave, Cambridge CB3 0RB, UK, [email protected]

ABSTRACT

Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration only slightly improved accuracy (+0.8%) for the model llama3.1:8b, it reduced the execution time of the generated code by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Even within controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.

TYPE OF PAPER AND KEYWORDS

Empirical Research Paper (Comparative Analysis and Benchmarking Study), Large Language Models, Mathematical Reasoning, Code Generation, Benchmarking, Token-by-Token Regeneration, Computational Efficiency, MATH Dataset, Domain-Specific Performance, Automated Evaluation, Safety Constraints.

1. INTRODUCTION

Large language models (LLMs) have demonstrated remarkable proficiency in natural language tasks [1], yet their ability to solve complex mathematical problems remains constrained by challenges in symbolic reasoning and precise output formatting [2]. While recent advances, such as code-augmented problem-solving, offer promising pathways [3], systematic evaluations of LLMs' mathematical capabilities—particularly across diverse architectures and difficulty levels—remain underexplored [4]. This study addresses this gap by benchmarking 10 LLMs (7B–8B parameters) on 945 competition-level mathematics problems from the MATH dataset [4, 5], focusing on their ability to generate executable Python code as a reasoning intermediate. The investigation introduces two contributions:

• A granular evaluation framework using mistral-large-2411 [6] for automated answer scoring, addressing inconsistencies in mathematical notation (e.g., fractions, symbolic constants).
• An empirical analysis of token-by-token regeneration—a dynamic output refinement technique—applied to llama3.1:8b [7] to assess its impact on accuracy and computational efficiency.

2. BACKGROUND AND RELATED WORK

Modern LLMs employ diverse strategies for mathematical problem-solving, including chain-of-thought prompting [10], program-aided language models (PAL) [11], and symbolic equation generation. Code-based approaches, where models generate executable programs to derive solutions, have gained traction for their ability to enforce logical rigor and mitigate hallucination [12].

However, output variability—such as inconsistent numerical formatting or symbolic representation (e.g., π vs. 3.1416)—complicates automated evaluation, necessitating robust scoring mechanisms. Key limitations persist:

• Performance degrades nonlinearly with problem complexity, as seen in GPT-4's 23% accuracy drop on Level 5 MATH problems [4].
• Models exhibit uneven proficiency across mathematical subjects, often struggling with combinatorics and modular arithmetic [13].
• Unrestricted code generation introduces vulnerabilities like infinite loops or unsafe system calls, requiring sandboxed execution environments [14].

These challenges underscore the need for standardized evaluation protocols and architectural innovations to enhance reliability. Existing benchmarks like MATH [4] rely on binary correctness scoring, overlooking partial solutions.

3. METHODOLOGY

3.1 DATASET CREATION

This study uses the MATH dataset, a publicly available collection of 12,500 challenging competition-level mathematics problems sourced from competitions such as AMC 10, AMC 12, and AIME. The dataset, described in the paper "Measuring Mathematical Problem Solving with the MATH Dataset" [4], includes step-by-step solutions and spans diverse mathematical domains.

A stratified subset of 945 problems [15] was curated to ensure balanced representation across 7 mathematical subjects (Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus) and 5 difficulty levels (Level 1: simplest, Level 5: most complex). Each subject-level combination contains 27 problems, yielding a total of 7 subjects × 5 levels × 27 problems = 945 problems. This structured sampling guarantees diversity in both topic coverage and problem complexity.

To streamline evaluation, the dataset was augmented with a dedicated field containing final numerical answers (without explanations). This design enables direct comparison between model-generated outputs and ground-truth solutions.
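As an illustration of this stratification, the minimal sketch below samples 27 problems per subject-level cell. The field names "subject" and "level" are assumptions, and the published subset [15] was not necessarily produced by this exact script.

import random
from collections import defaultdict

def stratified_subset(problems, per_cell=27, seed=0):
    # Group problems by (subject, level) and draw a fixed-size sample from
    # every cell: 7 subjects x 5 levels x 27 problems = 945 problems.
    cells = defaultdict(list)
    for p in problems:
        cells[(p["subject"], p["level"])].append(p)
    rng = random.Random(seed)
    subset = []
    for key, items in sorted(cells.items()):
        if len(items) < per_cell:
            raise ValueError(f"Not enough problems in stratum {key}")
        subset.extend(rng.sample(items, per_cell))
    return subset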
3.2 MODEL SELECTION

The study evaluates 10 language models of varying architectures and scales (7B–8B parameters), selected for their computational efficiency and diversity in training methodologies:

1. llama3.1:8b [7]: General-purpose LLM with strong NLP capabilities.
2. olmo2:7b [16]: Open-source model optimized for research reproducibility.
3. codestral-2501 [17]: Code-focused model for code generation tasks.
4. gpt-4o-mini-2024-07-18 [8]: Compact variant of GPT-4.
5. granite3.1-dense:8b [18]: Dense model trained on large-scale datasets.
6. open-codestral-mamba:v0.1 [9]: Hybrid architecture combining code and general capabilities.
7. ministral-8b-2410 [19]: Lightweight model for on-device applications.
8. gemini-1.5-flash-8b [20]: Efficient model with strong task performance.
9. mistral-small-2409 [21]: Smaller variant of Mistral's architecture.
10. command-r7b:7b [22]: General-purpose conversational model.

Token-by-token regeneration, a technique to refine outputs iteratively, was exclusively tested on llama3.1:8b to isolate its impact.
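The paper does not detail the exact regeneration procedure, so the sketch below reflects only one plausible reading: the answer is rebuilt one token at a time, with the accumulated prefix passed back to the model at each step. The next_token_fn interface is a hypothetical stand-in for whichever decoding API is actually used.

def regenerate_token_by_token(next_token_fn, prompt, max_tokens=512):
    # next_token_fn(prompt, prefix) is a caller-supplied (hypothetical)
    # function that returns the model's next token given the problem prompt
    # and the partial output generated so far, or None at end of sequence.
    prefix = ""
    for _ in range(max_tokens):
        token = next_token_fn(prompt, prefix)
        if token is None:
            break
        prefix += token
    return prefix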
Evaluation Constraints:

• Each model independently solved all 945 problems.
• Models were instructed to "generate Python code that prints the final answer to the console." [23]
• Code Generation Limit: 2 minutes per attempt.
• Execution Timeout: 1 minute per code run to prevent infinite loops.
• Retry Mechanism: Up to 10 attempts per problem, with error messages (e.g., syntax/runtime errors) fed back to the model for iterative refinement, as sketched below.
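A minimal sketch of this retry loop follows, assuming a hypothetical ask_model(prompt) wrapper around the model under test; the 2-minute generation limit is omitted for brevity, and the actual harness may differ.

import subprocess
import sys

MAX_ATTEMPTS = 10
EXEC_TIMEOUT = 60  # one minute per code run

def solve_with_retries(ask_model, problem_text):
    # ask_model(prompt) is a hypothetical wrapper that returns Python source.
    prompt = (problem_text + "\n"
              "Generate Python code that prints the final answer to the console.")
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        code = ask_model(prompt + feedback)
        try:
            run = subprocess.run([sys.executable, "-c", code],
                                 capture_output=True, text=True,
                                 timeout=EXEC_TIMEOUT)
        except subprocess.TimeoutExpired:
            feedback = "\nPrevious attempt exceeded the 60-second execution timeout."
            continue
        if run.returncode == 0 and run.stdout.strip():
            return run.stdout.strip()  # console output passed on for scoring
        feedback = "\nPrevious attempt failed with error:\n" + run.stderr
    return None  # unsolved after 10 attempts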
A restricted execution environment permitted only safe built-in functions (e.g., mathematical operations, control flow structures) to mitigate security risks [14].
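One common way to approximate such a restriction is to execute the generated code against a whitelisted set of built-ins, as in the minimal sketch below. This is illustrative only: exec-based sandboxes are not a complete security boundary, and the environment actually used in the study [14] may differ.

import io
import math
from contextlib import redirect_stdout

SAFE_BUILTINS = {
    "abs": abs, "min": min, "max": max, "sum": sum, "range": range,
    "len": len, "int": int, "float": float, "round": round,
    "print": print, "enumerate": enumerate, "zip": zip,
}

def run_restricted(code):
    # Execute generated code with a curated set of built-ins plus the math
    # module, capturing whatever it prints to the console.
    env = {"__builtins__": SAFE_BUILTINS, "math": math}
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        exec(code, env)
    return buffer.getvalue().strip()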
3.3 EVALUATION METRICS

To address variability in mathematical answer formats (e.g., fractions, symbolic notation like π, or decimal representations), the correctness of console outputs was evaluated using mistral-large-2411 [6], a high-performance language model. This approach ensures robust and consistent scoring despite differences in output formatting.

For every problem, the console output generated by a model's Python code was independently assessed by mistral-large-2411. The evaluator model was blinded to the source model's identity to eliminate bias.

Answers were scored on a 5-point scale by comparing the generated output to the ground-truth answer from the MATH dataset:

• 5 (Correct): Exact match or mathematically equivalent result.
• 4 (Almost Correct): Minor formatting discrepancies or rounding errors.
• 3 (Partially Correct): Partial solution with significant inaccuracies.
• 2 (Incorrect): Wrong answer but relevant to the problem.
• 1 (Completely Incorrect): Irrelevant or nonsensical output.

The primary metric is the weighted accuracy, calculated as the percentage of answers scoring 4 or 5 across the dataset. Scores ≤3 are treated as incorrect.
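For concreteness, the metric reduces to the short computation below; the example scores are invented for illustration, not taken from the study's results.

def weighted_accuracy(scores):
    # Percentage of answers scored 4 or 5; scores of 3 or below are incorrect.
    if not scores:
        return 0.0
    correct = sum(1 for s in scores if s >= 4)
    return 100.0 * correct / len(scores)

print(weighted_accuracy([5, 4, 3, 5, 2, 4, 1, 5]))  # prints 62.5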
The evaluator model received only the ground-truth answer and the generated console output, without metadata about the source model.

This method quantifies solution quality more granularly than binary correctness, enabling future analysis of incremental improvements (e.g., from "partially correct" to "almost correct").

To address variability in mathematical answer formats (e.g., fractions vs. decimals, symbolic notation like π vs. numerical approximations), direct string comparison proves insufficient for robust evaluation. Such discrepancies necessitate expert judgment to assess equivalence, particularly for context-dependent representations.

The mistral-large-2411 model was employed as a proxy for domain expertise, evaluating console outputs against ground-truth answers through semantic equivalence rather than syntactic exactness. Its 5-point scoring scale accommodates partial correctness and formatting nuances, mirroring human expert evaluation. This approach avoids the brittleness of exact matching while ensuring systematic, bias-free assessment across diverse problem types and answer styles.
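The judging prompt itself is not reproduced in the paper; the sketch below only illustrates the general shape of such an LLM-as-judge call. query_judge is a hypothetical wrapper around the mistral-large-2411 API, and the rubric text is a paraphrase of the scale defined above.

SCORING_RUBRIC = (
    "Compare the console output with the ground-truth answer and reply "
    "with a single digit: 5 = exact or mathematically equivalent, "
    "4 = minor formatting or rounding issues, 3 = partially correct, "
    "2 = wrong but relevant, 1 = irrelevant or nonsensical."
)

def score_answer(query_judge, ground_truth, console_output):
    # query_judge(prompt) is a hypothetical wrapper around mistral-large-2411.
    # The judge sees only the two answers, never the producing model's identity.
    prompt = (SCORING_RUBRIC + "\n\n"
              "Ground-truth answer: " + ground_truth + "\n"
              "Console output: " + console_output + "\n"
              "Score:")
    reply = query_judge(prompt)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # default to the lowest score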
4. RESULTS

4.1 OVERALL MODEL PERFORMANCE

Chart 1: Model Performance – Average Success Rates (%) with 95% Confidence Intervals

A bar chart comparing the success rates of all models in solving the 945 mathematical problems reveals significant performance disparities. The gpt-4o-mini-2024-07-18 achieved the highest accuracy at 83.7%, while open-codestral-mamba:v0.1 ranked lowest at 49.2%.

4.2 PERFORMANCE BY DIFFICULTY LEVEL

Chart 2: Performance by Difficulty Level (%)

A line graph illustrates a consistent downward trend in accuracy across all models as problem difficulty increases from Level 1 (simplest) to Level 5 (most complex).

This inverse correlation between difficulty and performance highlights persistent challenges in solving advanced mathematical problems.

4.3 PERFORMANCE ACROSS MATHEMATICAL DOMAINS

Chart 3: Performance by Task Type (%)

A heatmap (Chart 3) visualizes model-specific strengths and weaknesses across seven mathematical domains.

4.4 COMPUTATIONAL EFFICIENCY

Chart 4: Average Execution Time per Model

A histogram (Chart 4) quantifies the computational efficiency of generated code. ministral-8b-2410 produced the fastest-executing programs (mean: 0.94 seconds), while olmo2:7b generated the slowest code (mean: 2.27 seconds).

4.5 IMPACT OF TOKEN-BY-TOKEN REGENERATION ON LLAMA3.1:8B

4.5.1 OVERALL PERFORMANCE COMPARISON

Chart 5: Original vs Improved Scores

Implementing token-by-token regeneration for llama3.1:8b yielded a marginal improvement in accuracy:

• Original: 63.3%
• Improved: 64.1%

This 0.8% gain suggests limited efficacy of the method for general problem-solving (Chart 5).

4.5.2 DIFFICULTY-LEVEL ANALYSIS

Chart 6: Performance by Difficulty Level – Original vs Improved

Improvements were concentrated in Level 1 problems, while changes at Levels 2–5 were statistically insignificant. This indicates the technique primarily enhances performance on simpler tasks (Chart 6).


4.5.3 COMPUTATIONAL OVERHEAD

Chart 7: Average Execution Time Comparison

The improved version reduced code execution time by 36.7%:

• Original: 1.99 seconds
• Improved: 1.26 seconds

Despite minimal accuracy gains, the optimization significantly enhanced computational efficiency.

4.5.4 DOMAIN-SPECIFIC EFFECTS

Chart 8: Heatmap Comparison

Token-by-token regeneration improved Algebra performance but showed neutral or negative effects in other domains.

5. KEY OBSERVATIONS

The gpt-4o-mini-2024-07-18 (83.7%) outperformed all other models, while open-codestral-mamba:v0.1 (49.2%) exhibited the lowest accuracy, highlighting substantial variability in problem-solving capabilities across architectures.

All models demonstrated a consistent inverse relationship between accuracy and problem difficulty, with performance dropping by 10-32% from Level 1 to Level 5 tasks.

Algebra emerged as the strongest domain, while Number Theory posed the greatest challenge, suggesting domain-specific architectural biases. Faster-executing models (e.g., ministral-8b-2410: 0.94 s) did not correlate with higher accuracy, indicating computational efficiency does not inherently improve solution quality.

Marginal accuracy gains (+0.8%) for llama3.1:8b were accompanied by a 36.7% reduction in the execution time of the resulting programs, implying potential for resource-optimized code generation despite limited problem-solving improvements.

Less than 1% of generated code contained unsafe constructs (e.g., infinite loops), necessitating stricter runtime sandboxing.

3.17% of problems remained unsolved by all models after 10 attempts (i.e., no attempt scored above 3), underscoring the need for enhanced reasoning techniques.

6. SUMMARY AND CONCLUSIONS

This study evaluated 10 language models on 945 competition-level mathematical problems, revealing critical insights [24] into their problem-solving capabilities. Commercial models (e.g., gpt-4o-mini) significantly outperformed open-source counterparts, with a 34.5% accuracy gap between the best and worst performers. Domain-specific weaknesses, particularly in Number Theory, persist across architectures.

While iterative regeneration provided minimal accuracy improvements for llama3.1:8b, the 36.7% faster execution time of its generated code suggests utility in latency-sensitive applications [25]. The technique's domain-specific efficacy warrants further investigation.

Future Directions:

• Exploration of RAG frameworks (or similar) to address unsolved problems.
• Domain-specific fine-tuning to bridge performance gaps in challenging areas like Number Theory.

These findings [24] establish a benchmark for LLM-based mathematical problem-solving while identifying actionable pathways for improving accuracy and computational efficiency.

REFERENCES

[1] Brown, T., et al. Language Models are Few-Shot Learners. NeurIPS (2020). https://arxiv.org/abs/2005.14165 (Accessed January 28, 2025, 4:31 PM).
[2] Lewkowycz, A., et al. Solving Quantitative Reasoning Problems with Language Models. arXiv (2022). https://arxiv.org/abs/2206.14858 (Accessed January 28, 2025, 4:32 PM).
[3] Chen, M., et al. Evaluating Large Language Models Trained on Code. OpenAI (2021). https://arxiv.org/abs/2107.03374 (Accessed January 28, 2025, 4:33 PM).
[4] Hendrycks, D., et al. Measuring Mathematical Problem Solving with the MATH Dataset. NeurIPS (2021). https://arxiv.org/abs/2103.03874 (Accessed January 28, 2025, 4:34 PM).
[5] Hendrycks, D., et al. The MATH Dataset. GitHub (2021). https://github.com/hendrycks/math (Accessed January 28, 2025, 4:35 PM).
[6] Introducing Mistral Large. Mistral AI Blog (2024). https://mistral.ai/news/mistral-large/ (Accessed January 28, 2025, 4:36 PM).
[7] Introducing Meta Llama 3.1: The New Standard for Open LLMs. Meta AI Blog (2024). https://ai.meta.com/blog/meta-llama-3-1/ (Accessed January 28, 2025, 4:37 PM).
[8] GPT-4o-mini: Advancing Cost-Efficient Intelligence. OpenAI Blog (2024). https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (Accessed January 28, 2025, 4:38 PM).
[9] Mamba-Codestral-7B-v0.1. Hugging Face System Card (2024). https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 (Accessed January 28, 2025, 4:39 PM).
[10] Chain-of-Thought Prompting. Prompting Guide (2023). https://www.promptingguide.ai/techniques/cot (Accessed January 28, 2025, 4:40 PM).
[11] Gao, L., et al. PAL: Program-aided Language Models. arXiv (2022). https://arxiv.org/abs/2211.10435 (Accessed January 28, 2025, 4:41 PM).
[12] Li, Y., et al. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv (2024). https://arxiv.org/abs/2401.01313 (Accessed January 28, 2025, 4:42 PM).
[13] Pan, Z., et al. Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments. arXiv (2024). https://arxiv.org/abs/2412.11908 (Accessed January 28, 2025, 4:43 PM).
[14] Python Sandbox Execution Bulletins. GitHub Gist (2024). https://gist.github.com/chigwell/9fbf572d928008010a1a8b6c81809331 (Accessed January 28, 2025, 4:44 PM).
[15] 945 Problems. GitHub Gist (2024). https://gist.github.com/chigwell/e033c29b5e9e48a968d5fdf0d4c1d131 (Accessed January 28, 2025, 4:45 PM).
[16] OLMo 2: 7B. Ollama (2024). https://ollama.com/library/olmo2:7b (Accessed January 28, 2025, 4:46 PM).
[17] Introducing Codestral. Mistral AI Blog (2024). https://mistral.ai/news/codestral-2501/ (Accessed January 28, 2025, 4:47 PM).
[18] Granite 3.1 Dense: 8B. Ollama (2024). https://ollama.com/library/granite3.1-dense (Accessed January 28, 2025, 4:48 PM).
[19] Ministral-8B-Instruct-2410. Hugging Face System Card (2024). https://huggingface.co/mistralai/Ministral-8B-Instruct-2410 (Accessed January 28, 2025, 4:49 PM).
[20] Gemini 1.5 Flash 8B is Now Generally Available for Use. Google Developers Blog (2024). https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/ (Accessed January 28, 2025, 4:50 PM).
[21] Mixtral of Experts. Mistral AI Blog (2024). https://mistral.ai/news/mixtral-of-experts/ (Accessed January 28, 2025, 4:51 PM).
[22] Command-R7B: 7B. Ollama (2024). https://ollama.com/library/command-r7b:7b (Accessed January 28, 2025, 4:52 PM).
[23] Prompt File. GitHub Gist (2024). https://gist.github.com/chigwell/33e0ac10a4d00453d3b3aaf195bab3e7 (Accessed January 28, 2025, 4:53 PM).
[24] Results Marks. GitHub Gist (2024). https://gist.github.com/chigwell/738336bd466dcf4bfbd4cf4e00d3bb20 (Accessed January 28, 2025, 4:54 PM).
[25] Improved Results Marks. GitHub Gist (2024). https://gist.github.com/chigwell/51a06b04f6673bd2cb07e38850fbbb78 (Accessed January 28, 2025, 4:55 PM).

AUTHOR BIOGRAPHIES

Evgenii Evstafev is a skilled software developer at the University of Cambridge, where he has been working since September 2022, specializing in identity and access management. He earned a Bachelor's degree in Business Informatics from the Higher School of Economics (2010-2014) and a Master's degree in Programming from Perm National Research Polytechnic University (2014-2016). Evgenii also taught at the Faculty of Mechanics and Mathematics at Perm State University while engaged in postgraduate studies (2016-2019). His professional journey spans over 11 years across various industries, including roles such as System Architect at L'Etoile (2021-2022) focusing on product development, Head of Analytics at Gazprombank (2020-2021), and Head of the Department for System Analysis and Software Design at Center 2M (2019-2020). Additionally, he worked on system development at the energy company T Plus (2016-2019).
