Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving
Evgenii Evstafev
ABSTRACT
Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework that uses mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token by token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. Token-by-token regeneration yielded only a slight accuracy improvement (+0.8%) for llama3.1:8b but reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend: harder problems correlated with lower accuracy across all models. Despite the use of controlled execution environments, a small fraction (less than 1%) of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
Keywords: Empirical Research Paper (Comparative Analysis and Benchmarking Study), Large Language Models, Mathematical Reasoning, Code Generation, Benchmarking, Token-by-Token Regeneration, Computational Efficiency, MATH Dataset, Domain-Specific Performance, Automated Evaluation, Safety Constraints.
However, output variability—such as inconsistent numerical formatting or symbolic representation (e.g., π vs. 3.1416)—complicates automated evaluation, necessitating robust scoring mechanisms. Key limitations persist:

• Performance degrades nonlinearly with problem complexity, as seen in GPT-4's 23% accuracy drop on Level 5 MATH problems [4].
• Models exhibit uneven proficiency across mathematical subjects, often struggling with combinatorics and modular arithmetic [13].
• Unrestricted code generation introduces vulnerabilities like infinite loops or unsafe system calls, requiring sandboxed execution environments [14].

These challenges underscore the need for standardized evaluation protocols and architectural innovations to enhance reliability. Existing benchmarks like MATH [4] rely on binary correctness scoring, overlooking partial solutions.

3. METHODOLOGY

Ten LLMs with 7 to 8 billion parameters were selected for their computational efficiency and diversity in training methodologies:

1. llama3.1:8b [7]: General-purpose LLM with strong NLP capabilities.
2. olmo2:7b [16]: Open-source model optimized for research reproducibility.
3. codestral-2501 [17]: Model specialized for code generation tasks.
4. gpt-4o-mini-2024-07-18 [8]: Compact variant of GPT-4.
5. granite3.1-dense:8b [18]: Dense model trained on large-scale datasets.
6. open-codestral-mamba:v0.1 [9]: Hybrid architecture combining code and general capabilities.
7. ministral-8b-2410 [19]: Lightweight model for on-device applications.
8. gemini-1.5-flash-8b [20]: Efficient model with strong task performance.
9. mistral-small-2409 [21]: Smaller variant of Mistral's architecture.
10. command-r7b:7b [22]: General-purpose conversational model.
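The exact execution harness is provided in the referenced gist [14]. As a rough illustration of the kind of sandboxing the pipeline requires, the following sketch runs a generated program in a separate Python process under a hard time limit; the function name and the 10-second limit are illustrative assumptions, not details taken from the paper.

import os
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: float = 10.0) -> dict:
    """Execute model-generated Python in a child process with a hard time limit.

    A subprocess plus a timeout guards against non-terminating programs such as
    infinite loops. This is only a sketch: it does not restrict filesystem or
    network access, which a production sandbox would also need to do.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        # Non-terminating code is treated as a failed execution.
        return {"stdout": "", "stderr": "timeout", "timed_out": True}
    finally:
        os.unlink(path)  # clean up the temporary script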
This inverse correlation between difficulty and performance highlights persistent challenges in solving advanced mathematical problems.

4.3 PERFORMANCE ACROSS MATHEMATICAL DOMAINS

Chart 4: Average Execution Time per Model

A histogram (Figure 4) quantifies the computational efficiency of generated code. ministral-8b-2410 produced the fastest-executing programs (mean: 0.94 seconds), while olmo2:7b generated the slowest code (mean: 2.27 seconds).

4.5 IMPACT OF TOKEN-BY-TOKEN REGENERATION ON LLAMA3.1:8B

4.5.1 OVERALL PERFORMANCE COMPARISON

Chart 6: Performance by Difficulty Level – Original vs Improved

Improvements were concentrated in Level 1 problems; changes at Levels 2–5 were statistically insignificant. This indicates the technique primarily enhances performance on simpler tasks (Figure 6).
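The per-model means behind Chart 4 (0.94 s for ministral-8b-2410 versus 2.27 s for olmo2:7b) and the 36.7% runtime reduction reported for the regenerated llama3.1:8b outputs rest on straightforward wall-clock timing of each generated program. A minimal sketch of such a measurement follows; the runner and the aggregation are assumptions for illustration, not the paper's actual code.

import subprocess
import sys
import time
from statistics import mean

def mean_execution_time(programs: list[str], timeout_s: float = 10.0) -> float:
    """Average wall-clock execution time (seconds) of a model's generated programs.

    Each program runs in its own Python subprocess; a program that exceeds the
    timeout is still counted (at roughly the timeout value), so slow or
    non-terminating code is not silently excluded from the average.
    """
    durations: list[float] = []
    for code in programs:
        start = time.perf_counter()
        try:
            subprocess.run([sys.executable, "-c", code],
                           capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            pass
        durations.append(time.perf_counter() - start)
    return mean(durations) if durations else 0.0

# Hypothetical usage producing per-model statistics comparable to Chart 4:
# avg_times = {name: mean_execution_time(progs) for name, progs in generated.items()}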
Despite minimal accuracy gains, the optimization significantly enhanced computational efficiency.

4.5.4 DOMAIN-SPECIFIC EFFECTS

Less than 1% of generated code contained unsafe constructs (e.g., infinite loops), necessitating stricter runtime sandboxing.

3.17% of problems remained unsolved by all models after 10 attempts (i.e., no attempt received a mark above 3), underscoring the need for enhanced reasoning techniques.
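The "unsolved after 10 attempts" criterion follows from the 5-point grading scheme: an attempt counts as solved only when the judge's mark exceeds 3. A minimal sketch of that retry loop is shown below; generate_solution and grade_answer are hypothetical stand-ins for the LLM call and the mistral-large-2411 judge, not functions from the paper's pipeline.

MAX_ATTEMPTS = 10      # retry budget used in the study
SOLVED_THRESHOLD = 3   # marks of 3 or below count as unsolved

def solve_with_retries(problem: str, generate_solution, grade_answer):
    """Regenerate answers until the judge's mark exceeds the threshold.

    generate_solution(problem) -> str and grade_answer(problem, answer) -> int
    (on a 1-5 scale) are hypothetical callables. Returns (best_mark, attempts).
    """
    best_mark = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        answer = generate_solution(problem)
        mark = grade_answer(problem, answer)
        best_mark = max(best_mark, mark)
        if mark > SOLVED_THRESHOLD:
            return best_mark, attempt
    return best_mark, MAX_ATTEMPTS   # still unsolved: mark never exceeded 3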
• Domain-specific fine-tuning to bridge performance gaps in challenging areas like Number Theory.

These findings [24] establish a benchmark for LLM-based mathematical problem-solving while identifying actionable pathways for improving accuracy and computational efficiency.

REFERENCES

[1] Brown, T., et al. Language Models are Few-Shot Learners. NeurIPS (2020). https://arxiv.org/abs/2005.14165 (Accessed January 28, 2025, 4:31 PM).
[2] Lewkowycz, A., et al. Solving Quantitative Reasoning Problems with Language Models. arXiv (2022). https://arxiv.org/abs/2206.14858 (Accessed January 28, 2025, 4:32 PM).
[3] Chen, M., et al. Evaluating Large Language Models Trained on Code. OpenAI (2021). https://arxiv.org/abs/2107.03374 (Accessed January 28, 2025, 4:33 PM).
[4] Hendrycks, D., et al. Measuring Mathematical Problem Solving with the MATH Dataset. NeurIPS (2021). https://arxiv.org/abs/2103.03874 (Accessed January 28, 2025, 4:34 PM).
[5] Hendrycks, D., et al. The MATH Dataset. GitHub (2021). https://github.com/hendrycks/math (Accessed January 28, 2025, 4:35 PM).
[6] Introducing Mistral Large. Mistral AI Blog (2024). https://mistral.ai/news/mistral-large/ (Accessed January 28, 2025, 4:36 PM).
[7] Introducing Meta Llama 3.1: The New Standard for Open LLMs. Meta AI Blog (2024). https://ai.meta.com/blog/meta-llama-3-1/ (Accessed January 28, 2025, 4:37 PM).
[8] GPT-4o-mini: Advancing Cost-Efficient Intelligence. OpenAI Blog (2024). https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (Accessed January 28, 2025, 4:38 PM).
[9] Mamba-Codestral-7B-v0.1. Hugging Face System Card (2024). https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 (Accessed January 28, 2025, 4:39 PM).
[10] Chain-of-Thought Prompting. Prompting Guide (2023). https://www.promptingguide.ai/techniques/cot (Accessed January 28, 2025, 4:40 PM).
[11] Gao, L., et al. PAL: Program-aided Language Models. arXiv (2022). https://arxiv.org/abs/2211.10435 (Accessed January 28, 2025, 4:41 PM).
[12] Li, Y., et al. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv (2024). https://arxiv.org/abs/2401.01313 (Accessed January 28, 2025, 4:42 PM).
[13] Pan, Z., et al. Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments. arXiv (2024). https://arxiv.org/abs/2412.11908 (Accessed January 28, 2025, 4:43 PM).
[14] Python Sandbox Execution Bulletins. GitHub Gist (2024). https://gist.github.com/chigwell/9fbf572d928008010a1a8b6c81809331 (Accessed January 28, 2025, 4:44 PM).
[15] 945 Problems. GitHub Gist (2024). https://gist.github.com/chigwell/e033c29b5e9e48a968d5fdf0d4c1d131 (Accessed January 28, 2025, 4:45 PM).
[16] OLMo 2: 7B. Ollama (2024). https://ollama.com/library/olmo2:7b (Accessed January 28, 2025, 4:46 PM).
[17] Introducing Codestral. Mistral AI Blog (2024). https://mistral.ai/news/codestral-2501/ (Accessed January 28, 2025, 4:47 PM).
[18] Granite 3.1 Dense: 8B. Ollama (2024). https://ollama.com/library/granite3.1-dense (Accessed January 28, 2025, 4:48 PM).
[19] Ministral-8B-Instruct-2410. Hugging Face System Card (2024). https://huggingface.co/mistralai/Ministral-8B-Instruct-2410 (Accessed January 28, 2025, 4:49 PM).
[20] Gemini 1.5 Flash 8B is Now Generally Available for Use. Google Developers Blog (2024). https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/ (Accessed January 28, 2025, 4:50 PM).
[21] Mixtral of Experts. Mistral AI Blog (2024). https://mistral.ai/news/mixtral-of-experts/ (Accessed January 28, 2025, 4:51 PM).
[22] Command-R7B: 7B. Ollama (2024). https://ollama.com/library/command-r7b:7b (Accessed January 28, 2025, 4:52 PM).
[23] Prompt File. GitHub Gist (2024). https://gist.github.com/chigwell/33e0ac10a4d00453d3b3aaf195bab3e7 (Accessed January 28, 2025, 4:53 PM).
[24] Results Marks. GitHub Gist (2024). https://gist.github.com/chigwell/738336bd466dcf4bfbd4cf4e00d3bb20 (Accessed January 28, 2025, 4:54 PM).
[25] Improved Results Marks. GitHub Gist (2024). https://gist.github.com/chigwell/51a06b04f6673b
AUTHOR BIOGRAPHIES