
Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving

Evgenii Evstafev A

A University Information Services (UIS), University of Cambridge, Roger Needham Building, 7 JJ Thomson Ave, Cambridge CB3 0RB, UK, [email protected]

ABSTRACT

Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration only slightly improved accuracy (+0.8%) for the model llama3.1:8b, it reduced the execution time of the generated code by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Even within controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.

TYPE OF PAPER AND KEYWORDS

Empirical Research Paper (Comparative Analysis and Benchmarking Study), Large Language Models, Mathematical Reasoning, Code Generation, Benchmarking, Token-by-Token Regeneration, Computational Efficiency, MATH Dataset, Domain-Specific Performance, Automated Evaluation, Safety Constraints.

1. INTRODUCTION

Large language models (LLMs) have demonstrated remarkable proficiency in natural language tasks [1], yet their ability to solve complex mathematical problems remains constrained by challenges in symbolic reasoning and precise output formatting [2]. While recent advances, such as code-augmented problem-solving, offer promising pathways [3], systematic evaluations of LLMs' mathematical capabilities—particularly across diverse architectures and difficulty levels—remain underexplored [4]. This study addresses this gap by benchmarking 10 LLMs (7B–8B parameters) on 945 competition-level mathematics problems from the MATH dataset [4, 5], focusing on their ability to generate executable Python code as a reasoning intermediate. The investigation introduces two contributions:

• A granular evaluation framework using mistral-large-2411 [6] for automated answer scoring, addressing inconsistencies in mathematical notation (e.g., fractions, symbolic constants).
• An empirical analysis of token-by-token regeneration—a dynamic output refinement technique—applied to llama3.1:8b [7] to assess its impact on accuracy and computational efficiency.

2. BACKGROUND AND RELATED WORK

Modern LLMs employ diverse strategies for mathematical problem-solving, including chain-of-thought prompting [10], program-aided language models (PAL) [11], and symbolic equation generation. Code-based approaches, where models generate executable programs to derive solutions, have gained traction for their ability to enforce logical rigor and mitigate hallucination [12].

However, output variability—such as inconsistent numerical formatting or symbolic representation (e.g., π vs. 3.1416)—complicates automated evaluation, necessitating robust scoring mechanisms. Key limitations persist:

• Performance degrades nonlinearly with problem complexity, as seen in GPT-4's 23% accuracy drop on Level 5 MATH problems [4].
• Models exhibit uneven proficiency across mathematical subjects, often struggling with combinatorics and modular arithmetic [13].
• Unrestricted code generation introduces vulnerabilities like infinite loops or unsafe system calls, requiring sandboxed execution environments [14].

These challenges underscore the need for standardized evaluation protocols and architectural innovations to enhance reliability. Existing benchmarks like MATH [4] rely on binary correctness scoring, overlooking partial solutions.

3. METHODOLOGY

3.1 DATASET CREATION

This study uses the MATH dataset, a publicly available collection of 12,500 challenging competition-level mathematics problems sourced from competitions such as AMC 10, AMC 12, and AIME. The dataset, described in the paper "Measuring Mathematical Problem Solving with the MATH Dataset" [4], includes step-by-step solutions and spans diverse mathematical domains.

A stratified subset of 945 problems [15] was curated to ensure balanced representation across 7 mathematical subjects (Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus) and 5 difficulty levels (Level 1: simplest, Level 5: most complex). Each subject-level combination contains 27 problems, yielding a total of 7 subjects × 5 levels × 27 problems = 945 problems. This structured sampling guarantees diversity in both topic coverage and problem complexity.

To streamline evaluation, the dataset was augmented with a dedicated field containing final numerical answers (without explanations). This design enables direct comparison between model-generated outputs and ground-truth solutions.
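As an illustration of this stratification, the minimal sketch below samples 27 problems per subject-level cell. The field names "subject" and "level" are assumptions, and the published subset [15] was not necessarily produced by this exact script.

import random
from collections import defaultdict

def stratified_subset(problems, per_cell=27, seed=0):
    # Group problems by (subject, level) and draw a fixed-size sample from
    # every cell: 7 subjects x 5 levels x 27 problems = 945 problems.
    cells = defaultdict(list)
    for p in problems:
        cells[(p["subject"], p["level"])].append(p)
    rng = random.Random(seed)
    subset = []
    for key, items in sorted(cells.items()):
        if len(items) < per_cell:
            raise ValueError(f"Not enough problems in stratum {key}")
        subset.extend(rng.sample(items, per_cell))
    return subset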
3.2 MODEL SELECTION

The study evaluates 10 language models of varying architectures and scales (7B–8B parameters), selected for their computational efficiency and diversity in training methodologies:

1. llama3.1:8b [7]: General-purpose LLM with strong NLP capabilities.
2. olmo2:7b [16]: Open-source model optimized for research reproducibility.
3. codestral-2501 [17]: Code-focused model for code generation tasks.
4. gpt-4o-mini-2024-07-18 [8]: Compact variant of GPT-4.
5. granite3.1-dense:8b [18]: Dense model trained on large-scale datasets.
6. open-codestral-mamba:v0.1 [9]: Hybrid architecture combining code and general capabilities.
7. ministral-8b-2410 [19]: Lightweight model for on-device applications.
8. gemini-1.5-flash-8b [20]: Efficient model with strong task performance.
9. mistral-small-2409 [21]: Smaller variant of Mistral's architecture.
10. command-r7b:7b [22]: General-purpose conversational model.

Token-by-token regeneration, a technique to refine outputs iteratively, was exclusively tested on llama3.1:8b to isolate its impact.
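The paper does not detail the exact regeneration procedure, so the sketch below reflects only one plausible reading: the answer is rebuilt one token at a time, with the accumulated prefix passed back to the model at each step. The next_token_fn interface is a hypothetical stand-in for whichever decoding API is actually used.

def regenerate_token_by_token(next_token_fn, prompt, max_tokens=512):
    # next_token_fn(prompt, prefix) is a caller-supplied (hypothetical)
    # function that returns the model's next token given the problem prompt
    # and the partial output generated so far, or None at end of sequence.
    prefix = ""
    for _ in range(max_tokens):
        token = next_token_fn(prompt, prefix)
        if token is None:
            break
        prefix += token
    return prefix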
Evaluation Constraints:

• Each model independently solved all 945 problems.
• Models were instructed to "generate Python code that prints the final answer to the console." [23]
• Code Generation Limit: 2 minutes per attempt.
• Execution Timeout: 1 minute per code run to prevent infinite loops.
• Retry Mechanism: Up to 10 attempts per problem, with error messages (e.g., syntax/runtime errors) fed back to the model for iterative refinement, as sketched below.
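A minimal sketch of this retry loop follows, assuming a hypothetical ask_model(prompt) wrapper around the model under test; the 2-minute generation limit is omitted for brevity, and the actual harness may differ.

import subprocess
import sys

MAX_ATTEMPTS = 10
EXEC_TIMEOUT = 60  # one minute per code run

def solve_with_retries(ask_model, problem_text):
    # ask_model(prompt) is a hypothetical wrapper that returns Python source.
    prompt = (problem_text + "\n"
              "Generate Python code that prints the final answer to the console.")
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        code = ask_model(prompt + feedback)
        try:
            run = subprocess.run([sys.executable, "-c", code],
                                 capture_output=True, text=True,
                                 timeout=EXEC_TIMEOUT)
        except subprocess.TimeoutExpired:
            feedback = "\nPrevious attempt exceeded the 60-second execution timeout."
            continue
        if run.returncode == 0 and run.stdout.strip():
            return run.stdout.strip()  # console output passed on for scoring
        feedback = "\nPrevious attempt failed with error:\n" + run.stderr
    return None  # unsolved after 10 attempts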
A restricted execution environment permitted only safe built-in functions (e.g., mathematical operations, control flow structures) to mitigate security risks [14].
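One common way to approximate such a restriction is to execute the generated code against a whitelisted set of built-ins, as in the minimal sketch below. This is illustrative only: exec-based sandboxes are not a complete security boundary, and the environment actually used in the study [14] may differ.

import io
import math
from contextlib import redirect_stdout

SAFE_BUILTINS = {
    "abs": abs, "min": min, "max": max, "sum": sum, "range": range,
    "len": len, "int": int, "float": float, "round": round,
    "print": print, "enumerate": enumerate, "zip": zip,
}

def run_restricted(code):
    # Execute generated code with a curated set of built-ins plus the math
    # module, capturing whatever it prints to the console.
    env = {"__builtins__": SAFE_BUILTINS, "math": math}
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        exec(code, env)
    return buffer.getvalue().strip()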
3.3 EVALUATION METRICS

To address variability in mathematical answer formats (e.g., fractions, symbolic notation like π, or decimal representations), the correctness of console outputs was evaluated using mistral-large-2411 [6], a high-performance language model. This approach ensures robust and consistent scoring despite differences in output formatting.

For every problem, the console output generated by a model's Python code was independently assessed by mistral-large-2411. The evaluator model was blinded to the source model's identity to eliminate bias.

Answers were scored on a 5-point scale by comparing the generated output to the ground-truth answer from the MATH dataset:

• 5 (Correct): Exact match or mathematically equivalent result.
• 4 (Almost Correct): Minor formatting discrepancies or rounding errors.
• 3 (Partially Correct): Partial solution with significant inaccuracies.
• 2 (Incorrect): Wrong answer but relevant to the problem.
• 1 (Completely Incorrect): Irrelevant or nonsensical output.

The primary metric is the weighted accuracy, calculated as the percentage of answers scoring 4 or 5 across the dataset. Scores ≤3 are treated as incorrect.
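For concreteness, the metric reduces to the short computation below; the example scores are invented for illustration, not taken from the study's results.

def weighted_accuracy(scores):
    # Percentage of answers scored 4 or 5; scores of 3 or below are incorrect.
    if not scores:
        return 0.0
    correct = sum(1 for s in scores if s >= 4)
    return 100.0 * correct / len(scores)

print(weighted_accuracy([5, 4, 3, 5, 2, 4, 1, 5]))  # prints 62.5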
The evaluator model received only the ground-truth answer and the generated console output, without metadata about the source model.

This method quantifies solution quality more granularly than binary correctness, enabling future analysis of incremental improvements (e.g., from "partially correct" to "almost correct").

To address variability in mathematical answer formats (e.g., fractions vs. decimals, symbolic notation like π vs. numerical approximations), direct string comparison proves insufficient for robust evaluation. Such discrepancies necessitate expert judgment to assess equivalence, particularly for context-dependent representations.

The mistral-large-2411 model was employed as a proxy for domain expertise, evaluating console outputs against ground-truth answers through semantic equivalence rather than syntactic exactness. Its 5-point scoring scale accommodates partial correctness and formatting nuances, mirroring human expert evaluation. This approach avoids the brittleness of exact matching while ensuring systematic, bias-free assessment across diverse problem types and answer styles.
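The judging prompt itself is not reproduced in the paper; the sketch below only illustrates the general shape of such an LLM-as-judge call. query_judge is a hypothetical wrapper around the mistral-large-2411 API, and the rubric text is a paraphrase of the scale defined above.

SCORING_RUBRIC = (
    "Compare the console output with the ground-truth answer and reply "
    "with a single digit: 5 = exact or mathematically equivalent, "
    "4 = minor formatting or rounding issues, 3 = partially correct, "
    "2 = wrong but relevant, 1 = irrelevant or nonsensical."
)

def score_answer(query_judge, ground_truth, console_output):
    # query_judge(prompt) is a hypothetical wrapper around mistral-large-2411.
    # The judge sees only the two answers, never the producing model's identity.
    prompt = (SCORING_RUBRIC + "\n\n"
              "Ground-truth answer: " + ground_truth + "\n"
              "Console output: " + console_output + "\n"
              "Score:")
    reply = query_judge(prompt)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # default to the lowest score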
4. RESULTS

4.1 OVERALL MODEL PERFORMANCE

Chart 1: Model Performance – Average Success Rates (%) with 95% Confidence Intervals

A bar chart comparing the success rates of all models in solving the 945 mathematical problems reveals significant performance disparities. The gpt-4o-mini-2024-07-18 achieved the highest accuracy at 83.7%, while open-codestral-mamba:v0.1 ranked lowest at 49.2%.

4.2 PERFORMANCE BY DIFFICULTY LEVEL

Chart 2: Performance by Difficulty Level (%)

A line graph illustrates a consistent downward trend in accuracy across all models as problem difficulty increases from Level 1 (simplest) to Level 5 (most complex).

This inverse correlation between difficulty and performance highlights persistent challenges in solving advanced mathematical problems.

4.3 PERFORMANCE ACROSS MATHEMATICAL DOMAINS

Chart 3: Performance by Task Type (%)

A heatmap (Chart 3) visualizes model-specific strengths and weaknesses across seven mathematical domains.

4.4 COMPUTATIONAL EFFICIENCY

Chart 4: Average Execution Time per Model

A histogram (Chart 4) quantifies the computational efficiency of generated code. ministral-8b-2410 produced the fastest-executing programs (mean: 0.94 seconds), while olmo2:7b generated the slowest code (mean: 2.27 seconds).

4.5 IMPACT OF TOKEN-BY-TOKEN REGENERATION ON LLAMA3.1:8B

4.5.1 OVERALL PERFORMANCE COMPARISON

Chart 5: Original vs Improved Scores

Implementing token-by-token regeneration for llama3.1:8b yielded a marginal improvement in accuracy:

• Original: 63.3%
• Improved: 64.1%

This 0.8% gain suggests limited efficacy of the method for general problem-solving (Chart 5).

4.5.2 DIFFICULTY-LEVEL ANALYSIS

Chart 6: Performance by Difficulty Level – Original vs Improved

Improvements were concentrated in Level 1 problems, while changes at Levels 2–5 were statistically insignificant. This indicates the technique primarily enhances performance on simpler tasks (Chart 6).


4.5.3 COMPUTATIONAL OVERHEAD

Chart 7: Average Execution Time Comparison

The improved version reduced code execution time by 36.7%:

• Original: 1.99 seconds
• Improved: 1.26 seconds

Despite minimal accuracy gains, the optimization significantly enhanced computational efficiency.

4.5.4 DOMAIN-SPECIFIC EFFECTS

Chart 8: Heatmap Comparison

Token-by-token regeneration improved Algebra performance but showed neutral or negative effects in other domains.

5. KEY OBSERVATIONS

The gpt-4o-mini-2024-07-18 (83.7%) outperformed all other models, while open-codestral-mamba:v0.1 (49.2%) exhibited the lowest accuracy, highlighting substantial variability in problem-solving capabilities across architectures.

All models demonstrated a consistent inverse relationship between accuracy and problem difficulty, with performance dropping by 10-32% from Level 1 to Level 5 tasks.

Algebra emerged as the strongest domain, while Number Theory posed the greatest challenge, suggesting domain-specific architectural biases. Faster-executing models (e.g., ministral-8b-2410: 0.94 s) did not correlate with higher accuracy, indicating computational efficiency does not inherently improve solution quality.

Marginal accuracy gains (+0.8%) for llama3.1:8b were accompanied by a 36.7% reduction in the execution time of the resulting programs, implying potential for resource-optimized code generation despite limited problem-solving improvements.

Less than 1% of generated code contained unsafe constructs (e.g., infinite loops), necessitating stricter runtime sandboxing.

3.17% of problems remained unsolved by all models after 10 attempts (i.e., no attempt scored above 3), underscoring the need for enhanced reasoning techniques.

6. SUMMARY AND CONCLUSIONS

This study evaluated 10 language models on 945 competition-level mathematical problems, revealing critical insights [24] into their problem-solving capabilities. Commercial models (e.g., gpt-4o-mini) significantly outperformed open-source counterparts, with a 34.5% accuracy gap between the best and worst performers. Domain-specific weaknesses, particularly in Number Theory, persist across architectures.

While iterative regeneration provided minimal accuracy improvements for llama3.1:8b, the 36.7% faster execution time of its generated code suggests utility in latency-sensitive applications [25]. The technique's domain-specific efficacy warrants further investigation.

Future Directions:

• Exploration of RAG frameworks (or similar) to address unsolved problems.
• Domain-specific fine-tuning to bridge performance gaps in challenging areas like Number Theory.

These findings [24] establish a benchmark for LLM-based mathematical problem-solving while identifying actionable pathways for improving accuracy and computational efficiency.

REFERENCES

[1] Brown, T., et al. Language Models are Few-Shot Learners. NeurIPS (2020). https://arxiv.org/abs/2005.14165 (Accessed January 28, 2025, 4:31 PM).
[2] Lewkowycz, A., et al. Solving Quantitative Reasoning Problems with Language Models. arXiv (2022). https://arxiv.org/abs/2206.14858 (Accessed January 28, 2025, 4:32 PM).
[3] Chen, M., et al. Evaluating Large Language Models Trained on Code. OpenAI (2021). https://arxiv.org/abs/2107.03374 (Accessed January 28, 2025, 4:33 PM).
[4] Hendrycks, D., et al. Measuring Mathematical Problem Solving with the MATH Dataset. NeurIPS (2021). https://arxiv.org/abs/2103.03874 (Accessed January 28, 2025, 4:34 PM).
[5] Hendrycks, D., et al. The MATH Dataset. GitHub (2021). https://github.com/hendrycks/math (Accessed January 28, 2025, 4:35 PM).
[6] Introducing Mistral Large. Mistral AI Blog (2024). https://mistral.ai/news/mistral-large/ (Accessed January 28, 2025, 4:36 PM).
[7] Introducing Meta Llama 3.1: The New Standard for Open LLMs. Meta AI Blog (2024). https://ai.meta.com/blog/meta-llama-3-1/ (Accessed January 28, 2025, 4:37 PM).
[8] GPT-4o-mini: Advancing Cost-Efficient Intelligence. OpenAI Blog (2024). https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (Accessed January 28, 2025, 4:38 PM).
[9] Mamba-Codestral-7B-v0.1. Hugging Face System Card (2024). https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 (Accessed January 28, 2025, 4:39 PM).
[10] Chain-of-Thought Prompting. Prompting Guide (2023). https://www.promptingguide.ai/techniques/cot (Accessed January 28, 2025, 4:40 PM).
[11] Gao, L., et al. PAL: Program-aided Language Models. arXiv (2022). https://arxiv.org/abs/2211.10435 (Accessed January 28, 2025, 4:41 PM).
[12] Li, Y., et al. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv (2024). https://arxiv.org/abs/2401.01313 (Accessed January 28, 2025, 4:42 PM).
[13] Pan, Z., et al. Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments. arXiv (2024). https://arxiv.org/abs/2412.11908 (Accessed January 28, 2025, 4:43 PM).
[14] Python Sandbox Execution Bulletins. GitHub Gist (2024). https://gist.github.com/chigwell/9fbf572d928008010a1a8b6c81809331 (Accessed January 28, 2025, 4:44 PM).
[15] 945 Problems. GitHub Gist (2024). https://gist.github.com/chigwell/e033c29b5e9e48a968d5fdf0d4c1d131 (Accessed January 28, 2025, 4:45 PM).
[16] OLMo 2: 7B. Ollama (2024). https://ollama.com/library/olmo2:7b (Accessed January 28, 2025, 4:46 PM).
[17] Introducing Codestral. Mistral AI Blog (2024). https://mistral.ai/news/codestral-2501/ (Accessed January 28, 2025, 4:47 PM).
[18] Granite 3.1 Dense: 8B. Ollama (2024). https://ollama.com/library/granite3.1-dense (Accessed January 28, 2025, 4:48 PM).
[19] Ministral-8B-Instruct-2410. Hugging Face System Card (2024). https://huggingface.co/mistralai/Ministral-8B-Instruct-2410 (Accessed January 28, 2025, 4:49 PM).
[20] Gemini 1.5 Flash 8B is Now Generally Available for Use. Google Developers Blog (2024). https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/ (Accessed January 28, 2025, 4:50 PM).
[21] Mixtral of Experts. Mistral AI Blog (2024). https://mistral.ai/news/mixtral-of-experts/ (Accessed January 28, 2025, 4:51 PM).
[22] Command-R7B: 7B. Ollama (2024). https://ollama.com/library/command-r7b:7b (Accessed January 28, 2025, 4:52 PM).
[23] Prompt File. GitHub Gist (2024). https://gist.github.com/chigwell/33e0ac10a4d00453d3b3aaf195bab3e7 (Accessed January 28, 2025, 4:53 PM).
[24] Results Marks. GitHub Gist (2024). https://gist.github.com/chigwell/738336bd466dcf4bfbd4cf4e00d3bb20 (Accessed January 28, 2025, 4:54 PM).
[25] Improved Results Marks. GitHub Gist (2024). https://gist.github.com/chigwell/51a06b04f6673bd2cb07e38850fbbb78 (Accessed January 28, 2025, 4:55 PM).

AUTHOR BIOGRAPHIES

Evgenii Evstafev is a skilled software developer at the University of Cambridge, where he has been working since September 2022, specializing in identity and access management. He earned a Bachelor's degree in Business Informatics from the Higher School of Economics (2010-2014) and a Master's degree in Programming from Perm National Research Polytechnic University (2014-2016). Evgenii also taught at the Faculty of Mechanics and Mathematics at Perm State University while engaged in postgraduate studies (2016-2019). His professional journey spans over 11 years across various industries, including roles such as System Architect at L'Etoile (2021-2022) focusing on product development, Head of Analytics at Gazprombank (2020-2021), and Head of the Department for System Analysis and Software Design at Center 2M (2019-2020). Additionally, he worked on system development at the energy company T Plus (2016-2019).
