
LM4OPT: Unveiling the Potential of Large Language Models in Formulating Mathematical Optimization Problems

Tasnim Ahmed, Salimur Choudhury
School of Computing, Queen's University
Kingston, Ontario K7L 2N8, Canada
{tasnim.ahmed, s.choudhury}@queensu.ca

Abstract

In the rapidly evolving field of natural language processing, the translation of linguistic descriptions into mathematical formulations of optimization problems presents a formidable challenge, demanding intricate understanding and processing capabilities from Large Language Models (LLMs). This study compares prominent LLMs, including GPT-3.5, GPT-4, and Llama-2-7b, in zero-shot and one-shot settings for this task. Our findings show GPT-4's superior performance, particularly in the one-shot scenario. A central part of this research is the introduction of 'LM4OPT,' a progressive fine-tuning framework for Llama-2-7b that utilizes noisy embeddings and specialized datasets. However, this research highlights a notable gap in the contextual understanding capabilities of smaller models such as Llama-2-7b compared to larger counterparts, especially in processing lengthy and complex input contexts. Our empirical investigation, utilizing the NL4Opt dataset, reveals that GPT-4 surpasses the baseline performance established by previous research, achieving an F1-score of 0.63 based solely on the natural language problem description, without relying on any additional named entity information. GPT-3.5 follows closely, both outperforming the fine-tuned Llama-2-7b. These findings not only benchmark the current capabilities of LLMs in a novel application area but also lay the groundwork for future improvements in the mathematical formulation of optimization problems from natural language input.

Introduction

Numerous practical challenges originating from diverse domains such as operations, economics, engineering, and computer science can be articulated as optimization problems (AhmadiTeshnizi, Gao, and Udell 2023). Standard optimization algorithms, including the simplex (Nash 2000) and interior-point methods (Karmarkar 1984), can efficiently address these problems. Nevertheless, translating a real-world situation into a mathematical formulation necessitates specialized knowledge. This expertise barrier hinders many individuals from utilizing optimization algorithms, even when these could substantially enhance their operations. Automating problem formulation, which involves translating natural language descriptions into decision variables, constraints, and objective functions, has the potential to make these processes accessible to individuals beyond operations research experts. Consequently, optimization modeling would become accessible to those who cannot afford experts to improve efficiency using optimization techniques. Provided the problem is correctly formulated, it can be readily solved by transcribing it into an algebraic modeling language interpretable by solvers (Ramamonjison et al. 2023).

The field of Natural Language Processing (NLP) presents a potent avenue for enhancing the accessibility and efficiency of optimization problem formulation. From the inception of word embeddings to the evolution of language models, NLP has undergone transformative progress over the years. Especially with the emergence of pre-trained language models (Devlin et al. 2019), these models have attained state-of-the-art results on a multitude of NLP tasks such as natural language inference, question answering, summarization, and collaborative writing, with minimal task-specific fine-tuning (Laskar, Hoque, and Huang 2021). The recent advancements in LLMs, including GPT (OpenAI 2023) and Llama (Touvron et al. 2023), have significantly reshaped the NLP landscape and practices. These LLMs, with parameter counts exceeding several billion and even reaching hundreds of billions, have exhibited remarkable generalization abilities in zero-shot and few-shot settings through prompting. Furthermore, these LLMs have shown exceptional fine-tuning capabilities, even when fine-tuned on datasets significantly smaller than those used by their predecessors.

To this end, a formal assessment of this specific task, the mathematical formulation of optimization problems from natural language descriptions, using the latest developments from the GPT series, namely GPT-3.5 and GPT-4, which have garnered widespread recognition, remains uncharted territory. Additionally, this research aims to investigate the capabilities and limitations of a smaller Large Language Model (LLM), Llama-2-7b, when fine-tuned on this task. Consequently, this study offers the following contributions:

• Comprehensive analysis of GPT-3.5, GPT-4, and Llama-2-7b in the mathematical formulation of optimization problems from natural language descriptions.
• Evaluation in zero-shot and one-shot settings to understand the impact of few-shot prompt engineering and the learning adaptations of the models.
• Empirical study using the NL4Opt (Ramamonjison et al. 2023) dataset, demonstrating the superior performance of GPT-4, followed by GPT-3.5.
• Exploration of the LM4OPT framework for fine-tuning Llama-2-7b, revealing significant performance enhancements.
Related Work

Efforts to simplify combinatorial optimization using LLMs have seen diverse approaches, aiming to make the process user-friendly for laypersons. The NL4Opt (Ramamonjison et al. 2023) competition stands out, exploring the transformation of natural language into structured optimization models. In Task 1, described in (Dakle et al. 2023), the aim is to accurately identify and label the components of optimization models (objectives, variables, and constraints) within natural language texts. Researchers approached this using classical NER techniques that rely on the morphological and grammatical properties of the text. Additionally, modern methods were employed, involving pre-trained LLMs like BERT and GPT, which were further fine-tuned on optimization-specific datasets to better understand the unique language of optimization problems. Task 2 required building mathematical representations from these elements, a more complex step involving deeper model comprehension. The methodologies here included sequence-to-sequence models, which are adept at handling such translation tasks.

The former two-step approach to generating mathematical formulations from optimization problem descriptions requires training and maintaining two separate models. To bridge this research gap, Tsouros et al. (2023) proposed an all-in-one LLM-based model that creates optimization models directly from prompts, showing early potential on the dataset described in NL4Opt but without established benchmarks for comparison. Advancing this approach, Teshnizi et al. (AhmadiTeshnizi, Gao, and Udell 2023) presented a novel framework named OptiMUS, which utilizes LLMs (pre-trained GPT) to formulate and solve Mixed Integer Linear Programming (MILP) problems from natural language descriptions. They introduced a dataset, NLP4LP, containing linear programming and MILP problems to benchmark OptiMUS, which shows significant improvement over basic LLM prompting strategies. OptiMUS integrates mathematical modeling, Gurobi solver code generation, automated testing, and debugging in a cohesive system that streamlines the optimization problem-solving process. The goal of that study is to democratize access to optimization techniques across various domains, thereby broadening the use of optimization tools beyond expert circles.

Furthermore, Yang et al. (2023) introduced another prompt-based framework, OPRO, which uses LLMs to optimize problems without needing traditional solvers. OPRO works by iteratively improving solutions using a 'meta-prompt' that incorporates both the problem description and feedback from previous solutions. It aims to learn continuously as it updates the meta-prompt with new information. To ensure stable results, OPRO generates several solutions at each iteration, balancing the need to explore different options with refining existing ones. The authors demonstrated encouraging preliminary outcomes when applying their methods to the GSM8K (Cobbe et al. 2021) and BBH (Suzgun et al. 2022) datasets, in addition to tasks such as linear regression and the traveling salesman problem. The effectiveness of OPRO for complex optimization tasks is yet to be fully determined.

In a recent study focused on practical applications, researchers introduced the OptiGuide framework (Li et al. 2023), a novel integration of combinatorial optimization technology with advanced Large Language Models (LLMs) such as GPT-4, aimed at augmenting decision-making processes within supply chain management. This framework transforms user queries into in-context learning (ICL) queries for LLM processing, generating code that is vetted for accuracy and reliability. Upon validation, this code interfaces with specific components like optimization solvers and databases to derive solutions. The results, converted into understandable explanations by the LLM, simplify complex supply chain optimizations for non-technical users, fostering trust in automated decisions. In practical deployments, such as Microsoft Azure's supply chain, OptiGuide has exhibited promising outcomes, achieving an average accuracy of 93% with GPT-4 and highlighting its effectiveness in real-world settings. A summary of recent works in the field of optimization and language models is shown in Table 1.

Despite these strides, a gap persists: an end-to-end system that allows users the flexibility to verify and modify a mathematical problem formulation, independent of the solver or programming language used. Addressing this, our research identifies a niche for benchmarking popular pre-trained LLMs on the specific task of optimization problem formulation and for developing a tailored fine-tuning approach to enhance LLM specificity for this nuanced application. This work endeavors to bridge that gap, offering a robust benchmark and a novel fine-tuning strategy that could significantly benefit the scientific community's pursuit of democratizing optimization modeling.
Research Work | Dataset | Problem Type | Human-in-the-loop | Multiple LLMs | Fine-tuning | Prompt Engineering | Objective
NER4Opt | NL4Opt | Optimization | × | × | X | × | Identifying named entities
NL4Opt Competition | NL4Opt | Optimization | × | X | X | × | Mathematical Formulation
Holy Grail 2.0 | − | Optimization | − | − | − | − | Mathematical Formulation
OPRO | GSM8K, BBH | Math word, Common-sense, Optimization | × | × | × | X | Problem Solution
OptiMUS | NLP4LP | Optimization | X | X | × | X | Problem Solution
OptiGuide | Private | Supply chain management | × | × | × | X | Problem Solution (QA Session)
LM4OPT (ours) | NL4Opt | Optimization | × | × | X | X | Mathematical Formulation

Table 1: Recent works in the field of Optimization and Language Models

Task Formulation

This research investigates a generative task in natural language processing, concentrating on the generation of mathematical formulations for optimization problems derived from textual descriptions. Our objective is to derive structured representations, encompassing variables, constraints, and the objective function, from given natural language descriptions. We utilize a dataset comprising a set S of problem descriptions and a set C of their corresponding formulations in canonical mathematical form. At the core of our methodology is the introduction of an intermediate representational set, R, which encapsulates the essential components of optimization problems (variables, constraints, and objective functions) in an equation-centric format, as opposed to the final matrix form depicted in C. For a given problem description s ∈ S, the primary goal of an LLM is to predict an intermediate representation r ∈ R. Finally, the predicted intermediate representation r undergoes a systematic conversion into the canonical formulation c ∈ C, to facilitate a comprehensive evaluation of the performance of the LLM. This process is exemplified in Figure 1, where an example of a problem description along with the corresponding intermediate representation and canonical form is provided. It should be noted that the constraints are transformed into a format embodying 'less than or equal to' conditions, and the objective function is reformulated into a minimization paradigm.
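To make the r → c conversion concrete, the following is a minimal sketch of a rule-based converter for the constraint format shown in Figure 1. The parsing rules are illustrative assumptions on our part; the paper does not publish its conversion code.

```python
# A minimal sketch of a rule-based r -> c conversion (illustrative
# assumptions; the actual converter is not published).
import re

def to_canonical(constraints, variables):
    """Turn '(a) * x + (b) * y <= rhs' strings into [a, b, rhs] rows."""
    rows = []
    for constraint in constraints:
        lhs, rhs = constraint.split("<=")
        coeffs = dict.fromkeys(variables, 0.0)
        # Match terms such as '(-1.0) * cleaners'.
        for coef, var in re.findall(r"\(([-0-9.]+)\)\s*\*\s*(\w+)", lhs):
            coeffs[var] = float(coef)
        rows.append([coeffs[v] for v in variables] + [float(rhs)])
    return rows

variables = ["cleaners", "receptionists"]
constraints = [
    "(-1.0) * cleaners + (-1.0) * receptionists <= -100.0",
    "(-0.0) * cleaners + (-1.0) * receptionists <= -20.0",
    "(0.33) * cleaners + (-1.0) * receptionists <= -0.0",
    "(500.0) * cleaners + (350.0) * receptionists <= 30000.0",
]
print(to_canonical(constraints, variables))
# [[-1.0, -1.0, -100.0], [-0.0, -1.0, -20.0],
#  [0.33, -1.0, -0.0], [500.0, 350.0, 30000.0]]
```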
Methodology

In contemporary research, language models are conceptualized as functions that accept a textual input context and yield a corresponding textual output. This paradigm is predominantly instantiated through transformer-based architectures, introduced by Vaswani et al. (2017), which have since revolutionized the field of NLP. The quintessential aspect of transformer language models is their reliance on self-attention mechanisms, which encode input contexts by weighing the importance of different parts of the input text relative to each other. However, these models face a notable limitation in processing long text sequences due to the quadratic increase in computational complexity with longer inputs (Devlin et al. 2019). This leads to a restricted context window during pre-training, limiting the model's ability to maintain and utilize long-term dependencies and to integrate information from distant text segments. Consequently, this impacts the model's effectiveness in tasks requiring extensive contextual understanding (Brown et al. 2020). To this end, our experiments investigate the performance of LLMs in zero-shot and one-shot pre-trained settings, alongside a smaller LLM specifically fine-tuned for the task of mathematical formulation of optimization problems.
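For reference, the self-attention operation underlying these architectures (Vaswani et al. 2017) is Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where Q, K, and V are the query, key, and value projections of the n input token embeddings and d_k is the key dimension; materializing the n × n matrix QKᵀ is what drives the quadratic growth in compute and memory with input length noted above.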
For this purpose, we evaluate the GPT-3.5, GPT-4, and Llama-2-7b models. As fine-tuning is not a prerequisite for inference in these LLMs, our approach centers on the development of optimal prompt instructions for both zero-shot and one-shot settings. This development is guided by the prompt optimization techniques delineated in (Yang et al. 2023). Additionally, to explore the impact of fine-tuning on a task-specific dataset, we selected the Llama-2-7b model, primarily due to its comparatively lower resource demands. This model was fine-tuned using the NL4Opt dataset, allowing for an in-depth analysis of fine-tuning effects on model performance within this specific context. Optimized instructions for fine-tuning, zero-shot, and one-shot prompts are provided in Figure 2.

Advanced Tuning of Llama-2-7b via LM4OPT

A progressive fine-tuning strategy was employed for the Llama-2-7b model, enabling it to initially adapt to a broader domain context related to the final task. This preliminary adaptation phase is crucial in enhancing the model's comprehension and performance capabilities. Following this, the model undergoes further fine-tuning on a specialized, task-specific dataset, where it applies the knowledge acquired in the initial phase to achieve improved performance and generalization on the target task. Prior to its fine-tuning on the NL4Opt dataset, the model was fine-tuned on GSM8K, a dataset comprising high-quality, linguistically diverse grade school math word problems crafted by human problem writers (Cobbe et al. 2021). This sequential fine-tuning approach effectively leverages the broader contextual understanding gained from GSM8K, thereby refining the model's performance on the NL4Opt tasks.

In the fine-tuning phase, a methodological approach integrating Low-Rank Adaptation (LoRA) (Hu et al. 2021) with Parameter-Efficient Fine-Tuning (PEFT) (Liu et al. 2022) was employed. The fine-tuning process involved carefully adjusting the low-rank matrices introduced by LoRA, ensuring minimal yet strategic changes to the pre-existing weights. This method preserves the general linguistic understanding gained from pre-training while efficiently steering the model toward the specialized task of mathematical problem formulation. The effectiveness of this approach is evident in the improved ability to parse and translate complex natural language descriptions into structured mathematical representations, a crucial requirement for the NL4Opt dataset. PEFT extends this concept by selectively fine-tuning only a small subset of the parameters. By adopting PEFT, the fine-tuning process becomes computationally less demanding and more feasible on standard hardware, while still achieving performance comparable to full-model fine-tuning. The synergy between LoRA and PEFT in fine-tuning Llama-2-7b is particularly effective in addressing the challenges of adapting a large model to a specific task.

Furthermore, the inclusion of Noisy Embedding Instruction Fine-Tuning (NEFTune) (Jain et al. 2023) further augmented the fine-tuning process. NEFTune, by integrating controlled random noise into the embedding vectors during training, prevents the model from overfitting to the specifics of the training dataset, such as formatting details and exact wording. Instead, it encourages the model to generate responses that are more coherent, longer, and more diverse. A detailed configuration of our experimental setup is described in the following subsection.
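To make the adapter-based setup concrete, a minimal sketch of attaching LoRA adapters to Llama-2-7b with the Hugging Face peft library follows. The rank, scaling factor, and target modules shown are illustrative assumptions rather than settings reported in the paper.

```python
# A minimal sketch of attaching LoRA adapters to Llama-2-7b via peft.
# Rank, scaling, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                  # assumed low-rank dimension
    lora_alpha=32,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed injection points
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# The pre-trained weights stay frozen; only the low-rank update
# matrices introduced by LoRA are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```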
Problem Description:
A hotel employs cleaners and receptionists. Cleaners earn $500 per week and receptionists earn $350 per week. The hotel requires a minimum of 100 workers of whom at least 20 must be receptionists. To keep the hotel clean and running smoothly, the number of receptionists should be at least a third of the number of cleaners. The hotel wants to keep the weekly wage bill below $30000. Formulate an LP to minimize the wage bill.

Intermediate Representation:
Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −100.0
(−0.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −20.0
(0.33) ∗ cleaners + (−1.0) ∗ receptionists ≤ −0.0
(500.0) ∗ cleaners + (350.0) ∗ receptionists ≤ 30000.0
Objective Function:
minimize (500.0) ∗ cleaners + (350.0) ∗ receptionists

Canonical Form:
[[-1.0, -1.0, -100.0],
 [0.0, -1.0, -20.0],
 [0.33, -1.0, 0.0],
 [500.0, 350.0, 30000.0]]
Objective: [500.0, 350.0]

Figure 1: Task Representation

Fine-tuning Instruction
Imagine you are a combinatorial optimization problem solver. I will give you a problem description. Your task is to find the variables,
constraints, and objective functions from that description. In your response, all the constraints must be in the less than or equal to format.
Your response must contain only these 3 parts: - Variables, Constraints, and Objective Function. There must be no extra strings before or
after it.
Zero-shot Instruction
Imagine you are a combinatorial optimization problem solver. I will give you a problem description. Your task is to find the variables,
constraints, and objective functions from the description. I am giving you an example response format; your output should be formatted
like this. Example Response:
“Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −100.0
(−0.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −20.0
(0.33) ∗ cleaners + (−1.0) ∗ receptionists ≤ −0.0
(500.0) ∗ cleaners + (350.0) ∗ receptionists ≤ 30000.0
Objective Function:
minimize(500.0) ∗ cleaners + (350.0) ∗ receptionist”.
Now, below is the actual problem description that you have to solve. In your response, all the constraints must be in the less than or equal
to format. Your response must contain only these 3 parts: Variables, Constraints, and Objective Function. There must be no extra strings
before or after it. Problem description to solve:

One-shot Instruction
Imagine you are a combinatorial optimization problem solver. I will give you a problem description. Your task is to find the variables,
constraints, and objective functions from that description. Before that, I am giving you an example problem description and response for
your understanding; Your response should be formatted like this. Example Problem Description:
“A hotel employs cleaners and receptionists. Cleaners earn $500 per week and receptionists earn $350 per week. The hotel requires
a minimum of 100 workers of whom at least 20 must be receptionists. To keep the hotel clean and running smoothly, the number of
receptionists should be at least a third of the number of cleaners. The hotel wants to keep the weekly wage bill below $30000. Formulate
an LP to minimize the wage bill.”
Example Response for the given example problem:
“Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −100.0
(−0.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −20.0
(0.33) ∗ cleaners + (−1.0) ∗ receptionists ≤ −0.0
(500.0) ∗ cleaners + (350.0) ∗ receptionists ≤ 30000.0
Objective Function: minimize(500.0) ∗ cleaners + (350.0) ∗ receptionist”.
Now, below is the actual problem description that you have to solve. In your response, all the constraints must be in the less than or equal
to format. Your response must contain only these 3 parts: Variables, Constraints, and Objective Function. There must be no extra strings
before or after it. Problem description to solve:

Figure 2: Instruction set for the Prompts to LLMs
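The instruction sets above can be issued to the GPT models through the chat-completions interface. The following sketch uses the pre-1.0 OpenAI Python SDK that was current at the stated access date; the client code and message layout are our assumptions, as the paper reports only the model identifiers.

```python
# A sketch of issuing the zero-shot instruction from Figure 2 to the GPT
# models via the pre-1.0 OpenAI Python SDK (SDK version and client code
# are assumptions; the paper reports only the model identifiers).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

ZERO_SHOT_INSTRUCTION = (
    "Imagine you are a combinatorial optimization problem solver. "
    "I will give you a problem description. Your task is to find the "
    "variables, constraints, and objective functions from the description. "
    "... (full instruction as in Figure 2)"
)

def formulate(problem_description, model="gpt-4-0613"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": ZERO_SHOT_INSTRUCTION},
            {"role": "user", "content": problem_description},
        ],
        temperature=0,  # assumed: deterministic decoding for evaluation
    )
    return response["choices"][0]["message"]["content"]
```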

The incorporation of methodologies such as progressive fine-tuning, LoRA, PEFT, and NEFTune into the conventional fine-tuning framework of Large Language Models (LLMs) has notably augmented the inferential efficacy of the Llama-2-7b model. This strategic enhancement is particularly salient for a generative language model of this scale, with a parameter count of only 7 billion, especially in intricate tasks that challenge even more extensive models like GPT-3.5 and GPT-4 in their capacity to comprehend and maintain prolonged and complex contexts.

Experimental Setup

The fine-tuning of the Llama-2-7b model was conducted on an NVIDIA A40 GPU, equipped with 48 GB of VRAM, over a span of 7 epochs. This process leveraged the dataset division suggested by the authors of NL4Opt (Ramamonjison et al. 2023), segregating it into training, validation, and evaluation subsets. A batch size of 4 was employed, coupled with a gradient accumulation step of 1, and the AdamW (Loshchilov and Hutter 2017) optimizer was utilized. The initial learning rate was set at 3e−4, with a weight decay factor of 0.001. A random noisy embedding strength of 5 provided the most satisfactory results during the fine-tuning process. A maximum response sequence length of 200 was designated, under the premise that model outputs would not exceed this threshold for this specific task. Furthermore, the implementation of Gradient Checkpointing (Chen et al. 2016) facilitated a more resource-efficient fine-tuning framework.

An additional aspect of this research involved estimating the carbon footprint associated with the fine-tuning phase, guided by the methodology proposed by Lannelongue et al. (Lannelongue, Grealey, and Inouye 2021). This analysis revealed that each fine-tuning session of the Llama-2-7b model produced approximately 23.52 grams of CO2 emissions, underscoring the relatively modest environmental impact of fine-tuning the model for specialized tasks.
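Expressed in code, the reported configuration corresponds to settings like the following. The use of TRL's SFTTrainer is our assumption (the paper specifies only the hyperparameter values), and the single training record is a placeholder for the NL4Opt training split.

```python
# A sketch of the reported training configuration (7 epochs, batch size 4,
# gradient accumulation 1, AdamW, lr 3e-4, weight decay 0.001, NEFTune
# noise strength 5, gradient checkpointing). Use of TRL's SFTTrainer is an
# assumption; the paper reports only the hyperparameter values.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# In practice, the LoRA-wrapped model from the earlier sketch is used here.
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder for the NL4Opt training split (one illustrative record).
train_dataset = Dataset.from_list([{
    "text": "### Problem: A hotel employs cleaners and receptionists...\n"
            "### Solution: Variables: cleaners, receptionists ...",
}])

args = TrainingArguments(
    output_dir="lm4opt-llama2-7b",
    num_train_epochs=7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    weight_decay=0.001,
    optim="adamw_torch",            # AdamW (Loshchilov and Hutter 2017)
    gradient_checkpointing=True,    # memory-efficient backpropagation
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=1024,            # assumed; generation is capped at 200 tokens
    neftune_noise_alpha=5,          # noisy embedding strength of 5
)
trainer.train()
```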
Result and Discussion

A comprehensive assessment of various LLMs was conducted, focusing on their capability in formulating optimization problems. This evaluation was based on prompt-based zero-shot and one-shot learning experiments. The performances of these LLMs were compared against the established baseline provided by Ramamonjison et al. (2023), as detailed in Table 2. For a consistent and objective assessment, the same scoring mechanism employed in the baseline evaluation was adopted, ensuring a fair and direct comparison of the performance of the LLMs relative to the existing benchmark on this task.
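The scoring operates over the canonical forms. Below is a minimal sketch of a declaration-level F1 computation over canonical constraint rows; it mirrors the spirit of the NL4Opt scorer but is a simplified stand-in assumption, not the official implementation.

```python
# A simplified, declaration-level F1 over canonical constraint rows.
# This mirrors the spirit of the NL4Opt scorer but is a stand-in
# assumption, not the official implementation.
def f1_score(predicted, gold):
    pred = {tuple(row) for row in predicted}
    true = {tuple(row) for row in gold}
    matched = len(pred & true)          # rows recovered exactly
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(true)
    return 2 * precision * recall / (precision + recall)

gold = [[-1.0, -1.0, -100.0], [0.0, -1.0, -20.0]]
pred = [[-1.0, -1.0, -100.0], [0.33, -1.0, 0.0]]
print(f1_score(pred, gold))  # 0.5: one of the two rows recovered
```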
Language Model | k-Shot | F1-score
Baseline (Ramamonjison et al. 2023) | − | 0.610
Llama-2-7b | 0 | 0.1259
Llama-2-7b | 1 | 0.1022
GPT-3.5 | 0 | 0.4381
GPT-3.5 | 1 | 0.4928
GPT-4 | 0 | 0.6072
GPT-4 | 1 | 0.6330

Table 2: Performance evaluation of LLMs for optimization problem formulation. The best performance in terms of F1-score is highlighted in bold. GPT-3.5 (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-0613) were accessed through the OpenAI API (https://platform.openai.com/docs/models) on November 1, 2023. The Llama-2-7b model is fine-tuned using the proposed LM4OPT framework.

The baseline performance in Table 2 is derived from a fine-tuned BART (Lewis et al. 2019) model, which operates under different input conditions compared to the LLMs. While LLMs like Llama-2 and GPT receive instruction prompts and problem descriptions in natural language, the baseline BART model is also provided with named entity information extracted from the natural language problem descriptions. This additional data potentially contributes to the baseline's competitive F1-score of 0.61. The GPT-4 model, especially in the one-shot setting, outperforms the others, including the baseline and GPT-3.5, with an F1-score of 0.6330. This superior performance can be attributed to GPT-4's advanced architecture and larger training corpus, as suggested by recent studies emphasizing the enhanced contextual understanding and response accuracy of more extensive models (OpenAI 2023). Conversely, Llama-2-7b, despite being a smaller model, shows notable performance improvements in the zero-shot setting compared to one-shot, which aligns with the finding that smaller models might struggle with longer context prompts.

Model | k-Shot | Fine-tune | NEFTune | F1-Score
Llama-2-7b | 0 | × | × | 0.0036
Llama-2-7b | 0 | N | × | 0.0617
Llama-2-7b | 1 | N | × | 0.0581
Llama-2-7b | 0 | N | X | 0.0770
Llama-2-7b | 1 | N | X | 0.0693
Llama-2-7b | 0 | P | X | 0.1259
Llama-2-7b | 1 | P | X | 0.1022

Table 3: Performance comparison of fine-tuned Llama-2-7b. 'N' in the 'Fine-tune' column represents non-progressive fine-tuning, whereas 'P' refers to progressive fine-tuning. The best performance is highlighted in bold.

Table 3 showcases the performance comparison of the Llama-2-7b model under various fine-tuning conditions. It assesses the F1-score across different configurations, including zero-shot and one-shot settings (k-Shot), with and without fine-tuning, and the application of Noisy Embedding Fine-Tuning (NEFTune). Notably, progressive fine-tuning using the LM4OPT framework (P), especially in the zero-shot setting, significantly enhances performance, achieving the highest F1-score of 0.1259. This indicates the efficacy of progressive fine-tuning combined with NEFTune in improving the model's ability to understand and formulate optimization problems, as opposed to non-progressive fine-tuning (N) and the baseline without any fine-tuning.
A notable observation from Table 3 is the superior
outcomes in zero-shot settings compared to their one-shot counterparts across all configurations. This phenomenon could be attributed to the hypothesis that a smaller model like Llama-2-7b struggles with longer contexts. The data suggest that in scenarios involving extended contexts, the model tends to exhibit behavior indicative of hallucination and produces repetitive responses that lack coherence with the broader context. Such patterns reinforce the notion that smaller models may face challenges in maintaining consistency and relevance in responses as the prompt length increases, a critical consideration in optimizing model performance for complex tasks.

Effect of Progressive Fine-tuning

As shown in Table 3, fine-tuning specifically for instruction processing significantly enhanced the performance of the Llama-2-7b model. Initially, the pre-trained Llama-2-7b, in both zero-shot and one-shot settings, exhibited substantial hallucination. A notable example of this was the model generating two distinct sets of variables within a single response, and its output format often did not align with the given prompt instructions, as demonstrated in Figure 3. Performance improved significantly after progressively fine-tuning the model, although, as the response samples make evident, the fine-tuned Llama-2-7b was still penalized by its inability to consistently maintain a specific response format. It is hypothesized that involving human evaluators, or a human-in-the-loop approach for minor modifications to the outputs, could significantly improve its efficiency. Such interventions could potentially bring the performance of a smaller model like Llama-2-7b closer to that of some of the larger models.

Does Increased Instruction Length Always Enhance Performance?

Upon a thorough examination of the results and the outputs from both the GPT and Llama models, it became evident that longer instructions do not universally enhance responses across all models. Extended, detailed instructions were beneficial for larger models like GPT-3.5 and GPT-4: they resolved parsing issues common in scenarios where multiple formulations are correct but are scored differently due to manual parsing in the scoring mechanism. By specifying the solution format, these larger models were guided to respond in a particular way. For instance, GPT-3.5 in a zero-shot setting produced the equation largeships ≤ smallships, whereas in a one-shot setting it generated largeships − smallships ≤ 0. The latter formulation, after parsing, yielded a higher score. However, with Llama-2-7b, a smaller model, longer instructions led to issues such as repetition and hallucination, particularly noticeable in one-shot settings. As illustrated in Figure 4, Llama-2-7b not only repeated parts of the instructions but also generated nonexistent variables such as x1, x2, x3, deviating from the original problem description.

Limitations

In this study, certain limitations have been identified that bear on the research outcomes. A noticeable constraint within the dataset utilized for this research is its composition of straightforward, formally structured samples replete with specific optimization domain terminologies like 'formulate an LP.' This framing diverges from our overarching aim of assessing the efficacy of LLMs in interpreting and formulating optimization problems as presented in natural language by individuals unversed in domain-specific jargon. It is posited that this dataset limitation might yield a discrepancy between the documented performance of LLMs and their practical utility for domain-agnostic users. Moreover, resource constraints impeded the exploration of progressive fine-tuning effects on larger LLMs, such as Llama-2-70b and GPT-3.5, which may have offered additional insights. Furthermore, the adoption of a rule-based approach for converting intermediate representations to canonical forms has its drawbacks. Upon meticulous review, it was observed that some LLM-generated intermediate representations were inaccurately formatted, leading to canonical forms that diverged from the ground truth. While these discrepancies influenced the LLMs' performance metrics, it is conjectured that such nuances would be within human interpretive capabilities, suggesting that a collaborative human-model approach might counterbalance the observed performance degradation linked to format conversions. The interaction between what the model produces and how humans understand it highlights an important area for future studies, emphasizing the need to harmonize machine precision with human judgment.

Conclusion

In this study, we undertook a comprehensive evaluation of LLMs such as GPT-3.5, GPT-4, and Llama-2-7b, focusing on their ability to translate natural language descriptions into mathematical formulations of optimization problems. The research highlights that while GPT-4 exhibits superior performance in both zero-shot and one-shot scenarios, there is a notable capability gap with smaller models like Llama-2-7b, particularly in handling complex contexts. Progressive fine-tuning of Llama-2-7b, especially with noisy embeddings and specialized datasets using our proposed LM4OPT framework, significantly enhances its performance. These findings contribute to understanding the capabilities and limitations of LLMs in a novel application area, paving the way for future improvements in the field of optimization and operations research. Drawing upon the foundational research by Teshnizi et al. (AhmadiTeshnizi, Gao, and Udell 2023), this study lays the groundwork for future extensions, wherein the intermediate mathematical representations derived from natural language descriptions could serve as precursors for LLMs to generate ILP solver code in Python.
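As an illustration of that direction, the hotel LP of Figure 1 lowers naturally to solver code. The following PuLP sketch is our own illustrative example of such a target, not output produced by any of the evaluated models.

```python
# An illustrative target for future LLM-generated solver code (our own
# example using the PuLP library, not model output): the hotel LP of
# Figure 1.
import pulp

prob = pulp.LpProblem("hotel_wage_bill", pulp.LpMinimize)
cleaners = pulp.LpVariable("cleaners", lowBound=0)
receptionists = pulp.LpVariable("receptionists", lowBound=0)

prob += 500.0 * cleaners + 350.0 * receptionists            # objective: wage bill
prob += cleaners + receptionists >= 100                     # at least 100 workers
prob += receptionists >= 20                                 # at least 20 receptionists
prob += 3 * receptionists >= cleaners                       # at least a third of cleaners
prob += 500.0 * cleaners + 350.0 * receptionists <= 30000   # wage-bill cap

# Note: as specified in the dataset, the wage cap conflicts with the
# 100-worker minimum (100 receptionists already cost $35,000), so CBC
# reports infeasibility. NL4Opt scores the formulation, not a solved
# optimum, so this does not affect the evaluation above.
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])
```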
Pre-trained:

### Solution
```
Variables: thin, stubby
Constraints:
(50.0) ∗ thin + (90.0) ∗ baking <= 3000.0
(30.0) ∗ thin + (150.0) ∗ baking <= 4000.0
(5.0) ∗ thin + (9.0) ∗ baking <= 0.0
(5.0) ∗ thin + (9.0) ∗ baking <= 0.0
Objective Function:
maximize(5.0) ∗ thin + (9.0) ∗ baking
```
### Solution
```
Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists <= −100.0
(−0.0) ∗ cleaners + (−1.0

Fine-tuned:

Example Response: "Variables: thinjar, stubbyjar
Constraints:
(50.0) ∗ thinjar + (30.0) ∗ stubbyjar <= 3000.0
(90.0) ∗ thinjar + (150.0) ∗ stubbyjar <= 4000.0
Objective Function:
maximize(5.0) ∗ thinjar + (9.0) ∗ stubbyjar"

Figure 3: Sample responses from pre-trained and fine-tuned Llama-2-7b. Hallucinating behavior is highlighted in red.

Looping:

Response: "Variables: apple, pear
Constraints:
(−1.0) ∗ apple + (−0.0) ∗ pear ≤ −5.0
(−0.0) ∗ apple + (−1.0) ∗ pear ≤ −10.0
(1.0) ∗ apple + (−0.0) ∗ pear ≤ −0.0
(1.0) ∗ apple + (2.0) ∗ pear ≤ 0.0
Objective Function: maximize(2.0) ∗ apple + (4.0) ∗ pear".
## Part 2: An easy LP In this part, you are given a problem description and you have to find the variables and constraints. Your response must be in the following format:

Hallucination:

Variables: x1, x2, x3
Constraints:
(−1.0) ∗ x1 + (−0.0)∗

Figure 4: Looping behavior and hallucination for few-shot instructions in Llama-2-7b. The repetitive and hallucinating behaviors are highlighted in red.

References

AhmadiTeshnizi, A.; Gao, W.; and Udell, M. 2023. OptiMUS: Optimization Modeling Using MIP Solvers and Large Language Models. ArXiv, abs/2310.06116.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T. J.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. ArXiv, abs/2005.14165.

Chen, T.; Xu, B.; Zhang, C.; and Guestrin, C. 2016. Training Deep Nets with Sublinear Memory Cost. ArXiv, abs/1604.06174.

Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. ArXiv, abs/2110.14168.

Dakle, P.; Kadioğlu, S.; Uppuluri, K.; Politi, R.; Raghavan, P.; Rallabandi, S. K.; and Srinivasamurthy, R. S. 2023. Ner4Opt: Named Entity Recognition for Optimization Modelling from Natural Language. In Integration of AI and OR Techniques in Constraint Programming.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics.

Hu, J. E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models. ArXiv, abs/2106.09685.

Jain, N.; Chiang, P.-y.; Wen, Y.; Kirchenbauer, J.; Chu, H.-M.; Somepalli, G.; Bartoldson, B.; Kailkhura, B.; Schwarzschild, A.; Saha, A.; Goldblum, M.; Geiping, J.; and Goldstein, T. 2023. NEFTune: Noisy Embeddings Improve Instruction Finetuning. ArXiv, abs/2310.05914.

Karmarkar, N. 1984. A New Polynomial-Time Algorithm for Linear Programming. Combinatorica, 4: 373–395.

Lannelongue, L.; Grealey, J.; and Inouye, M. 2021. Green Algorithms: Quantifying the Carbon Footprint of Computation. Advanced Science, 8(12): 2100707.

Laskar, M. T. R.; Hoque, E.; and Huang, J. 2021. Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization. Computational Linguistics, 48: 279–320.

Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A. R.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Annual Meeting of the Association for Computational Linguistics.

Li, B.; Mellou, K.; Zhang, B.; Pathuri, J.; and Menache, I. 2023. Large Language Models for Supply Chain Optimization. ArXiv, abs/2307.03875.

Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; and Raffel, C. 2022. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. ArXiv, abs/2205.05638.

Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.

Nash, J. C. 2000. The (Dantzig) Simplex Method for Linear Programming. Computing in Science & Engineering, 2: 29–31.

OpenAI. 2023. GPT-4 Technical Report. ArXiv, abs/2303.08774.

Ramamonjison, R.; Yu, T. T.; Li, R.; Li, H.; Carenini, G.; Ghaddar, B.; He, S.; Mostajabdaveh, M.; Banitalebi-Dehkordi, A.; Zhou, Z.; and Zhang, Y. 2023. NL4Opt Competition: Formulating Optimization Problems Based on Their Natural Language Descriptions. ArXiv, abs/2303.08233.

Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; Chi, E. H.; Zhou, D.; and Wei, J. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In Annual Meeting of the Association for Computational Linguistics.

Touvron, H.; Martin, L.; Stone, K. R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D. M.; Blecher, L.; Ferrer, C. C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; Fu, W.; Fuller, B.; Gao, C.; Goswami, V.; Goyal, N.; Hartshorn, A. S.; Hosseini, S.; Hou, R.; Inan, H.; Kardas, M.; Kerkez, V.; Khabsa, M.; Kloumann, I. M.; Korenev, A. V.; Koura, P. S.; Lachaux, M.-A.; Lavril, T.; Lee, J.; Liskovich, D.; Lu, Y.; Mao, Y.; Martinet, X.; Mihaylov, T.; Mishra, P.; Molybog, I.; Nie, Y.; Poulton, A.; Reizenstein, J.; Rungta, R.; Saladi, K.; Schelten, A.; Silva, R.; Smith, E. M.; Subramanian, R.; Tan, X.; Tang, B.; Taylor, R.; Williams, A.; Kuan, J. X.; Xu, P.; Yan, Z.; Zarov, I.; Zhang, Y.; Fan, A.; Kambadur, M.; Narang, S.; Rodriguez, A.; Stojnic, R.; Edunov, S.; and Scialom, T. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv, abs/2307.09288.

Tsouros, D. C.; Verhaeghe, H.; Kadıoğlu, S.; and Guns, T. 2023. Holy Grail 2.0: From Natural Language to Constraint Models. ArXiv, abs/2308.01589.

Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. In Neural Information Processing Systems.

Yang, C.; Wang, X.; Lu, Y.; Liu, H.; Le, Q. V.; Zhou, D.; and Chen, X. 2023. Large Language Models as Optimizers. ArXiv, abs/2309.03409.
