canonical formulation, denoted as c ∈ C, to facilitate a comprehensive evaluation of LLM performance. This process is exemplified in Figure 1, where an example of a problem description along with the corresponding intermediate representation and canonical form is provided. It should be noted that all constraints are transformed into 'less than or equal to' form, and the objective function is reformulated as a minimization.
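To make this conversion concrete, the following is a minimal sketch of a rule-based routine that maps an intermediate representation to the canonical matrix form of Figure 1. The paper describes its conversion as rule-based but does not publish the code, so the parsing format and function names below are illustrative assumptions.

    import re

    def to_canonical(variables, constraints, objective):
        """Convert an intermediate representation (strings) into canonical
        rows [[a_1, ..., a_n, b], ...] with each constraint expressed as
        a_1*x_1 + ... + a_n*x_n <= b.

        Assumes constraints are already written in '<=' form, as the prompt
        instructions require; a complete implementation would also flip
        '>=' rows and negate 'maximize' objectives."""
        coeff = re.compile(r"\(([-+]?\d*\.?\d+)\)\s*\*\s*(\w+)")

        rows = []
        for line in constraints:
            lhs, rhs = line.split("<=")
            terms = {name: float(c) for c, name in coeff.findall(lhs)}
            rows.append([terms.get(v, 0.0) for v in variables] + [float(rhs)])

        obj_terms = {name: float(c) for c, name in coeff.findall(objective)}
        return rows, [obj_terms.get(v, 0.0) for v in variables]

    rows, obj = to_canonical(
        ["cleaners", "receptionists"],
        ["(-1.0) * cleaners + (-1.0) * receptionists <= -100.0",
         "(500.0) * cleaners + (350.0) * receptionists <= 30000.0"],
        "minimize (500.0) * cleaners + (350.0) * receptionists",
    )
    # rows == [[-1.0, -1.0, -100.0], [500.0, 350.0, 30000.0]]
    # obj  == [500.0, 350.0]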
Methodology

In contemporary research, language models are conceptualized as functions that accept a textual input context and yield a corresponding textual output. This paradigm is predominantly instantiated through transformer-based architectures, introduced by Vaswani et al. (2017), which have since revolutionized the field of NLP. The quintessential aspect of transformer language models is their reliance on self-attention mechanisms, which encode input contexts by weighing the importance of different parts of the input text relative to each other. However, these models face a notable limitation in processing long text sequences, as computational complexity increases quadratically with input length (Devlin et al. 2019). This leads to a restricted context window during pre-training, limiting the model's ability to maintain and utilize long-term dependencies and to integrate information from distant text segments. Consequently, this impacts the model's effectiveness in tasks requiring extensive contextual understanding (Brown et al. 2020). To this end, our experiments investigate the performance of LLMs in zero-shot and one-shot pre-trained settings, alongside a smaller LLM fine-tuned specifically for the task of mathematical formulation of optimization problems.

For this purpose, we evaluate the GPT-3.5, GPT-4, and Llama-2-7b models. As fine-tuning is not a prerequisite for inference in these LLMs, our approach centers on the development of optimal prompt instructions for both zero-shot and one-shot settings, guided by the prompt optimization techniques delineated in (Yang et al. 2023). Additionally, to explore the impact of fine-tuning on a task-specific dataset, we selected the Llama-2-7b model, primarily due to its comparatively lower resource demands. This model was fine-tuned using the NL4Opt dataset, allowing for an in-depth analysis of fine-tuning effects on model performance within this specific context. Optimized instructions for fine-tuning, zero-shot, and one-shot prompts are provided in Figure 2.
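As a concrete illustration of the zero-shot setting, the sketch below sends the Figure 2 instruction plus a problem description to one of the hosted models. The paper accessed gpt-3.5-turbo-0613 and gpt-4-0613 through the OpenAI API; the client usage and the temperature setting here are our assumptions, not reported details.

    from openai import OpenAI

    # Truncated here; the full instruction text is given in Figure 2.
    ZERO_SHOT_INSTRUCTION = (
        "Imagine you are a combinatorial optimization problem solver. "
        "I will give you a problem description. Your task is to find the "
        "variables, constraints, and objective functions from the description. ..."
    )

    def formulate(problem_description: str, model: str = "gpt-4-0613") -> str:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # assumption: deterministic decoding for evaluation
            messages=[{
                "role": "user",
                "content": f"{ZERO_SHOT_INSTRUCTION}\n\n"
                           f"Problem description to solve:\n{problem_description}",
            }],
        )
        return response.choices[0].message.content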
Advanced Tuning of Llama-2-7b via LM4OPT

A progressive fine-tuning strategy was employed for the Llama-2-7b model, enabling it to initially adapt to a broader domain context related to the final task. This preliminary adaptation phase is crucial in enhancing the model's comprehension and performance capabilities. Following this, the model undergoes further fine-tuning on a specialized, task-specific dataset, where it applies the knowledge acquired in the initial phase to achieve improved performance and generalization on the target task. Prior to its fine-tuning on the NL4Opt dataset, the model was fine-tuned on GSM8K, a dataset comprising high-quality, linguistically diverse grade school math word problems crafted by human problem writers (Cobbe et al. 2021). This sequential fine-tuning approach effectively leverages the broader contextual understanding gained from GSM8K, thereby refining the model's performance on the NL4Opt tasks.
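In outline, progressive fine-tuning is simply two standard fine-tuning passes run in sequence, with the weights from the first stage initializing the second. The helper below is a minimal sketch built on the Hugging Face Trainer; it assumes the datasets are already tokenized, and all names are illustrative rather than taken from the paper's code.

    from transformers import (DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    def progressive_finetune(model, tokenizer, stages):
        """Fine-tune `model` on each dataset in `stages` in order, so the
        weights learned on the broad-domain stage (e.g. GSM8K) seed the
        task-specific stage (e.g. NL4Opt)."""
        for name, dataset in stages.items():  # e.g. {"gsm8k": ..., "nl4opt": ...}
            trainer = Trainer(
                model=model,
                args=TrainingArguments(output_dir=f"ckpt-{name}"),
                train_dataset=dataset,  # assumed already tokenized
                data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
            )
            trainer.train()
        return model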
In the fine-tuning phase, a methodological approach integrating Low-Rank Adaptation (LoRA) (Hu et al. 2021) with Parameter-Efficient Fine-Tuning (PEFT) (Liu et al. 2022) was employed. The fine-tuning process involved carefully adjusting the low-rank matrices introduced by LoRA, ensuring minimal yet strategic changes to the pre-existing weights. This method preserves the general linguistic understanding gained from pre-training while efficiently steering the model toward the specialized task of mathematical problem formulation. The effectiveness of this approach is evident in the improved ability to parse and translate complex natural language descriptions into structured mathematical representations, a crucial requirement for the NL4Opt dataset. PEFT, in turn, extends this concept by fine-tuning only a small subset of the parameters, making the process computationally less demanding and feasible on standard hardware while still achieving performance comparable to full-model fine-tuning. The synergy between LoRA and PEFT in fine-tuning Llama-2-7b is particularly effective in addressing the challenges of adapting large models to specific tasks.
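A minimal sketch of this setup with the Hugging Face peft library follows; the adapter rank, scaling, and target modules shown are common defaults for Llama-style models, not values reported in the paper.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Low-rank adapters on the attention projections; r, alpha, and the
    # target module list below are illustrative assumptions.
    config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of the 7B weights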
Furthermore, the inclusion of Noisy Embedding Instruction Fine-Tuning (NEFTune) (Jain et al. 2023) further augmented the fine-tuning process. By integrating controlled random noise into the embedding vectors during training, NEFTune prevents the model from overfitting to the specifics of the training dataset, such as formatting details and exact wording, and instead encourages it to generate responses that are more coherent, longer, and more diverse.
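NEFTune's noise injection can be expressed as a forward hook on the input embedding layer: noise drawn uniformly from [-1, 1] and scaled by α/√(Ld), for sequence length L and embedding dimension d (Jain et al. 2023). The hook below is a sketch of that rule with α = 5, the noise strength reported in our experimental setup; how the training code wires it in is an assumption.

    import torch

    def neftune_hook(module, inputs, output, alpha=5.0):
        """During training, add uniform noise in [-scale, scale] to the
        embedding output, where scale = alpha / sqrt(L * d)."""
        if module.training:
            seq_len, dim = output.shape[1], output.shape[2]
            scale = alpha / (seq_len * dim) ** 0.5
            return output + torch.empty_like(output).uniform_(-scale, scale)
        return output

    # Attach to the input embeddings of the (PEFT-wrapped) model:
    # model.get_input_embeddings().register_forward_hook(neftune_hook)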
A detailed configuration of our experimental setup is described in the following subsection.

Figure 1: An example problem description with its corresponding intermediate representation and canonical form.

Problem Description:
A hotel employs cleaners and receptionists. Cleaners earn $500 per week and receptionists earn $350 per week. The hotel requires a minimum of 100 workers of whom at least 20 must be receptionists. To keep the hotel clean and running smoothly, the number of receptionists should be at least a third of the number of cleaners. The hotel wants to keep the weekly wage bill below $30000. Formulate an LP to minimize the wage bill.

Intermediate Representation:
Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −100.0
(−0.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −20.0
(0.33) ∗ cleaners + (−1.0) ∗ receptionists ≤ −0.0
(500.0) ∗ cleaners + (350.0) ∗ receptionists ≤ 30000.0
Objective Function:
minimize(500.0) ∗ cleaners + (350.0) ∗ receptionists

Canonical Form:
[[−1.0, −1.0, −100.0],
 [0.0, −1.0, −20.0],
 [0.33, −1.0, 0.0],
 [500.0, 350.0, 30000.0]],
[500.0, 350.0]
Figure 2: Optimized instructions for fine-tuning, zero-shot, and one-shot settings.

Fine-tuning Instruction
Imagine you are a combinatorial optimization problem solver. I will give you a problem description. Your task is to find the variables,
constraints, and objective functions from that description. In your response, all the constraints must be in the less than or equal to format.
Your response must contain only these 3 parts: Variables, Constraints, and Objective Function. There must be no extra strings before or
after it.
Zero-shot Instruction
Imagine you are a combinatorial optimization problem solver. I will give you a problem description. Your task is to find the variables,
constraints, and objective functions from the description. I am giving you an example response format; your output should be formatted
like this. Example Response:
“Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −100.0
(−0.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −20.0
(0.33) ∗ cleaners + (−1.0) ∗ receptionists ≤ −0.0
(500.0) ∗ cleaners + (350.0) ∗ receptionists ≤ 30000.0
Objective Function:
minimize(500.0) ∗ cleaners + (350.0) ∗ receptionists”.
Now, below is the actual problem description that you have to solve. In your response, all the constraints must be in the less than or equal
to format. Your response must contain only these 3 parts: Variables, Constraints, and Objective Function. There must be no extra strings
before or after it. Problem description to solve:
One-shot Instruction
Imagine you are a combinatorial optimization problem solver. I will give you a problem description. Your task is to find the variables,
constraints, and objective functions from that description. Before that, I am giving you an example problem description and response for
your understanding; Your response should be formatted like this. Example Problem Description:
“A hotel employs cleaners and receptionists. Cleaners earn $500 per week and receptionists earn $350 per week. The hotel requires
a minimum of 100 workers of whom at least 20 must be receptionists. To keep the hotel clean and running smoothly, the number of
receptionists should be at least a third of the number of cleaners. The hotel wants to keep the weekly wage bill below $30000. Formulate
an LP to minimize the wage bill.”
Example Response for the given example problem:
“Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −100.0
(−0.0) ∗ cleaners + (−1.0) ∗ receptionists ≤ −20.0
(0.33) ∗ cleaners + (−1.0) ∗ receptionists ≤ −0.0
(500.0) ∗ cleaners + (350.0) ∗ receptionists ≤ 30000.0
Objective Function: minimize(500.0) ∗ cleaners + (350.0) ∗ receptionists”.
Now, below is the actual problem description that you have to solve. In your response, all the constraints must be in the less than or equal
to format. Your response must contain only these 3 parts: Variables, Constraints, and Objective Function. There must be no extra strings
before or after it. Problem description to solve:
The incorporation of methodologies such as progressive fine-tuning, LoRA, PEFT, and NEFTune into the conventional fine-tuning framework of Large Language Models (LLMs) has notably augmented the inferential efficacy of the Llama-2-7b model. This enhancement is particularly salient for a generative language model of this scale, with a parameter count of only 7 billion, especially in intricate tasks that challenge even more extensive models like GPT-3.5 and GPT-4 in their capacity to comprehend and maintain prolonged and complex contexts.
Experimental Setup

The fine-tuning of the Llama-2-7b model was conducted on an NVIDIA A40 GPU, equipped with 48 GB of VRAM, over a span of 7 epochs. This process leveraged the dataset division suggested by the authors of NL4Opt (Ramamonjison et al. 2023), segregating it into training, validation, and evaluation subsets. A batch size of 4 was employed, coupled with a gradient accumulation step of 1, and the AdamW optimizer (Loshchilov and Hutter 2017) was utilized. The initial learning rate was set at 3e-4, with a weight decay factor of 0.001. A random noisy embedding strength of 5 provided the most satisfactory results during the fine-tuning process. A maximum response sequence length of 200 was designated, under the premise that model outputs would not exceed this threshold for this specific task. Furthermore, the implementation of gradient checkpointing (Chen et al. 2016) facilitated a more resource-efficient fine-tuning framework.
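Collected as a Hugging Face Trainer configuration, these reported hyperparameters look roughly as follows; the output directory name is illustrative, and any argument not mentioned in the paper is left at the library default.

    from transformers import TrainingArguments

    # Reported: 7 epochs, batch size 4, gradient accumulation 1,
    # AdamW with lr 3e-4 and weight decay 0.001, gradient checkpointing.
    args = TrainingArguments(
        output_dir="lm4opt-llama2-7b",
        num_train_epochs=7,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        learning_rate=3e-4,
        weight_decay=0.001,
        optim="adamw_torch",
        gradient_checkpointing=True,
    )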
An additional aspect of this research involved estimating the carbon footprint associated with the fine-tuning phase, guided by the methodology proposed by Lannelongue et al. (Lannelongue, Grealey, and Inouye 2021). This analysis revealed that each fine-tuning session of the Llama-2-7b model produced approximately 23.52 grams of CO2 emissions, underscoring the relatively modest environmental impact of fine-tuning the model for specialized tasks.
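For reference, the Green Algorithms estimate has the general form below, where t is runtime, n_c and P_c are core count and per-core power draw at utilization u_c, n_m and P_m are memory size and per-GB memory power, PUE is the data-centre power usage effectiveness, and CI is the carbon intensity of the local energy mix; the exact parameter values behind the 23.52 g figure are not reported in the paper.

    E = t \times (n_c P_c u_c + n_m P_m) \times \mathrm{PUE} \times 10^{-3}\ \text{kWh},
    \qquad \text{footprint} = E \times \mathrm{CI}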
Result and Discussion

A comprehensive assessment of various LLMs was conducted, focusing on their capability in formulating optimization problems. This evaluation was based on prompt-based zero-shot and one-shot learning experiments. The performances of these LLMs were compared against the established baseline provided by Ramamonjison et al. (2023), as detailed in Table 2. For a consistent and objective assessment, the same scoring mechanism employed in the baseline evaluation by Ramamonjison et al. was adopted, ensuring a fair and direct comparison of the performance of LLMs relative to the existing benchmark in this task.
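At its core, this scoring treats a formulation as a set of canonical declarations (constraint rows plus the objective) and computes an F1-score between prediction and ground truth. The sketch below conveys the idea under that assumption; the official scorer's exact matching and partial-credit rules (Ramamonjison et al. 2023) are not reproduced here.

    def f1_score(predicted_rows, gold_rows):
        """Set-based F1 over canonical declarations, a simplified
        stand-in for the NL4Opt scoring mechanism."""
        pred = {tuple(r) for r in predicted_rows}
        gold = {tuple(r) for r in gold_rows}
        tp = len(pred & gold)
        if tp == 0:
            return 0.0
        precision, recall = tp / len(pred), tp / len(gold)
        return 2 * precision * recall / (precision + recall)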
Table 2: Performance evaluation of LLMs for optimization problem formulation. The best performance in terms of F1-score is highlighted in bold. GPT-3.5 (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-0613) were accessed through the OpenAI API (https://platform.openai.com/docs/models) on November 1, 2023. The Llama-2-7b model is fine-tuned using the proposed LM4OPT framework.

Language Model                         k-Shot   F1-score
Baseline (Ramamonjison et al. 2023)      -      0.610
Llama-2-7b                               0      0.1259
Llama-2-7b                               1      0.1022
GPT-3.5                                  0      0.4381
GPT-3.5                                  1      0.4928
GPT-4                                    0      0.6072
GPT-4                                    1      0.6330

The baseline performance in Table 2 is derived from a fine-tuned BART (Lewis et al. 2019) model, which operates under different input conditions compared to the LLMs. While LLMs like Llama-2 and GPT receive instruction prompts and problem descriptions in natural language, the baseline BART model is also provided with named entity information extracted from the natural language problem descriptions. This additional data potentially contributes to the baseline's competitive F1-score of 0.61. The GPT-4 model, especially in the one-shot setting, outperforms the others, including the baseline and GPT-3.5, with an F1-score of 0.6330. This superior performance can be attributed to GPT-4's advanced architecture and larger training data, as suggested by recent studies emphasizing the enhanced contextual understanding and response accuracy of more extensive models (OpenAI 2023). Conversely, Llama-2-7b, despite being a smaller model, shows notable performance improvement in the zero-shot setting compared to one-shot, which aligns with the finding that smaller models might struggle with longer context prompts.

Table 3: Performance comparison of fine-tuned Llama-2-7b. 'N' in the 'Fine-tune' column represents non-progressive fine-tuning, whereas 'P' refers to progressive fine-tuning. The best performance is highlighted in bold.

Model        k-Shot   Fine-tune   NEFTune   F1-score
Llama-2-7b     0          ×          ×       0.0036
               0          N          ×       0.0617
               1          N          ×       0.0581
               0          N          ✓       0.0770
               1          N          ✓       0.0693
               0          P          ✓       0.1259
               1          P          ✓       0.1022

Table 3 showcases the performance of the Llama-2-7b model under various fine-tuning conditions. It reports the F1-score across different configurations, including zero-shot and one-shot settings (k-Shot), with and without fine-tuning, and with and without Noisy Embedding Instruction Fine-Tuning (NEFTune). Notably, progressive fine-tuning using the LM4OPT framework (P), especially in the zero-shot setting, significantly enhances performance, achieving the highest F1-score of 0.1259. This indicates the efficacy of progressive fine-tuning combined with NEFTune in improving the ability to understand and solve optimization problems, as opposed to non-progressive fine-tuning (N) and the baseline without any fine-tuning.
A notable observation from Table 3 is the superior outcomes in zero-shot settings compared to their one-shot counterparts across all configurations. This phenomenon could be attributed to the hypothesis that a smaller model like Llama-2-7b struggles with longer contexts. The data suggest that in scenarios involving extended contexts, the model tends to exhibit behavior indicative of hallucination and produces repetitive responses that lack coherence with the broader context. Such patterns reinforce the notion that smaller models may face challenges in maintaining consistency and relevance as prompt length increases, a critical consideration in optimizing model performance for complex tasks.
Effect of Progressive Fine-tuning

As shown in Table 3, fine-tuning specifically for instruction processing significantly enhanced the performance of the Llama-2-7b model. Initially, the pre-trained Llama-2-7b, in both zero-shot and one-shot settings, exhibited substantial hallucination: a notable example was the model generating two distinct sets of variables within a single response, and its output format often failing to align with the given prompt instructions, as demonstrated in Figure 3. Performance improved significantly after progressive fine-tuning, although, as the response samples show, the fine-tuned model still lost accuracy through its inability to consistently maintain a specific response format. It is hypothesized that involving human evaluators or a human-in-the-loop approach for minor modifications to the outputs could significantly improve its efficiency. Such interventions could potentially bring the performance of a smaller model like Llama-2-7b closer to that of some of the larger models.
Does Increased Instruction Length Always Enhance Performance?

Upon a thorough examination of the results and the outputs from both the GPT and Llama models, it became evident that longer instructions do not universally enhance responses across all models. Extended, detailed instructions were beneficial for larger models like GPT-3.5 and GPT-4, helping them resolve parsing issues common in scenarios where multiple formulations are correct but are scored differently due to manual parsing in the scoring mechanism. By specifying the solution format, these larger models were guided to respond in a particular way. For instance, GPT-3.5 in a zero-shot setting produced the equation largeships ≤ smallships, whereas in a one-shot setting it generated largeships − smallships ≤ 0. The latter formulation, after parsing, yielded a higher score.
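The two formulations describe the same half-space, which the snippet below makes explicit by normalizing both to a coefficient row of the canonical lhs − rhs ≤ b form; this normalization is our own illustration of the parsing issue, not the competition's actual parser.

    def canonical_row(lhs_coeffs, rhs_coeffs, rhs_const, variables):
        """Rewrite  lhs <= rhs + const  as a coefficient row for
        (lhs - rhs) <= const."""
        return [lhs_coeffs.get(v, 0.0) - rhs_coeffs.get(v, 0.0)
                for v in variables] + [rhs_const]

    vars_ = ["largeships", "smallships"]
    # "largeships <= smallships"
    row_a = canonical_row({"largeships": 1.0}, {"smallships": 1.0}, 0.0, vars_)
    # "largeships - smallships <= 0"
    row_b = canonical_row({"largeships": 1.0, "smallships": -1.0}, {}, 0.0, vars_)
    assert row_a == row_b == [1.0, -1.0, 0.0]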
However, with Llama-2-7b, a smaller model, longer instructions led to issues such as repetition and hallucination, particularly noticeable in one-shot settings. As illustrated in Figure 4, Llama-2-7b not only repeated parts of the instructions but also generated nonexistent variables such as x1, x2, x3, deviating from the original problem description.

Limitations

In this study, certain limitations have been identified that bear on the research outcomes. A noticeable constraint within the dataset utilized for this research is its composition of straightforward, formally structured samples replete with specific optimization-domain terminology such as 'formulate an LP.' This diverges from our overarching aim to assess the efficacy of LLMs in interpreting and formulating optimization problems as presented in natural language by individuals unversed in domain-specific jargon. It is posited that this dataset limitation might yield a discrepancy between the documented performance of LLMs and their practical utility for domain-agnostic users. Moreover, resource constraints impeded the exploration of progressive fine-tuning effects on larger LLMs, such as Llama-2-70b and GPT-3.5, which might have offered additional insights. Furthermore, the adoption of a rule-based approach for converting intermediate representations to canonical forms has its drawbacks. Upon meticulous review, it was observed that some LLM-generated intermediate representations were inaccurately formatted, leading to canonical forms that diverged from the ground truth. While these discrepancies influenced the LLMs' performance metrics, it is conjectured that such nuances would be within human interpretive capabilities, suggesting that a collaborative human-model approach might counterbalance the observed performance degradation linked to format conversions. The interaction between what the model produces and how humans understand it highlights an important area for future study, emphasizing the need to harmonize machine precision with human judgment.

Conclusion

In this study, we undertook a comprehensive evaluation of LLMs such as GPT-3.5, GPT-4, and Llama-2-7b, focusing on their ability to translate natural language descriptions into mathematical formulations of optimization problems. The research highlights that while GPT-4 exhibits superior performance in both zero-shot and one-shot scenarios, there is a notable capability gap with smaller models like Llama-2-7b, particularly in handling complex contexts. Progressive fine-tuning of Llama-2-7b, especially with noisy embeddings and specialized datasets using our proposed LM4OPT framework, significantly enhances its performance. These findings contribute to understanding the capabilities and limitations of LLMs in a novel application area, paving the way for future improvements in the field of optimization and OR. Drawing upon the foundational research by Teshnizi et al. (AhmadiTeshnizi, Gao, and Udell 2023), this study lays the groundwork for future extensions, wherein the intermediate mathematical representations derived from natural language descriptions in this research could serve as precursors for LLMs to generate ILP solver code in Python.
Pre-trained:

### Solution
```
Variables: thin, stubby
Constraints:
(50.0) ∗ thin + (90.0) ∗ baking <= 3000.0
(30.0) ∗ thin + (150.0) ∗ baking <= 4000.0
(5.0) ∗ thin + (9.0) ∗ baking <= 0.0
(5.0) ∗ thin + (9.0) ∗ baking <= 0.0
Objective Function:
maximize(5.0) ∗ thin + (9.0) ∗ baking
```
### Solution
```
Variables: cleaners, receptionists
Constraints:
(−1.0) ∗ cleaners + (−1.0) ∗ receptionists <= −100.0
(−0.0) ∗ cleaners + (−1.0

Fine-tuned:

Example Response: “Variables: thinjar, stubbyjar
Constraints:
(50.0) ∗ thinjar + (30.0) ∗ stubbyjar <= 3000.0
(90.0) ∗ thinjar + (150.0) ∗ stubbyjar <= 4000.0
Objective Function:
maximize(5.0) ∗ thinjar + (9.0) ∗ stubbyjar”

Figure 3: Sample responses from pre-trained and fine-tuned Llama-2-7b. Hallucinating behavior is highlighted in red.
Looping:

Response: “Variables: apple, pear
Constraints:
(−1.0) ∗ apple + (−0.0) ∗ pear ≤ −5.0
(−0.0) ∗ apple + (−1.0) ∗ pear ≤ −10.0
(1.0) ∗ apple + (−0.0) ∗ pear ≤ −0.0
(1.0) ∗ apple + (2.0) ∗ pear ≤ 0.0
Objective Function: maximize(2.0) ∗ apple + (4.0) ∗ pear”.
## Part 2: An easy LP In this part, you are given a problem description and you have to find the variables and constraints. Your response must be in the following format:

Hallucination:

Variables: x1, x2, x3
Constraints:
(−1.0) ∗ x1 + (−0.0)∗

Figure 4: Looping behavior and hallucination for few-shot instructions in Llama-2-7b. The repetitive and hallucinating behaviors are highlighted in red.
References

AhmadiTeshnizi, A.; Gao, W.; and Udell, M. 2023. OptiMUS: Optimization Modeling Using mip Solvers and large language models. ArXiv, abs/2310.06116.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T. J.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. ArXiv, abs/2005.14165.

Chen, T.; Xu, B.; Zhang, C.; and Guestrin, C. 2016. Training Deep Nets with Sublinear Memory Cost. ArXiv, abs/1604.06174.

Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. ArXiv, abs/2110.14168.

Dakle, P.; Kadioğlu, S.; Uppuluri, K.; Politi, R.; Raghavan, P.; Rallabandi, S. K.; and Srinivasamurthy, R. S. 2023. Ner4Opt: Named Entity Recognition for Optimization Modelling from Natural Language. In Integration of AI and OR Techniques in Constraint Programming.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics.

Hu, J. E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models. ArXiv, abs/2106.09685.

Jain, N.; yeh Chiang, P.; Wen, Y.; Kirchenbauer, J.; Chu, H.-M.; Somepalli, G.; Bartoldson, B.; Kailkhura, B.; Schwarzschild, A.; Saha, A.; Goldblum, M.; Geiping, J.; and Goldstein, T. 2023. NEFTune: Noisy Embeddings Improve Instruction Finetuning. ArXiv, abs/2310.05914.

Karmarkar, N. 1984. A new polynomial-time algorithm for linear programming. Combinatorica, 4: 373–395.

Lannelongue, L.; Grealey, J.; and Inouye, M. 2021. Green algorithms: quantifying the carbon footprint of computation. Advanced Science, 8(12): 2100707.

Laskar, M. T. R.; Hoque, E.; and Huang, J. 2021. Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization. Computational Linguistics, 48: 279–320.

Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; rahman Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Annual Meeting of the Association for Computational Linguistics.

Li, B.; Mellou, K.; qing Zhang, B.; Pathuri, J.; and Menache, I. 2023. Large Language Models for Supply Chain Optimization. ArXiv, abs/2307.03875.

Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; and Raffel, C. 2022. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. ArXiv, abs/2205.05638.

Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.

Nash, J. C. 2000. The (Dantzig) simplex method for linear programming. Comput. Sci. Eng., 2: 29–31.

OpenAI. 2023. GPT-4 Technical Report. ArXiv, abs/2303.08774.

Ramamonjison, R.; Yu, T. T.; Li, R.; Li, H.; Carenini, G.; Ghaddar, B.; He, S.; Mostajabdaveh, M.; Banitalebi-Dehkordi, A.; Zhou, Z.; and Zhang, Y. 2023. NL4Opt Competition: Formulating Optimization Problems Based on Their Natural Language Descriptions. ArXiv, abs/2303.08233.

Suzgun, M.; Scales, N.; Scharli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; hsin Chi, E. H.; Zhou, D.; and Wei, J. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In Annual Meeting of the Association for Computational Linguistics.

Touvron, H.; Martin, L.; Stone, K. R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D. M.; Blecher, L.; Ferrer, C. C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; Fu, W.; Fuller, B.; Gao, C.; Goswami, V.; Goyal, N.; Hartshorn, A. S.; Hosseini, S.; Hou, R.; Inan, H.; Kardas, M.; Kerkez, V.; Khabsa, M.; Kloumann, I. M.; Korenev, A. V.; Koura, P. S.; Lachaux, M.-A.; Lavril, T.; Lee, J.; Liskovich, D.; Lu, Y.; Mao, Y.; Martinet, X.; Mihaylov, T.; Mishra, P.; Molybog, I.; Nie, Y.; Poulton, A.; Reizenstein, J.; Rungta, R.; Saladi, K.; Schelten, A.; Silva, R.; Smith, E. M.; Subramanian, R.; Tan, X.; Tang, B.; Taylor, R.; Williams, A.; Kuan, J. X.; Xu, P.; Yan, Z.; Zarov, I.; Zhang, Y.; Fan, A.; Kambadur, M.; Narang, S.; Rodriguez, A.; Stojnic, R.; Edunov, S.; and Scialom, T. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv, abs/2307.09288.

Tsouros, D. C.; Verhaeghe, H.; Kadiouglu, S.; and Guns, T. 2023. Holy Grail 2.0: From Natural Language to Constraint Models. ArXiv, abs/2308.01589.

Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Neural Information Processing Systems.

Yang, C.; Wang, X.; Lu, Y.; Liu, H.; Le, Q. V.; Zhou, D.; and Chen, X. 2023. Large Language Models as Optimizers. ArXiv, abs/2309.03409.