
THE EFFECTIVENESS OF LARGE LANGUAGE MODELS IN TRANSFORMING UNSTRUCTURED TEXT TO STANDARDIZED FORMATS

A PREPRINT

arXiv:2503.02650v1 [cs.AI] 4 Mar 2025

William Brach, Kristián Košt’ál, Michal Ries
Slovak Technical University
[email protected]

ABSTRACT
The exponential growth of unstructured text data presents a fundamental challenge in modern data
management and information retrieval. While Large Language Models (LLMs) have shown remark-
able capabilities in natural language processing, their potential to transform unstructured text into
standardized, structured formats remains largely unexplored - a capability that could revolutionize
data processing workflows across industries. This study breaks new ground by systematically evaluat-
ing LLMs’ ability to convert unstructured recipe text into the structured Cooklang format. Through
comprehensive testing of four models (GPT-4o, GPT-4o-mini, Llama3.1:70b, and Llama3.1:8b), an
innovative evaluation approach is introduced that combines traditional metrics (WER, ROUGE-L,
TER) with specialized metrics for semantic element identification. Our experiments reveal that GPT-
4o with few-shot prompting achieves breakthrough performance (ROUGE-L: 0.9722, WER: 0.0730),
demonstrating for the first time that LLMs can reliably transform domain-specific unstructured text
into structured formats without extensive training. Although model performance generally scales with
size, we uncover surprising potential in smaller models like Llama3.1:8b for optimization through
targeted fine-tuning. These findings open new possibilities for automated structured data generation
across various domains, from medical records to technical documentation, potentially transforming
the way organizations process and utilize unstructured information.
For further information, source code, and associated resources, please refer to the source code
repository1.

1 Introduction
The exponential growth of unstructured text data presents a challenge in data management, analysis, and information
retrieval. While Large Language Models (LLMs) Brown et al. [2020] have demonstrated remarkable capabilities in
natural language processing, their potential to transform unstructured text into standardized, structured domain-specific
formats remains unexplored. This capability could revolutionize data processing workflows across various domains,
enabling more efficient information retrieval, automated analysis, and data management.
This study addresses these gaps by examining the potential of LLMs in structured text generation Raffel et al. [2020],
specifically focusing on converting plain text recipes into Cooklang specifications Cooklang [2024a]. In this study,
we aim to address two research questions: How do different LLM architectures affect the quality and accuracy of
structured recipe generation? What impact do various prompting techniques Liu et al. [2023] have on the generation
of well-formed Cooklang specifications? The applications of this research extend beyond the culinary domain, with
potential implications for healthcare documentation using HL7 format Dolin et al. [2001], music notation in MusicXML
Good [2001], and technical documentation organization.
In order to evaluate the performance of LLM systems in the generation of structured recipes, a variety of methods are
employed for both qualitative and quantitative assessment. Our methodology included traditional natural language
1 https://github.com/williambrach/llm-text2cooklang

processing metrics such as Word Error Rate (WER) Ali and Renals [2018], Klakow and Peters [2002], and ROUGE-L
Research [2019], See et al. [2017] which measure the accuracy and fluency of the generated text. Additionally,
we introduce domain-specific metrics to evaluate the accuracy of a generated recipe. These include an Ingredient
Identification Score and Unit and Amount Identification Scores. The evaluation of these metrics allows for the assessment
of not only the linguistic quality of the generated recipes but also their adherence to the specific structural requirements of
the Cooklang format and their culinary accuracy. Furthermore, we analyze the LLMs' performance across different input
configurations and prompting techniques (zero-shot PromptingGuide [2023a], few-shot PromptingGuide [2023b], and
MIPROv2 Opsahl-Ong et al. [2024]). This evaluation approach enables us to understand the strengths and limitations
of various LLMs in the context of structured domain-specific format generation.
The remainder of this paper is organized as follows: Section 2 reviews related work in structured text generation
and prompting techniques. Section 3 describes our methodology and evaluation framework. Section 4 presents our
experimental results and examination. Section 5 discusses the implications and limitations of our findings, and Section
6 concludes with future research directions.

2 Background

The standardization of recipe processing and cooking knowledge has evolved considerably since Mundie’s pioneering
work Mundie [1985] in 1985, which introduced RxOL (Recipe Operation Language) as one of the earliest structured
approaches to computational recipe representation. In web search, structured data specification Google Developers
[2024] for recipes represents a significant modern standardization effort, providing a systematic way to encode recipe
information for search engines and enabling enhanced discovery and presentation of recipe content across the web. This
work laid the foundation Dubovskoy [2015] for subsequent advancements in the field, as evidenced by research that
drew innovative parallels between software design patterns and cooking procedures, suggesting that cooking patterns
could serve as an abstraction layer for standardizing recipe knowledge. The development of domain-specific markup
languages (e.g., Cooklang for recipes Cooklang [2024a], MusicXML Good [2001] for digital sheet music, and HL7
Dolin et al. [2001] for healthcare data) has emerged to standardize information representation in specific domains.
These markup languages offer a structured approach to encoding domain knowledge, enhancing readability and
processing efficiency. There has been an increase in the number of initiatives Bommasani and et al. [2021], Pitsilou
et al. [2024], Rostami [2024] focused on leveraging LLMs in the culinary domain or other domains that could leverage
this structured approach in free texts.
One of the use cases for LLMs is the extraction of food entities from cooking recipes, as demonstrated in the study
by Pitsilou et al. Pitsilou et al. [2024]. The results of this study demonstrate that, even in the absence of labeled
training data, LLMs can achieve promising results in domain-specific tasks such as named entity recognition (NER) for
food items. Another illustration of the application of LLMs for structured text parsing outside the culinary domain
is StructGPT, a general framework to enhance LLMs' zero-shot reasoning capabilities over structured data such as
knowledge graphs, tables, and databases Jiang et al. [2023]. An additional example of the use of LLMs in a culinary
context is LLaVA-Chef, a multi-modal generative model specifically designed for food recipes Mohbat and Zaki [2024].
By adapting the LLaVA (Large Language and Vision Assistant) Liu et al. [2024] model to the food domain through a
multi-stage fine-tuning approach, the authors achieved state-of-the-art performance in recipe generation tasks. Notably,
these results demonstrate competitive performance in comparison to the work of Patil et al. Pansare et al. [2024], who
employed convolutional neural networks (CNNs) for image recognition in conjunction with deep learning models for
natural language processing. Another illustration of the application of LLMs to culinary tasks is the development of
RecipeNLG Bień et al. [2020], which employed language models to enhance semi-structured recipe text generation
and generate a dataset for subsequent LLM training and refinement. As this domain progresses, it demonstrates the
potential for integrating general-purpose language models with specialized domain expertise. Building on this work,
Zhou et al.’s research on FoodSky Zhou et al. [2024] demonstrated that LLMs can achieve better performance in
domain-specific tasks by using specialized mechanisms like the Topic-based Selective State Space Model (TS3M) and
Hierarchical Topic Retrieval Augmented Generation (HTRAG). The study demonstrated the effectiveness of FoodSky
by achieving accuracy rates of 67.2% and 66.4% in chef and dietetic examinations, respectively. The authors suggest
that LLMs, when adequately trained and enhanced with domain-specific knowledge, can efficiently handle structured
text conversion tasks. This lays a solid foundation for applications in recipe processing and other specialized fields.
The research paper Cook2LTL Mavrogiannis et al. [2024] demonstrates how to utilize a language model to translate
recipes into linear temporal logic (LTL). Cook2LTL shows how structured representations can bridge the gap between
human-readable content and machine-executable instructions. Another application of LLM Dubovskoy [2024] employs
graph-based representations in order to track ingredient transformations and state changes throughout the cooking
process. This evolution, from basic formalization to pattern-based approaches and finally to LLM-powered graph


Figure 1: Proposed methodology for evaluating the ability of Large Language Models to convert a recipe to Cooklang

representations, illustrates the ongoing efforts to bridge the gap between natural language cooking instructions and
structured, computationally accessible formats.
As LLMs have gained popularity, prompt engineering Brown et al. [2020] has emerged as a compelling alternative to
fine-tuning approaches. Research on language models has focused on developing and evaluating prompt techniques
across a diverse range of tasks. The research space has explored multiple prompting approaches: zero-shot learning
PromptingGuide [2023a], where models perform tasks without prior examples; few-shot learning PromptingGuide
[2023b], which utilizes a small number of examples to guide performance, or more complex prompt techniques like
MIPRO Opsahl-Ong et al. [2024] which optimizes both instructions and few-shot demonstrations for multi-stage
language model programs. The evolution of prompting techniques has demonstrated key advantages over traditional
fine-tuning methods.
Prompt engineering requires fewer computational resources and less training data compared to full model fine-tuning.
Well-built prompts can generate strong performance Li et al. [2023] across a range of tasks while maintaining model
flexibility. These prompts have shown remarkable effectiveness across applications: in sentiment analysis, prompt-based
fine-tuning achieved 92.7% accuracy compared to traditional fine-tuning’s 81.4% while requiring only 32 examples
on the SST-2 dataset Gao et al. [2021]. Similarly, in classification tasks Wei et al. [2024] on datasets like SNLI,
prompt-based methods with demonstrations achieved 79.7% accuracy compared to standard fine-tuning’s 48.4%,
demonstrating the approach’s effectiveness across different linguistic tasks. Furthermore, prompting techniques can
enhance the reasoning capabilities of language models. A notable example is the self-consistency approach introduced
by Wang et al. Wang et al. [2022], which builds upon chain-of-thought prompting to improve complex reasoning
tasks. This method leverages the intuition that multiple reasoning paths can lead to a correct answer, similar to human
problem-solving processes. By sampling diverse reasoning paths instead of relying on a single greedy decoding path,
their approach achieved striking improvements across reasoning benchmarks: GSM8K, SVAMP, and AQuA.
These developments in prompting techniques, from basic few-shot demonstrations to more sophisticated approaches
like self-consistency, highlight the potential of prompt engineering as a practical and effective method for enhancing
language model performance. However, as these techniques become more sophisticated, the challenge of proper
evaluation becomes increasingly important. The evaluation of LLMs in these contexts presents unique challenges,
necessitating specialized metrics that assess both linguistic quality and compliance with domain-specific formatting
requirements Chang et al. [2024]. This has led to significant efforts in developing standardized evaluation methodologies,
particularly for zero-shot text classification tasks. Notable contributions include Ribeiro et al.’s Ribeiro et al. [2020]
comprehensive framework for behavioral testing of natural language processing models.
The convergence of structured data approaches and LLM capabilities creates new opportunities for transforming
unstructured text into standardized formats. While existing research demonstrates the potential of LLMs in domain-
specific tasks and prompt engineering shows promise for enhancing model performance, there remains a critical need
to systematically evaluate different LLM architectures and prompting techniques in the context of structured format
generation. The following methodology addresses these requirements through a comprehensive evaluation framework
that combines standard metrics like WER and ROUGE-L with specialized scoring mechanisms for assessing structured
recipe generation to Cooklang format Cooklang [2024a].

3 Methodology

This section contains the methodology for evaluating the effectiveness of Large Language Models (LLMs) in generating
structured recipe specifications in Cooklang format Cooklang [2024a]. Our experimental design encompasses three


key dimensions: input text variations, prompting strategies, and model architectures, allowing us to systematically
assess how each factor influences the quality and accuracy of structured recipe generation. As illustrated in Figure
1, our approach enables controlled comparison across these dimensions. We evaluated four state-of-the-art LLMs:
Llama 3.1:8b, Llama 3.1:70b Dubey et al. [2024], Meta AI [2024a], Meta AI [2024b], GPT-4o OpenAI [2024a], and
GPT-4o-mini OpenAI [2024b], each representing different points in the spectrum of model scale. These models were
selected to provide an evaluation across both closed and open-source models, allowing us to compare performance
between different model scales (from 8b to 70b parameters) and between proprietary and openly available solutions.
These LLMs are among the most widely used closed- and open-source models. Through the testing of combinations of input formats, prompting
strategies, and model architectures, the objective was to identify the most effective approach for converting recipe texts
into the structured domain-specific recipe format. The conversion task takes two distinct input fields, as shown in the
dspy.Signatures (Listings 1, 2, and 3). The first field contains the ingredients for the recipe, presented as free text with
each ingredient comma-separated. The second field consists of the general recipe text, which primarily comprises
cooking instructions; this field is converted into Cooklang format by the LLM. The objective of this research is to examine
the limitations of LLMs in structured domain-specific text generation. The research explores how different models and
prompt engineering influence semantic preservation and structural consistency in the conversion of unstructured recipes
to formal specifications. Through an evaluation of model size and prompt engineering approaches, the research aims
to advance our understanding of LLMs' capabilities in maintaining semantic fidelity while adhering to strict syntactic
constraints. This fundamental challenge in natural language processing extends far beyond recipe conversion to broader
applications in structured knowledge representation.

Listing 1: DSPy Program Class recipe text to Cooklang


import dspy


class CookLangSignature(dspy.Signature):
    """
    Convert plain recipe text with provided ingredients into Cooklang text format.
    Cooklang Recipe Specification:
    1. Ingredients
       - Use `@` to define ingredients
       - For multi-word ingredients, end with `{}`
       - Specify quantity in `{}` after the name
       - Use `%` to separate quantity and unit
       ```
       @salt
       @ground black pepper{}
       @potato{2}
       @bacon strips{1%kg}
       @syrup{1/2%tbsp}
       ```
    2. Comments
       - Single-line: Use `--` at the end of a line
       - Multi-line: Enclose in `[- -]`
       ```
       -- Don't burn the roux!
       Mash @potato{2%kg} until smooth -- alternatively, boil 'em first, then
       mash 'em, then stick 'em in a stew.
       ```
    3. Cookware
       - Define with `#`
       - Use `{}` for multi-word items
       ```
       #pot
       #potato masher{}
       ```
    4. Timers
       - Define with `~`
       - Specify duration in `{}`
       - Can include a name before the duration
       ```
       ~{25%minutes}
       ~eggs{3%minutes}
       ```

    Return only Cooklang formatted recipe, dont return any other information.
    Return whole recipe in Cooklang format! Dont stop till you reach the end of
    the recipe.
    """

    ingredients = dspy.InputField(desc="Ingredients for the recipe. Comma separated list of ingredients.")
    recipe_text = dspy.InputField(desc="Recipe text to convert to Cooklang format.")
    cooklang = dspy.OutputField(desc="Cooklang formatted recipe.")


class CookLangFormatter(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(CookLangSignature)

    def forward(self, recipe_text: str, ingredients: str) -> CookLangSignature:
        prediction = self.prog(recipe_text=recipe_text, ingredients=ingredients)
        return prediction

Listing 2: DSPy Program Class recipe text to Cooklang without Cooklang Specification
class CookLangSignatureNoSteps(dspy.Signature):
    """
    Convert plain recipe text with provided ingredients into Cooklang text format.
    Return only Cooklang formatted recipe, dont return any other information.
    Return whole recipe in Cooklang format! Dont stop till you reach the end of
    the recipe.
    """

    ingredients = dspy.InputField(desc="Ingredients for the recipe. Comma separated list of ingredients.")
    recipe_text = dspy.InputField(desc="Recipe text to convert to Cooklang format.")
    cooklang = dspy.OutputField(desc="Cooklang formatted recipe.")


class CookLangFormatterNoSteps(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(CookLangSignatureNoSteps)

    def forward(self, recipe_text: str, ingredients: str) -> CookLangSignatureNoSteps:
        prediction = self.prog(recipe_text=recipe_text, ingredients=ingredients)
        return prediction

Listing 3: DSPy Program Class recipe text to Cooklang without Cooklang Specification and without recipe ingredients
class CookLangSignatureNoStepsNoIngredients(dspy.Signature):
    """
    Convert plain recipe text with provided ingredients into Cooklang text format.
    Return only Cooklang formatted recipe, dont return any other information.
    Return whole recipe in Cooklang format! Dont stop till you reach the end of
    the recipe.
    """

    recipe_text = dspy.InputField(desc="Recipe text to convert to Cooklang format.")
    cooklang = dspy.OutputField(desc="Cooklang formatted recipe.")


class CookLangFormatterNoStepsNoIngredients(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(CookLangSignatureNoStepsNoIngredients)

    def forward(self, recipe_text: str) -> CookLangSignatureNoStepsNoIngredients:
        prediction = self.prog(recipe_text=recipe_text)
        return prediction

The study implements three prompting strategies. The first strategy, Zero-Shot PromptingGuide [2023a], established a
baseline by presenting the task to the model without examples. The second strategy, Few-Shot Reynolds and McDonell
[2021], PromptingGuide [2023b], involved the presentation of a limited number of examples to guide the model in its
task. The final strategy was an implementation of MIPROv2 Opsahl-Ong et al. [2024]. The study tested three input
configurations and ran every combination of input type and prompt technique across all four large language models.
The following is an overview of the three input configurations:
1. Method: The recipe method in its original format without additional processing.
2. Method + Ingredients: The recipe method is accompanied by a list of recipe ingredients.
3. Method + Ingredients + Cooklang schema: The recipe method and ingredients list, along with the Cooklang
syntax specification.
All configurations aimed to produce a converted version of the recipe text, which we saved as a recipe.cook file
and loaded through the Cooklang parser. A successful parse confirmed that the input text had been converted to
valid Cooklang format. To fully evaluate our approach, we tested all possible combinations of input types, prompt
techniques, and LLMs.
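
The validation step can be reproduced with a short script. The sketch below is a minimal illustration, assuming the CookCLI is installed locally and exposes a "cook recipe read" subcommand; the exact invocation may differ between CookCLI versions and from the parser used in our pipeline:

import subprocess
from pathlib import Path

def validate_cooklang(generated_text: str, path: str = "recipe.cook") -> bool:
    """Write the model output to a .cook file and check that the Cooklang CLI can parse it."""
    Path(path).write_text(generated_text, encoding="utf-8")
    # A non-zero exit code signals that the generated text is not valid Cooklang.
    result = subprocess.run(["cook", "recipe", "read", path], capture_output=True, text=True)
    return result.returncode == 0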

3.1 Prompt engineering

Our approach to transforming recipes into Cooklang format with an LLM is built on the DSPy framework Khattab
et al. [2024]. We developed distinct DSPy programs for each model input type to ensure optimal performance across all
scenarios. We implemented few-shot prompting using the bootstrap-few-shot-with-random-search method from the
DSPy library, keeping the default hyperparameters. Similarly, we implemented MIPROv2 Opsahl-Ong et al. [2024]
using the existing DSPy implementation with its default hyperparameters. Listings 1, 2, and 3 show the DSPy programs
we used in our implementation. These prompts form the core elements of our recipe transformation process.
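
To illustrate how these optimizers are wired to the DSPy program from Listing 1, the following sketch compiles CookLangFormatter with both bootstrap few-shot random search and MIPROv2. The model identifier, the exact-match metric, and the single training example are illustrative placeholders rather than the exact configuration used in our experiments, and the optimizer arguments may vary slightly between DSPy versions:

import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, MIPROv2

# Placeholder backend; a local ollama model can be configured analogously via dspy.LM.
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

def cooklang_metric(example, prediction, trace=None):
    # Hypothetical metric: exact match between the reference and the generated Cooklang text.
    return example.cooklang.strip() == prediction.cooklang.strip()

trainset = [
    dspy.Example(
        ingredients="salt, potato",
        recipe_text="Boil the potato and season with salt.",
        cooklang="Boil the @potato{} and season with @salt.",
    ).with_inputs("ingredients", "recipe_text")
]

# Few-shot prompting: bootstrap demonstrations with random search (default hyperparameters).
fewshot = BootstrapFewShotWithRandomSearch(metric=cooklang_metric)
fewshot_program = fewshot.compile(CookLangFormatter(), trainset=trainset)

# MIPROv2: jointly optimizes instructions and few-shot demonstrations (default hyperparameters).
mipro = MIPROv2(metric=cooklang_metric)
mipro_program = mipro.compile(CookLangFormatter(), trainset=trainset)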

3.2 Dataset

We assembled the dataset for this research from sources that provide recipes in both standard and Cooklang formats.
The primary sources include the Cooklang documentation and associated GitHub repositories Cooklang [2024b,c],
Cooklang Community [2024]. We merged the recipes from these sources to create a comprehensive dataset, which we
make publicly available in our GitHub repository2. The dataset contains 32 distinct recipe
samples, representing a diverse range of recipe categories, including baking, breakfast, dinners, lunches, and soups.
This diversity ensures a broad representation of culinary styles and techniques, enabling a more robust evaluation of
the models. We enhanced the dataset by extracting ingredients from the Cooklang Specification using the Cooklang
CLI CookLang [2024]. We added these extracted ingredients as an additional feature to the dataset, enriching the
information available for analysis and prompt optimization.

3.3 Evaluation Process

We evaluated the models across several metrics, including overall performance, the impact of different prompt techniques,
the effect of using Cooklang and ingredient exclusion, and domain-specific tasks related to recipe understanding. We
2 https://github.com/williambrach/llm-text2cooklang/tree/main/data

tested the LLMs' ability to accurately identify ingredients, units, and amounts. We used several metrics to assess the
LLMs' performance in detail.

1. Word Error Rate (WER) Ali and Renals [2018], Klakow and Peters [2002]: The metric measures the edit
distance between generated and reference text quantitatively, normalizing values by the reference text’s word
length. A lower WER value indicates more accurate performance that comes closer to the desired outcome.
2. ROUGE-L Research [2019], See et al. [2017]: A metric for evaluating the quality of generated summaries or
translations. ROUGE-L calculates F-scores based on the length of the longest common subsequence between
the candidate and reference texts. This method allows for capturing sentence-level structure similarity and
identifying the longest co-occurring in-sequence n-grams automatically. Higher ROUGE-L scores indicate
better performance, suggesting that the generated text shares more and longer sequential word matches with
the reference text.
3. Token Error Rate (TER): This metric measures how closely generated Cooklang texts match their references
by calculating a normalized edit distance. The system first tokenizes texts by breaking them down into
Cooklang-specific elements (ingredients, cookware, and cooking steps). It then computes TER for each
element by counting the minimum number of edits needed to transform the generated text into the reference -
including insertions, deletions, substitutions, and shifts of Cooklang tokens - and divides this count by the total
tokens in the reference. Lower TER scores show better performance.
4. Ingredient Identification Score: This metric assesses the LLM's capacity to accurately identify all ingredients in a
given recipe. A score of 1 indicates that the LLM successfully identified all ingredients present in the reference
recipe, whereas a score of 0 indicates that the LLM failed to identify all ingredients correctly.
5. Unit and Amount Identification Scores: These two metrics assess the LLM's accuracy in identifying units of
measurement and ingredient amounts in a recipe. For each metric, the score is 1 if all units or amounts are
correctly identified for every ingredient in the recipe and 0 if any unit or amount is incorrectly identified or
missing.

For each configuration, we evaluated all metrics across the test samples to obtain individual scores. The mean value for
each metric M under configuration c was calculated as:

$\bar{M}_c = \frac{1}{N}\sum_{i=1}^{N} M_{c,i}$

where $M_{c,i}$ represents the metric score for configuration $c$ and test sample $i$, and $N$ denotes the total number of test
samples. This approach provides a comprehensive view of LLM performance across all aspects of recipe understanding
and generation tasks. Analyzing these aggregated results helps us identify strengths and weaknesses in different model-
prompt configurations and determine the most effective approaches for recipe-related natural language processing.
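
As a concrete illustration, the snippet below sketches how the traditional metrics and the per-configuration mean above could be computed with off-the-shelf libraries (jiwer for WER and rouge-score for ROUGE-L); the library choices are illustrative and not necessarily those used in our evaluation pipeline:

from statistics import mean

import jiwer
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def traditional_metrics(reference: str, generated: str) -> dict:
    """WER and ROUGE-L F-score for one generated Cooklang recipe against its reference."""
    return {
        "wer": jiwer.wer(reference, generated),
        "rouge_l": scorer.score(reference, generated)["rougeL"].fmeasure,
    }

def mean_metric(sample_scores: list[dict], metric: str) -> float:
    """Mean value of one metric over the N test samples of a single configuration."""
    return mean(scores[metric] for scores in sample_scores)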

3.4 Hardware and Deployment Setup

We deployed all open-source models (Llama3.1:8b and Llama3.1:70b) locally using ollama Ollama [2024], a framework
that runs and manages local deployment of LLMs. Our local setup included two NVIDIA GeForce RTX 4090 graphics
cards, which provided the necessary computational power for efficient model inference. We accessed OpenAI GPT
models through the OpenAI API.
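
For completeness, a locally served Llama model can be queried over ollama's HTTP API on its default port; the snippet below is a minimal sketch of such a request and is not the evaluation harness itself:

import requests

def query_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send a single non-streaming generation request to a local ollama server."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]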

4 Results
This section analyzes the performance of four Large Language Models (LLMs): GPT-4o, GPT-4o-mini, Llama3.1:70b,
and Llama3.1:8b. The analysis evaluates each model using three traditional metrics: Word Error Rate (WER), ROUGE-
L score, and Token Error Rate (TER), grouping all results by their mean values. Table 1 shows each model's
performance metrics:
As shown in Figure 2, the GPT-4o model displays superior performance across all metrics in converting inputs to
the Cooklang format. The model achieved the highest ROUGE-L score (0.9280), indicating a high degree of overlap between the
generated Cooklang output and the reference text at the sequence level. Furthermore, the model had the lowest Token
Error Rate (TER) of 0.8619, indicating minimal hallucination or generation of content not grounded in the original
text. Additionally, it showed the lowest Word Error Rate (WER) of 0.1996, implying high accuracy in maintaining the
correct word sequence with minimal insertions, substitutions, or deletions when converting to Cooklang format. These


Table 1: Mean WER, ROUGE-L, and TER Scores Across Language Models
Model WER ↓ ROUGE-L ↑ TER ↓
GPT-4o 0.1996 0.9280 0.8619
GPT-4o-mini 0.2907 0.8689 1.4088
Llama3.1:70b 0.9803 0.5453 4.7859
Llama3.1:8b 1.3513 0.4526 6.4910

Note: ↓ indicates lower values are better, ↑ indicates higher values are better.

Figure 2: Comparison of Language Model Performance Across WER, ROUGE-L, and TER Metrics

results suggest that GPT-4o could be particularly well-suited for applications requiring accurate transformation between
different text formats and structures, as its strong performance in maintaining semantic content while adapting syntactic
structure indicates robust capabilities in text transformation tasks. GPT-4o-mini ranks second in converting plain text
to Cooklang, demonstrating consistent performance, but falls behind GPT-4o. Its Word Error Rate (WER) of 0.2907
shows more errors in word sequence maintenance. The lower ROUGE-L score of 0.8689 reveals reduced overlap
between the generated Cooklang and reference text. A Token Error Rate (TER) of 1.4088 indicates that GPT-4o-mini
generates more content without grounding in the original text. Results for the Llama3.1 family show that both models
exhibit considerably reduced performance in Cooklang conversion relative to the GPT models. The 70b version exhibits
superior performance compared to the 8b version across all metrics, suggesting that an increased model size contributes
to enhanced performance. However, both models demonstrate a notable deficit in comparison to the GPT models. The
WER values for the Llama models exceed 0.98, indicating a significant prevalence of errors in word sequence. The
ROUGE-L scores are below 0.55, suggesting a notable discrepancy between the generated text and the reference text.
Additionally, the TER values exceed 4.78, indicating a considerable tendency for hallucination or the generation of
content that is not directly relevant to the source text.

4.1 Impact of Prompt Techniques on Model Performance

Analysis of the layer of prompt techniques – MIPROv2, Few-Shot, Zero-Shot – reveals differences in their effectiveness
across models. The results, presented in Table 2 and Figure 3, demonstrate that the impact of prompt techniques
varies considerably depending on the model used. The Few-Shot prompt technique consistently demonstrates superior
performance compared to other techniques across all models, achieving the lowest word error rate (WER), the highest
ROUGE-L score, and the lowest token error rate (TER).
The efficacy of each prompt technique varies considerably across different models. Larger, more advanced models
(GPT-4o and GPT-4o-mini) demonstrate superior overall performance and less variability between prompt techniques
in comparison to smaller models (Llama3.1:70b and Llama3.1:8b). It is noteworthy that while Few-Shot consistently
ranks first, the relative performance of MIPROv2 and Zero-Shot varies depending on the model and metric. Larger
models demonstrate the most optimal overall performance across all techniques, with Few-Shot attaining a WER as low
as 0.073 for GPT-4o. In contrast, the Llama3.1 models exhibit elevated error rates and diminished ROUGE-L scores,
with Zero-Shot outperforming MIPROv2, particularly for the 8b variant.


Table 2: Model Performance Metrics Across Prompt Techniques


Model WER ↓ ROUGE-L ↑ TER ↓
GPT-4o (MIPROv2) 0.2221 0.9384 0.9950
GPT-4o (Few-Shot) 0.0730 0.9722 0.2096
GPT-4o (Zero-Shot) 0.3038 0.8733 1.3811
GPT-4o-mini (MIPROv2) 0.1962 0.8948 1.1047
GPT-4o-mini (Few-Shot) 0.0865 0.9646 0.2509
GPT-4o-mini (Zero-Shot) 0.6208 0.7387 2.9723
Llama3.1:70b (MIPROv2) 1.2060 0.4100 5.9615
Llama3.1:70b (Few-Shot) 0.8692 0.6170 4.4031
Llama3.1:70b (Zero-Shot) 0.8656 0.6091 3.9931
Llama3.1:8b (MIPROv2) 1.9504 0.2806 9.2570
Llama3.1:8b (Few-Shot) 1.1835 0.4898 5.8003
Llama3.1:8b (Zero-Shot) 0.9200 0.5874 4.4156

Note: ↓ indicates lower values are better, ↑ indicates higher values are better.

Figure 3: Prompting Technique Performance Across WER, ROUGE-L, and TER Metrics


4.2 Effect of Cooking Language Specification and Ingredient Exclusion

We also evaluated whether inserting the Cooklang language specification into the prompt or presenting a list of ingredients
yields a significant difference. Our findings indicate considerable discrepancies in model performance contingent
on the utilization of Cooklang and the exclusion of ingredient information. We examined the four models under different
configurations: with/without the Cooklang specification and with/without recipe ingredients. The results, presented in
Table 3, demonstrate that these factors influence model performance across evaluation metrics.

Table 3: LLM Performance Metrics Across Different Configurations


Model Cooklang Specification Ingredients WER ↓ ROUGE-L ↑ TER ↓
GPT-4o False False 0.3370 0.8839 1.4448
GPT-4o False True 0.1596 0.9392 0.6676
GPT-4o True True 0.1023 0.9608 0.4733
GPT-4o-mini False False 0.3924 0.8461 1.5210
GPT-4o-mini False True 0.2768 0.8800 1.2981
GPT-4o-mini True True 0.2075 0.8770 1.4443
Llama3.1:70b False False 1.1409 0.4483 5.3866
Llama3.1:70b False True 1.2036 0.4926 5.8035
Llama3.1:70b True True 0.5964 0.6951 3.1677
Llama3.1:8b False False 1.6832 0.3529 8.1012
Llama3.1:8b False True 1.2686 0.4796 5.7104
Llama3.1:8b True True 1.1021 0.5252 5.6614

Note: ↓ indicates lower values are better, ↑ indicates higher values are better.

Figures 4 and 5 show that including both Cooklang and ingredient information improved performance across all tested
models. This suggests that structured language and contextual information enhance recipe processing capabilities.
GPT-4o performed best across all configurations and metrics, achieving its highest scores when using both Cooklang and
ingredient information. The benefits of these additions scaled with model size, as larger models showed more substantial
improvements. While ingredient information alone generally improved performance, adding Cooklang created further
gains, highlighting the value of domain-specific structured languages. These findings could indicate that language
models benefit from both structured formats and comprehensive contextual information when processing domain-
specific tasks. The pronounced impact of including ingredient information, even without Cooklang specifications,
indicates that contextual knowledge plays a crucial role in recipe understanding. This aligns with previous findings in
natural language processing, where domain-specific context enhances task performance. However, the synergistic effect
of combining both Cooklang and ingredients suggests that structured formats help models better leverage available
contextual information.

4.3 Ingredient Identification

Our evaluation metric assessed how accurately LLMs identify all ingredients in recipes. Figure 6 illustrates the
identification capabilities and shows performance variations across four tested models. GPT-4o achieved the highest
score of 0.61 on the ingredient identification metric (0-1 range), while GPT-4o-mini scored 0.56. The LLama models
performed significantly worse, with Llama3.1:70b scoring 0.24 and Llama3.1:8b reaching only 0.10. This gap reveals
a significant difference between GPT and Llama models in ingredient identification tasks. The top GPT model
outperforms the Llama model by approximately 2.5 times. Within the Llama family, increasing the model size from 8b
to 70b parameters improved performance by 2.4 times, highlighting how the model scale affects this task. It is important
to note that this evaluation methodology relies solely on exact string matching for ingredient identification without
accounting for spelling variations or minor textual discrepancies. This strict matching criterion likely underestimates
the models’ true ingredient identification capabilities, as it penalizes semantically correct but textually inexact matches.
Future work could enhance this metric by incorporating fuzzy matching or semantic similarity measures to provide
a more comprehensive assessment of ingredient identification performance. Figure 7 provides a deeper analysis,
examining both missing ingredients (false negatives) and incorrectly added ingredients (false positives) across models.
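
Because the identification scores are binary and rely on exact string matching, they can be expressed compactly; the sketch below illustrates the ingredient score together with a possible fuzzy-matching relaxation of the kind suggested above (the similarity threshold and helper names are illustrative):

from difflib import SequenceMatcher

def ingredient_identification_score(reference: set[str], generated: set[str]) -> int:
    """1 if every reference ingredient appears verbatim among the generated ingredients, else 0."""
    return int(reference <= generated)

def fuzzy_identification_score(reference: set[str], generated: set[str], threshold: float = 0.9) -> int:
    """Possible relaxation: accept near matches (e.g. spelling variants) above a similarity threshold."""
    def matched(item: str) -> bool:
        return any(SequenceMatcher(None, item, candidate).ratio() >= threshold for candidate in generated)
    return int(all(matched(item) for item in reference))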

4.4 Accuracy in Unit and Amount Identification

Table 4 shows LLM performance metrics across configurations and prompt techniques, focusing on unit and amount
identification accuracy in domain-specific contexts. We evaluated the models using MIPROv2, few-shot, and zero-shot


Figure 4: Impact of Cooklang Specification Integration on WER, ROUGE-L, and TER Performance Metrics


Figure 5: Impact of Ingredients Integration on WER, ROUGE-L, and TER Performance Metrics


Figure 6: Model Performance in Ingredient Identification, higher values are better.

prompting techniques. GPT-4o, with few-shot prompting, achieved the highest scores, reaching 0.5729 and 0.5833
for unit and amount identification. Few-Shot and MIPROv2 techniques outperformed Zero-Shot across most models,
highlighting effective prompting’s impact. GPT-4o and GPT-4o-mini surpassed the Llama models in performance,
while the 70b Llama version exceeded its 8b variant. Zero-Shot yielded the lowest results, demonstrating proper
prompting’s necessity. Llama models, particularly the 8b version, struggled compared to GPT models. Notably,
Llama3.1:8b showed slightly better performance with Zero-Shot for amount detection. These results demonstrate how
model size and prompting techniques affect specialized task performance while indicating room for improvement in
domain applications. As with ingredient identification, the Unit and Amount Identification metrics rely solely on exact
string matching and do not account for spelling variations or minor textual discrepancies.

Table 4: LLM Performance Metrics Across Different Models and Prompt Techniques
Model Prompt Technique Find All Units ↑ Find All Amounts ↑
GPT-4o MIPROv2 0.5313 0.5625
GPT-4o Few-Shot 0.5729 0.5833
GPT-4o Zero-Shot 0.2292 0.2396
GPT-4o-mini MIPROv2 0.5234 0.5469
GPT-4o-mini Few-Shot 0.4688 0.4479
GPT-4o-mini Zero-Shot 0.1146 0.1563
Llama3.1:70b MIPROv2 0.1771 0.1667
Llama3.1:70b Few-Shot 0.1979 0.2083
Llama3.1:70b Zero-Shot 0.1042 0.1146
Llama3.1:8b MIPROv2 0.0104 0.0000
Llama3.1:8b Few-Shot 0.1146 0.1146
Llama3.1:8b Zero-Shot 0.1146 0.1250

Note: ↑ indicates higher values are better.


Figure 7: Comparison of models

5 Discussion

Our analysis of models, prompt techniques, and configurations demonstrates a clear pattern in the optimization of
performance for both recipe processing and Cooklang conversion tasks. The results consistently indicate a hierarchy
of performance, with GPT-4o exhibiting the highest performance, followed by GPT-4o-mini > Llama 3.1:70b >
Llama 3.1:8b.
An in-depth examination of the configurations revealed which settings consistently outperformed the alternatives:

• Model: GPT-4o achieved superior performance across all evaluation metrics and tasks, consistently outperform-
ing other models in word error rate (WER), ROUGE-L, token error rate (TER), ingredient identification,
and quantity recognition accuracy. The model demonstrated exceptional general language understanding and
generation capabilities while excelling in domain-specific tasks. Llama models show positive scaling behavior
with increased model size. However, they consistently underperformed compared to GPT models across both
general and domain-specific metrics.


• Prompt Technique: The Few-Shot prompting technique demonstrated superior performance across all models.
For GPT-4o specifically, Few-Shot achieved optimal results with WER=0.0730, ROUGE-L=0.9722, and
TER=0.2096. MIPROv2 showed strong performance across most models, with Llama3.1:70b being the
exception where both Zero-Shot and Few-Shot outperformed MIPROv2. The overall ranking of prompt
techniques from our analysis: Few-Shot > MIPROv2 > Zero-Shot.
• Cooklang and Ingredient Information: The implementation of the Cooklang format and ingredient in-
formation resulted in enhanced recipe parsing and generation, leading to improved WER, ROUGE-L, and
TER scores. However, GPT models exhibited a tendency to include an excess of ingredients, occasionally
introducing superfluous elements, as shown in Figure 7. In contrast, Llama models demonstrated a reduced
proclivity to incorporate unnecessary ingredients but were more susceptible to the omission of essential
components, potentially impacting the completeness and accuracy of the recipes.
• Accuracy of Units and Amounts: The GPT-4o model with Few-Shot prompting demonstrated the most
effective performance (Table 4) in identifying units (0.5729) and amounts (0.5833). The GPT models exhibited
markedly stronger performance in quantitative information extraction. In contrast, the Llama models exhibited poor performance in
extracting quantitative information (units and amounts), which is crucial for ensuring recipe precision.

The results indicate that the best configuration for recipe processing and Cooklang conversion tasks is GPT-4o with
Few-Shot prompting, incorporating both the Cooklang specification and
ingredient information. While GPT-4o-mini and larger Llama models (70b) demonstrated improvements with analogous
configurations, the performance differential between GPT-4o and alternative models was considerable. However,
the higher cost of using GPT-4o compared to smaller models presents a notable drawback, potentially limiting its
practical implementation in resource-constrained scenarios where GPT-4o-mini could be more suitable with minimal
performance reduction.

5.1 Open source model size comparison

The comparison between Llama3.1:8b and Llama3.1:70b reveals intriguing performance characteristics across all
metrics. As anticipated, the larger 70b parameter model generally outperformed its 8b counterpart, consistent with the
expectation that increased model size correlates with enhanced capabilities. However, a noteworthy observation emerges
from the data shown in Figure 2: the upper-performance bounds of the 8b model frequently approach or overlap with
those of the 70b model. This phenomenon is particularly evident in the WER and TER metrics, where the top quartile of
Llama3.1:8b’s performance distribution nearly intersects with the median performance of Llama3.1:70b. The observed
performance overlap indicates that, while the 70b model demonstrates more consistent high-level performance, the 8b
model exhibits the potential for comparable results in optimal scenarios. This finding has significant implications for
model selection and optimization strategies. The demonstrated capacity of the smaller model to achieve performance
levels similar to its larger counterpart in certain instances opens avenues for targeted fine-tuning. By optimizing the 8b
model for specific use cases, it may be possible to narrow or even eliminate the performance gap with the 70b model
in particular applications. This potential for high performance in a smaller model through fine-tuning is particularly
attractive from both practical and resource-efficiency perspectives.

5.2 Limitations

While this study sheds light on the capabilities of Large Language Models (LLMs) in converting recipes to the Cooklang
format, it is important to acknowledge the limitations of the study.

1. Model Versions: This study focuses on specific LLMs: GPT-4o, GPT-4o-mini, Llama3.1:8b, and
Llama3.1:70b. To gain a more comprehensive understanding, further research should expand the scope
to include a wider range of models.
2. Task Specificity: Our evaluation focused on converting recipe text into the Cooklang format. While
this offers deep insights into domain-specific performance, it may not fully demonstrate the model’s ability
to generate other domain-specific structured text. Future studies should examine the model’s broader text
generation capabilities.
3. Dataset Limitations: The study utilized a dataset of 32 distinct recipe samples from several categories, which
provides a foundation for initial analysis. However, to enhance the generalization of the results across different
cuisines and recipe types, it would be beneficial to construct a larger, more comprehensive dataset. This
expanded dataset could include a wider variety of recipes from diverse cultural backgrounds, cooking methods,
and ingredient combinations. By evaluating the models on this larger dataset, we could gain more robust


insights into their performance and better assess their applicability across a broader spectrum of culinary
contexts.
4. Fine-tuning Potential: The study does not examine the potential for improvements resulting from task-specific
fine-tuning, which could enhance performance and alter results. It would be beneficial for future research to
include fine-tuning experiments for a more comprehensive evaluation. Fine-tuning large open-source models
like Llama3.1 8b and 70b could potentially achieve performance comparable to GPT-4o or GPT-4o-mini,
which highlights the importance of investigating this approach.

Future research should address these limitations in order to gain a more comprehensive understanding of language
model performance in recipe generation and related tasks.

6 Conclusion
Our research demonstrates the transformative potential of large language models (LLMs) in revolutionizing domain-
specific content processing and generation. Through experimentation, we have shown that state-of-the-art models,
particularly GPT-4o, can handle the intricate task of converting unstructured recipes into the standardized Cooklang
format, achieving a ROUGE-L score of 0.9722 and a Word Error Rate of 0.0730 with few-shot prompting. These results
represent a substantial advance in automated recipe processing capabilities.
The implications of this research extend far beyond the culinary domain, opening new horizons in the field of structured
text generation. Our findings provide a clear path forward for organizations grappling with the challenge of converting
unstructured information into standardized, machine-readable formats. The success demonstrated in the conversion
of recipes serves as a powerful proof-of-concept for similar transformations in other specialized domains, including
healthcare documentation (HL7), legal contract analysis, financial compliance reporting, and technical documentation
management.
What makes these results particularly exciting is their immediate practical applicability. Organizations can leverage
this technology to dramatically streamline their workflows, potentially reducing manual processing time by orders of
magnitude while simultaneously improving data quality and consistency with little implementation overhead.
Looking ahead, this research points toward a new era in natural language processing, where the barrier between
unstructured and structured content becomes increasingly permeable. The demonstrated capabilities of large language
models (LLMs) in understanding and generating domain-specific formats suggest a paradigm shift in how organizations
manage, process, and utilize their textual data. As these models continue to evolve, the potential applications will only
expand, promising even more sophisticated and efficient solutions for complex content transformation challenges.

6.1 Future Research Directions


1. Fine-tuning: The smaller models, like those in the Llama family, demonstrated difficulties in performing
complex tasks such as accurate ingredient identification. However, the study suggests that targeted fine-tuning
of these smaller models, such as Llama3.1:8b, could potentially narrow the performance gap in specific
applications.
2. Tools for structured generation: Frameworks such as Outlines Willard and Louf [2023] provide tools for
utilizing open-source language models to generate Cooklang-compliant recipes without the necessity for
fine-tuning. The framework could employ regular expression-guided generation to guarantee correct formatting
of ingredients, structured text generation to provide overall recipe structure, constrained decoding to ensure
specification compliance, and custom stopping criteria to confirm complete recipes. This approach allows the
generation of fully-formed, Cooklang-compliant recipes from simple prompts, combining the model’s culinary
knowledge with strict adherence to Cooklang specifications (a minimal sketch of regex-guided generation follows this list).
3. Broader model selection: While our current study established baseline performance using representative
models (GPT-4o, GPT-4o-mini, Llama3.1:70b, and Llama3.1:8b), we plan to expand this analysis to include
newer model iterations and additional open-source and proprietary models. This broader evaluation will help
validate our findings and identify performance patterns across different model architectures and sizes.
4. Advanced Prompting Techniques: The exploration of alternative prompting techniques, such as multi-step
prompting Fu et al. [2023] for recipe generation and conversion, and structured prompts that emulate the
Cooklang format, is a promising avenue for further research and could improve every metric.
5. Dataset Expansion: The proposed improvements involve an expansion of the recipe dataset beyond the current
32 samples to encompass a more diverse range of cuisines, cooking methods, and ingredient combinations.


This will enhance the generalization of the results. The same conversion approach could also be used to create
new datasets by transforming existing recipe collections into Cooklang markup; such datasets could then be used
for fine-tuning both large and small language models for this task, enabling more nuanced natural language
understanding in culinary contexts. This transformation not only standardizes the recipe representation but also
creates opportunities for cross-cultural culinary analysis and automated recipe processing applications.
6. Potential Impact on Culinary Industry: AI-powered automation shows clear potential to transform recipe
generation and conversion, especially within digital cookbooks. Models like GPT-4 have initiated a promising
shift in cookbook creation and management. AI systems can now streamline the conversion between plain
text and structured recipe formats, which could revolutionize cookbook publishing and recipe management
systems.
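
As referenced in item 2, the following is a minimal sketch of regex-guided generation with the Outlines library; the model checkpoint and the simplified ingredient pattern are illustrative placeholders rather than a full Cooklang grammar:

import outlines

# Placeholder checkpoint; any transformers-compatible open-source model could be used here.
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")

# Simplified pattern for a single Cooklang ingredient token such as @bacon strips{1%kg}.
ingredient_pattern = r"@[a-zA-Z ]+\{\d*(%[a-zA-Z]+)?\}"

generator = outlines.generate.regex(model, ingredient_pattern)
token = generator("Write one Cooklang ingredient token for one kilogram of bacon strips: ")
print(token)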

6.2 Closing Remarks

Our research illustrates the adaptability of general-purpose LLMs to domain-specific tasks, particularly in the culinary
domain. By demonstrating the efficacy of models like GPT-4o in processing and generating structured recipe content in
Cooklang format, the study bridges the gap between general natural language processing capabilities and specialized
applications. These findings have implications for the development of AI-powered tools across industries. They suggest that,
with appropriate prompting and input structuring, general-purpose LLMs can be effectively applied to domain-specific
challenges without extensive fine-tuning. This study provides a foundation for further research into the application of
LLMs in specialized domains. Future work could explore fine-tuning strategies, expanded datasets, and applications in
other domain-specific areas.

References
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell,
M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.
Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21
(1), January 2020. ISSN 1532-4435.
Cooklang. Cooklang: Recipe markup language. https://github.com/cooklang, 2024a. Accessed: 15.1.2025.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and
predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9),
January 2023. ISSN 0360-0300. doi:10.1145/3560815. URL https://doi.org/10.1145/3560815.
Robert H. Dolin, Liora Alschuler, Calvin Beebe, Paul V. Biron, Sandra L. Boyer, Daniel Essin, Elliot Kimber, Tom
Lincoln, and John E. Mattison. The HL7 clinical document architecture. Journal of the American Medical Informatics
Association, 8(6):552–569, 2001. doi:10.1136/jamia.2001.0080552.
Michael Good. Musicxml for notation and analysis. In The Virtual Score, Volume 12: Representation, Retrieval,
Restoration. The MIT Press, 05 2001. ISBN 9780262316187. doi:10.7551/mitpress/2058.003.0010. URL https://doi.org/10.7551/mitpress/2058.003.0010.
Ahmed Ali and Steve Renals. Word error rate estimation for speech recognition: e-WER. In Iryna Gurevych and
Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 20–24, Melbourne, Australia, July 2018. Association for Computational Linguistics.
doi:10.18653/v1/P18-2004. URL https://aclanthology.org/P18-2004/.
Dietrich Klakow and Jochen Peters. Testing the correlation of word error rate and perplexity. Speech Com-
munication, 38(1):19–28, 2002. ISSN 0167-6393. doi:10.1016/S0167-6393(01)00041-3. URL https://www.sciencedirect.com/science/article/pii/S0167639301000413.


Google Research. Rouge: Recall-oriented understudy for gisting evaluation. https://github.com/google-research/google-research/tree/master/rouge, 2019. Accessed on September 22, 2024.
Abigail See, Peter Liu, and Christopher Manning. Get to the point: Summarization with pointer-generator networks. In
Association for Computational Linguistics, 2017. URL https://arxiv.org/abs/1704.04368.
PromptingGuide. Zero-shot prompting. https://www.promptingguide.ai/techniques/zeroshot, 2023a. URL
https://www.promptingguide.ai/techniques/zeroshot. Accessed: 15.9.2024.
PromptingGuide. Few-shot prompting. https://www.promptingguide.ai/techniques/fewshot, 2023b. URL
https://www.promptingguide.ai/techniques/fewshot. Accessed: 16.1.2025.
Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab.
Optimizing instructions and demonstrations for multi-stage language model programs. In Yaser Al-Onaizan, Mohit
Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language
Processing, pages 9340–9366, Miami, Florida, USA, November 2024. Association for Computational Linguistics.
doi:10.18653/v1/2024.emnlp-main.525. URL https://aclanthology.org/2024.emnlp-main.525/.
David A. Mundie. Computerized cooking. https://diyhpl.us/~bryan/papers2/CompCook.html, 1985. Accessed:
15.1.2025.
Google Developers. Recipe, howto, and itemlist structured data. https://developers.google.com/search/
docs/appearance/structured-data/recipe, 2024. URL https://developers.google.com/search/
docs/appearance/structured-data/recipe. Google Search Central Documentation, Accessed : 15.1.2025.
Alexey Dubovskoy. Cooking patterns. https://alexey.ch/post/2015/02/cooking-patterns/„ feb 2015. URL
https://alexey.ch/post/2015/02/cooking-patterns/. Accessed: 27.12.2024.
Rishi Bommasani and et al. On the opportunities and risks of foundation models. arXiv, abs/2108.07258, 2021.
doi:arXiv:2108.07258.
Vasiliki Pitsilou, George Papadakis, and Dimitrios Skoutas. Using LLMs to Extract Food Entities from Cooking
Recipes . In 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), pages 21–28,
Los Alamitos, CA, USA, May 2024. IEEE Computer Society. doi:10.1109/ICDEW61823.2024.00008. URL
https://doi.ieeecomputersociety.org/10.1109/ICDEW61823.2024.00008.
A. Rostami. An Integrated Framework for Contextual Personalized LLM-Based Food Recommendation. Ph.d.
dissertation, UC Irvine, 2024. URL https://escholarship.org/uc/item/4b7448hc. ProQuest ID: Ros-
tami_uci_0030D_18859, Merritt ID: ark:/13030/m5vr48xn.
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework
for large language model to reason over structured data. arXiv preprint arXiv:2305.09645, 2023.
Fnu Mohbat and Mohammed J Zaki. Llava-chef: A multi-modal generative model for food recipes. arXiv preprint
arXiv:2408.16889, 2024.
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306,
June 2024. doi:10.1109/CVPR52733.2024.02484.
Prof. Pratiksha Prakash Pansare, Kunal Navnath Khatik, Niraj Nandkumar Shigvan, and Rohan Vaijanath Lande.
A review on recipe generation from food image using machine learning. International Journal of Advanced Re-
search in Science, Communication and Technology, 2024. URL https://api.semanticscholar.org/CorpusID:
274084737.
Michał Bień, Michał Gilski, Martyna Maciejewska, Wojciech Taisner, Dawid Wisniewski, and Agnieszka Lawrynow-
icz. RecipeNLG: A cooking recipes dataset for semi-structured text generation. In Brian Davis, Yvette Gra-
ham, John Kelleher, and Yaji Sripada, editors, Proceedings of the 13th International Conference on Natural
Language Generation, pages 22–28, Dublin, Ireland, December 2020. Association for Computational Linguistics.
doi:10.18653/v1/2020.inlg-1.4. URL https://aclanthology.org/2020.inlg-1.4/.
Pengfei Zhou, Weiqing Min, Chaoran Fu, Ying Jin, Mingyu Huang, Li Xiangyang, Shuhuan Mei, and Shuqiang Jiang.
Foodsky: A food-oriented large language model that passes the chef and dietetic examination. arXiv, abs/2406.10261,
2024. doi:10.48550/arXiv.2406.10261. URL https://arxiv.org/abs/2406.10261.
Angelos Mavrogiannis, Christoforos Mavrogiannis, and Yiannis Aloimonos. Cook2ltl: Translating cooking recipes
to ltl formulae using large language models. In 2024 IEEE International Conference on Robotics and Automation
(ICRA), pages 17679–17686, 2024. doi:10.1109/ICRA57147.2024.10611086.
Alexey Dubovskoy. Ai and the evolution of recipe formats. https://cooklang.org/blog/03-ai-and-the-
evolution-of-recipe-formats/, 2024. URL https://cooklang.org/blog/03-ai-and-the-evolution-
of-recipe-formats/. Accessed: 27.12.2024.

18
arXiv Template A P REPRINT

Jiazheng Li, Runcong Zhao, and Lin Gui. Overprompt: Enhancing chatgpt capabilities through an efficient in-context
learning approach. arXiv, abs/2305.14973, 2023. doi:10.48550/arXiv.2305.14973. URL https://arxiv.org/abs/
2305.14973.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In
Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 3816–3830, Online, August 2021. Association for Computational
Linguistics. doi:10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295/.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and
Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th
International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran
Associates Inc. ISBN 9781713871088.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, and Denny Zhou. Self-consistency improves chain of
thought reasoning in language models. ArXiv, abs/2203.11171, 2022. URL https://api.semanticscholar.org/
CorpusID:247595263.
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang,
Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of
large language models. ACM Trans. Intell. Syst. Technol., 15(3), March 2024. ISSN 2157-6904. doi:10.1145/3641289.
URL https://doi.org/10.1145/3641289.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp
models with checklist. arXiv, abs/2005.04118, 2020. doi:10.48550/arXiv.2005.04118. URL https://arxiv.org/
abs/2005.04118.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil
Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
2024. URL https://arxiv.org/abs/2407.21783.
Meta AI. Llama 3.1: Meta’s latest AI model is now available. https://ai.meta.com/blog/meta-llama-3-1/,
February 2024a. URL https://ai.meta.com/blog/meta-llama-3-1/. Accessed: 15.9.2024.
Meta AI. Llama models, 2024b. URL https://github.com/meta-llama/llama-models. Accessed: 15.9.2024.
OpenAI. GPT-4. https://platform.openai.com/docs/models/gpt-4o, 2024a. Accessed: 15.9.2024.
OpenAI. GPT-4. https://platform.openai.com/docs/models/gpt-4o-mini, 2024b. Accessed: 15.9.2024.
Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm.
In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York,
NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380959. doi:10.1145/3411763.3451760.
URL https://doi.org/10.1145/3411763.3451760.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq,
Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy:
Compiling declarative language model calls into self-improving pipelines. In The Twelfth International Conference
on Learning Representations, 2024.
Cooklang. Cooklang recipes repository. https://github.com/Cooklang/recipes, 2024b. URL https:
//github.com/Cooklang/recipes. Accessed: 2024-09-18.
Cooklang. Cooklang specification examples. https://github.com/Cooklang/spec/tree/main/examples,
2024c. URL https://github.com/Cooklang/spec/tree/main/examples. Accessed: 16.1.2025.
Cooklang Community. Awesome cooklang recipes. https://github.com/cooklang/awesome-cooklang-
recipes, 2024. URL https://github.com/cooklang/awesome-cooklang-recipes. Accessed: 2024-09-
18.
CookLang. Cookcli: Command line interface for cooklang. https://github.com/cooklang/cookcli, 2024. URL
https://github.com/cooklang/cookcli. Accessed: 15.9.2024.
Ollama. Ollama: Get up and running with large language models, locally. https://ollama.com/, 2024. URL
https://ollama.com/. Accessed on 16 September 2024.
Brandon T. Willard and Rémi Louf. Efficient guided generation for large language models. ArXiv, abs/2307.09702,
2023. URL https://api.semanticscholar.org/CorpusID:260278488.
Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous
effort to measure large language models’ reasoning performance. arXiv, abs/2305.17306, 2023. URL https:
//arxiv.org/abs/2305.17306.
