Large Language Model Enhanced Text-to-SQL Generation: A Survey

Xiaohu Zhu, Qian Li, Lizhen Cui, Yongkang Liu

arXiv:2410.06011v1 [cs.DB] 8 Oct 2024

Abstract—Text-to-SQL translates natural language queries into Structured Query Language (SQL) commands, enabling users to interact with databases using natural language. Essentially, text-to-SQL is a text generation task, and its development depends primarily on advances in language models. With the rapid development of Large Language Models (LLMs) in particular, the paradigm of text-to-SQL has undergone significant changes. Existing survey work mainly focuses on rule-based and neural-based approaches and still lacks a survey of text-to-SQL with LLMs. In this paper, we survey large language model enhanced text-to-SQL generation, classifying existing methods into prompt engineering, fine-tuning, task-training, and LLM agent groups according to their training strategies. We also summarize datasets and evaluation metrics comprehensively. This survey can help readers better understand the paradigm, research status, and challenges of LLM-based text-to-SQL generation.

Index Terms—Text-to-SQL, Large Language Models, Prompt Engineering, Fine-Tuning, Database Querying

Fig. 1. Flowchart of the Text-to-SQL task. The flowchart illustrates the process where user questions and the database schema are first collected. These inputs are then processed through prompt engineering and fine-tuning techniques before being passed to a large language model (LLM). The LLM generates the corresponding SQL query based on the refined inputs, allowing for accurate query formulation from natural language input.

I. INTRODUCTION

Data has become a crucial production factor [1], [2] in human productive activities. With the proliferation of electronic devices, more and more databases have appeared, storing massive amounts of information from all sorts of areas [3], [4]. However, the threshold for learning a database query language such as SQL is relatively high for ordinary people. Even for practitioners, it is troublesome to write a large number of query statements with guaranteed correctness for different domain databases and application scenarios. To lower the barrier to database querying, the text-to-SQL task translates natural language queries into Structured Query Language (SQL) commands, enabling users to interact with databases using natural language.

Fig. 1 gives an example of a text-to-SQL task. Given a natural language question Q and a database schema S:

Q: What is the name of the employee with the highest salary?
S: Table: Employees (ID, Name, Salary)

The goal of text-to-SQL is to generate a SQL query Ŷ:

SELECT Name FROM Employees ORDER BY Salary DESC LIMIT 1;

After converting text into SQL, we can retrieve relevant knowledge from databases, thus breaking down the barrier between natural language and structured data [5].

The history of text-to-SQL goes back to 1973, when [6] developed the LUNAR system, which was primarily used to answer questions about rocks brought back from the Moon. The earliest research was mostly based on hand-designed rules [7], which are suitable for uncomplicated or specific scenarios. As the amount and domains of data grow exponentially, these rule-based methods become expensive. Deep neural networks then took center stage as foundational approaches, such as LSTM-based [8] and Transformer-based [9] methods. However, they are confronted with problems such as data sparsity and generalization issues. Recently, with the significant improvement of inference and generalization abilities in Large Language Models (LLMs), many works use LLMs to generate SQL queries and have achieved a far stronger ability to understand natural language than previous approaches [10]. For instance, ChatGPT-4 [11] has achieved top performance on the Spider [12] dataset, setting a new standard for execution accuracy. Existing survey work mainly focuses on rule-based and neural-based approaches and still lacks a survey of text-to-SQL with LLMs. In this paper, we survey large language model enhanced text-to-SQL generation methods, classifying them into prompt, fine-tuning, task-training, and agent groups according to their training strategies, as shown in Fig. 3:

• Prompt: (No training) These methods use well-designed prompts to guide LLMs to generate more accurate SQL queries, enabling powerful LLMs to generate SQL from zero-shot [13], [14] or few-shot [15], [16] examples.
• Fine-Tuning: (Training from pretrained LLMs) These methods fine-tune LLMs to adapt them to the text-to-SQL task, including full-parameter fine-tuning [17], [18] and parameter-efficient fine-tuning [19], [20].
• Task-Training: (Training from scratch) These methods train a task-specific text-to-SQL model with training strategies similar to those of LLMs, such as Transformers [21], [22] and mixtures of experts [23].
• LLM Agent: (Training with multiple agents and external tools) These methods coordinate multiple agents to dynamically generate and correct SQL queries, handle database matching issues, and improve query accuracy and execution through external tools [24], [25].

Meanwhile, we summarize the datasets and metrics for large language model enhanced text-to-SQL generation. For datasets, we explore their characteristics in terms of data source scenarios, number of tables, SQL complexity, and number of conversation rounds by systematically reviewing single-domain, cross-domain, and augmented datasets, and we analyze the challenges and limitations of these datasets in real-world applications. For metrics, we consider exact matching accuracy, execution accuracy, valid efficiency score, and test-suite accuracy for handling single-turn and multi-turn interactions.

In the following, we first give the preliminaries in Section II, then introduce the metrics and datasets in Section III, then give detailed descriptions of the methods in Section IV, and finally present a conclusion and future work.
[Fig. 2 shows a taxonomy tree. Evaluation metrics: Exact Matching Accuracy, Execution Accuracy, Valid Efficiency Score, Test-suite Accuracy. Datasets: single-domain (ATIS, GeoQuery, Scholar, Advising), cross-domain (WikiSQL, Spider, KaggleDBQA, DuSQL, BIRD, BEAVER, CoSQL, CHASE), and augmented (ADVETA, Spider-Gen, Spider-DK, Spider-SS&CG, Spider-SYN, Spider-SSP, Spider-Realistic, CSpider, TrustSQL, BigTable-0.2k, SParC). Methods: prompt engineering (zero-shot prompting, few-shot prompting, chain of thought; e.g., TRANX, SC-prompt, MCS-SQL, SQL-PaLM, ACT-SQL), fine-tuning (parameter-efficient and full-parameter fine-tuning; e.g., Knowledge-to-SQL, SGU-SQL, DIN-SQL, MAC-SQL), task training (mixture-of-experts models such as SQL-GEN and Transformer-based models such as CodeS, SQLova, RESDSQL), traditional methods (LSTM-based, e.g., TypeSQL, Seq2SQL, SQLNet, SyntaxSQLNet, IRNet, RAT-SQL, Bridge, SDSQL, SLSQL, IESQL, SEAD, SmBoP, DT-Fixup, RaSaP, GNN, ShadowGNN, SADGA, LGESQL, UnifiedSKG; and Transformer-based, e.g., MIGA, GraPPa, StruG, SCoRe, TaBERT, TAPAS, MATE, TableFormer, TAPEX, S²SQL, IST-SQL, IGSQL, GAZP, EditSQL), and LLM agent methods (MAC-SQL, Tool-SQL, SQLFixAgent, MAG-SQL, MAGIC, Distyl AI's Analytics Insight Engine, SuperSQL).]
Fig. 2. The overview of the text-to-SQL metrics, datasets, and methods.

II. PRELIMINARIES

Text-to-SQL systems enable users to input questions directly in natural language, and the system automatically generates the corresponding SQL query statements.

A. The Text-to-SQL Problem

Given a natural language question Q and a database schema S, the objective of text-to-SQL is to generate an accurate SQL query Ŷ that retrieves the desired output from the database D. This can be conceptualized as a sequence-to-sequence [26] problem:

• Input:
  – A natural language question: Q = (q_1, q_2, ..., q_n), where q_i represents the i-th token in the question.
  – A database schema: S = {T_1, T_2, ..., T_m}, where T_i represents the i-th table in the database, and each table T_i includes columns C_i1, C_i2, ..., C_ip.
• Output:
  – A SQL query: Ŷ = (ŷ_1, ŷ_2, ..., ŷ_k), where ŷ_i represents the i-th token in the generated query.

The task can be expressed as finding the most probable SQL query Ŷ given a natural language question Q and a database schema S:

Ŷ = arg max_Y P(Y | Q, S)

B. Methodology

To solve this problem, modern methodologies typically employ deep learning models [27], particularly Encoder-Decoder (ED) architectures. A high-level overview of the process is as follows:

• Encoding: The encoder processes the input question Q and the schema S to create a contextual representation:

  h = Encoder(Q, S)

  where h is the hidden state or contextual representation generated by the encoder.
• Decoding: The decoder generates the SQL query Ŷ token by token based on the encoded representation h. The probability of the SQL query factorizes as:

  P(Y | Q, S) = ∏_{i=1}^{k} P(ŷ_i | h, ŷ_{1:i−1})

  where ŷ_{1:i−1} are the previously generated tokens and P(ŷ_i | h, ŷ_{1:i−1}) is the probability of the i-th token given the context and previous tokens.
• Optimization: The model is trained to maximize the likelihood of the correct SQL query Y given the training data. The loss function typically used is the negative log-likelihood of the correct query:

  L = − ∑_{i=1}^{k} log P(y_i | h, y_{1:i−1})

  where y_i are the tokens of the ground-truth SQL query.
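To make the encoder-decoder formulation above concrete, the following is a minimal, illustrative PyTorch sketch (not taken from any surveyed system): a toy GRU encoder-decoder trained with the negative log-likelihood objective L defined above. The vocabulary size, token ids, and hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Seq2SQL(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, question_schema_ids, sql_ids):
        # Encoding: h = Encoder(Q, S); question and serialized schema form one sequence.
        _, h = self.encoder(self.embed(question_schema_ids))
        # Decoding with teacher forcing: predict y_i from h and y_{1:i-1}.
        dec_out, _ = self.decoder(self.embed(sql_ids[:, :-1]), h)
        return self.out(dec_out)  # logits for each target position

vocab_size = 1000
model = Seq2SQL(vocab_size)
x = torch.randint(0, vocab_size, (2, 32))   # toy (question + schema) token ids
y = torch.randint(0, vocab_size, (2, 12))   # toy gold SQL token ids
logits = model(x, y)
# L = -sum_i log P(y_i | h, y_{1:i-1}), averaged over target tokens here.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), y[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```

In practice, the encoder input concatenates the question tokens with a serialized form of the schema, and modern systems replace these toy GRUs with pre-trained Transformers.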
C. Challenges

Ambiguity is one of the most prevalent and intractable problems in natural language processing; it denotes the phenomenon whereby one and the same linguistic form may be interpreted in more than one way [28], [29].

1) Word Segmentation and Sense Ambiguity: Word segmentation ambiguity refers to the phenomenon that different meanings arise when characters are combined into words. In Indo-European languages, most words are separated by spaces or punctuation. However, in languages such as Chinese and Japanese, words are usually not separated by a space or punctuation mark. Consequently, ambiguity arises when consecutive characters are segmented into words. Word sense ambiguity denotes the phenomenon where a word shares identical orthography while possessing distinct semantic interpretations. For example, bank (noun): (1) sloping land, especially the slope beside a body of water ("they pulled the canoe up on the bank"); (2) a financial institution that accepts deposits and channels the money into lending activities ("he cashed a check at the bank").

2) Database Size and Diversity: Realistic databases often contain hundreds of tables and columns, and the relationships between different tables can be very complex. Due to the sheer size of the database schema, text-to-SQL systems are often unable to incorporate all relevant table structure information in a single prompt. This poses a challenge because, without the complete schema context, it is difficult for the model to generate correct SQL queries. To further complicate matters, databases in different domains may have completely different naming conventions, formats, and table structures. For example, column names in some databases may not have intuitive meanings (e.g., "col1", "data1") or may contain a large number of abbreviations and naming ambiguities, which requires the model to have good reasoning capabilities to correctly understand the relationships between tables and columns. In addition, the data types and formats in the database are diverse; for example, dates may have different representations (e.g., "2024-01-01" or "year2024"), which further increases the difficulty of data parsing.

3) Complexity of SQL Queries: The complexity of SQL queries is usually related to the query structure, involving operations such as multi-table joins, nested subqueries, and complex conditional filtering. For example, a query may contain SELECT, JOIN, GROUP BY, HAVING, and nested WHERE conditions at the same time, which requires the model to understand and generate complex SQL statements efficiently. Particularly in multi-table queries, the model must be able to infer table-to-table relationships (e.g., foreign keys) and ensure that the generated queries are logically correct. SQL with nested subqueries requires the model to handle hierarchical relationships, while conditional filtering and aggregation operations require it to generate precise expressions based on different column types and values. In addition, certain queries in a given domain may require specific SQL functions or operations (e.g., regular expression matching), which further complicates the SQL generation process.

In a paragraph or chapter consisting of multiple sentences, various kinds of ambiguities remain, such as referential ambiguity and ellipsis ambiguity. Referential ambiguity is the possibility of ambiguity in the entities referred to by pronouns (e.g., I, you, he) and pronominal phrases.

4) Pragmatic Ambiguity: Pragmatic ambiguity pertains to ambiguity arising from context, speaker attributes, scene, and other situational pragmatic aspects. A sentence may elicit varying interpretations across different contexts. The following example shows that the same sentence can yield different meanings in different scenes.

Sentence: Do you know how to get to Fifth Avenue?

If the speaker is a tourist and the listener is a policeman, the sentence asks for directions. If the speaker is also a tourist but the listener is a taxi driver, then the sentence asks for a ride to Fifth Avenue.

5) Robustness and Efficiency: Users may input queries with spelling mistakes, grammatical errors, or incomplete sentences. In real-world applications, users often provide imperfect or ambiguous queries due to human error or lack of domain knowledge. The model should be robust enough to deal with such inputs.

Correctly mapping the relevant information in natural language to the corresponding parts of the database is one of the most critical steps. The model needs to understand information such as the structure and constraints of the schema. In addition, the schema format varies across databases, so the model must generalize well.

One of the basic requirements is the correctness of the generated SQL query. Due to the strictness of SQL, even a small error can cause the entire query to fail to produce the correct answer.

The execution efficiency of SQL queries is also a key consideration in practical applications. Even if the model is able to generate accurate SQL queries, queries with excessively low execution efficiency, especially on large-scale databases, limit the practicality of the system. A good text-to-SQL system should not only generate correct queries but also ensure that these queries can be executed quickly and efficiently on the actual database, reducing system response time and improving user experience.
III. METRICS AND DATASETS

A. Evaluation Metrics

To evaluate the performance of text-to-SQL models, several key metrics are employed to assess the accuracy, execution effectiveness, and overall efficiency of the generated SQL queries. These metrics provide a comprehensive understanding of how well a model translates natural language questions into valid SQL statements, ensuring both syntactic correctness and proper execution on the given database schema. In this section, we introduce four commonly used evaluation metrics: Exact Matching Accuracy (EM), Execution Accuracy (EX), Valid Efficiency Score (VES), and Test-suite Accuracy (TS). Each of these metrics highlights a different aspect of model performance, from syntactic correctness to real-world execution efficiency.
1) Exact Matching Accuracy (EM): Exact Matching Accuracy [12] requires that the SQL statement generated by the model be exactly the same as the ground-truth answer. It is a critical metric for evaluating text-to-SQL performance. The metric is stringent because of the diversity of SQL syntax: different queries can produce the same results, so the correct SQL query for a task is not always unique.

Exact Matching Accuracy = (1/N) ∑_{i=1}^{N} I(Ŷ_i = Y_i)

where:
• Total number of queries N: the total number of natural language questions used in the evaluation.
• Generated SQL query Ŷ_i: the SQL query generated by the model for the i-th natural language question.
• Reference SQL query Y_i: the correct SQL query for the i-th natural language question, serving as the reference standard.
• Indicator function I(·): equals 1 if the generated SQL query Ŷ_i exactly matches the reference SQL query Y_i, and 0 otherwise.

2) Execution Accuracy (EX): Execution Accuracy compares the result of executing the generated query with the reference answer:

Execution Accuracy = (1/N) ∑_{i=1}^{N} I(f(Q_i, S_i) = A_i)

where:
• N is the total number of queries.
• Q_i denotes the i-th natural language question.
• S_i denotes the database schema corresponding to the i-th question.
• A_i denotes the reference answer for the i-th question.
• f(Q_i, S_i) denotes the result returned by executing the SQL query generated by the model for the i-th question on the database schema S_i.
• I(·) is the indicator function, which equals 1 if the condition inside holds and 0 otherwise.
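To illustrate how EM and EX are computed in practice, here is a minimal sketch using Python's built-in sqlite3 module. It is not the official evaluation script: EM is approximated by whitespace- and case-normalized string equality (official evaluators compare parsed SQL components), and EX compares returned rows as unordered multisets.

```python
import sqlite3
from collections import Counter

def exact_match(pred_sql: str, gold_sql: str) -> bool:
    # Simplified EM: whitespace- and case-normalized string equality.
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred_sql) == norm(gold_sql)

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # Simplified EX: run both queries on the same database and compare
    # the returned rows as unordered multisets.
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
            gold_rows = conn.execute(gold_sql).fetchall()
        except sqlite3.Error:
            return False  # invalid predictions count as wrong
    return Counter(pred_rows) == Counter(gold_rows)

def evaluate(examples, db_path: str):
    # examples: list of (predicted_sql, gold_sql) pairs for one database.
    n = len(examples)
    em = sum(exact_match(p, g) for p, g in examples) / n
    ex = sum(execution_match(p, g, db_path) for p, g in examples) / n
    return {"EM": em, "EX": ex}
```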
3) Valid Efficiency Score (VES): The Valid Efficiency Score (VES) is a common evaluation metric in the text-to-SQL field that considers both the correctness and the efficiency of the generated SQL queries. It accounts for the validity of the SQL query (whether it executes correctly and produces the correct result) and its execution efficiency (how fast it runs). The formula consists of two components. Query validity (correctness) evaluates whether the generated SQL can be successfully executed and returns the same result as the ground-truth SQL query. Query efficiency measures how efficiently the generated SQL query executes compared to the ground truth.

VES = (1/N) ∑_{i=1}^{N} I(Q_i^gen = Q_i^gold) · (T_gold / T_gen)

where:
• N represents the total number of queries;
• Q_i^gen denotes the generated SQL query for the i-th example;
• Q_i^gold denotes the ground-truth SQL query for the i-th example;
• I(·) is the indicator function, equal to 1 if the generated SQL is equivalent to the ground truth and 0 otherwise;
• T_gold is the execution time of the ground-truth SQL query;
• T_gen is the execution time of the generated SQL query.

4) Test-suite Accuracy (TS): Test-suite Accuracy (TS) is a key evaluation metric designed to test model performance on a diverse yet concentrated set of test databases. This metric constructs a small, focused database test suite from a large collection of randomly generated databases, ensuring that the suite achieves high code coverage for the gold SQL queries. By testing the model on this suite, Test-suite Accuracy measures how well the model predicts SQL queries that produce correct results across various database scenarios. The goal is to approximate a tight upper bound on semantic accuracy, since it evaluates not only SQL structure but also execution outcomes.

Test-suite Accuracy = (1/N) ∑_{i=1}^{N} I(f(Q_i, D_i) = R_i)

where:
• N represents the total number of queries in the test suite.
• Q_i is the SQL query generated by the model for the i-th example.
• D_i represents the corresponding database for the i-th query.
• R_i is the expected result of executing the reference SQL query on D_i.
• f(Q_i, D_i) is the result produced by executing the model-generated SQL query on database D_i.
• I(·) is the indicator function, which equals 1 if the generated query result matches the expected result R_i and 0 otherwise.
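A minimal sketch of VES under the same simplifying assumptions (sqlite3 execution, single-run wall-clock timing); benchmark implementations such as BIRD's average timings over repeated runs, which is omitted here. Test-suite Accuracy can be obtained analogously by running the execution-match check of the previous listing over every database in the test suite.

```python
import sqlite3
import time
from collections import Counter

def timed_run(sql: str, db_path: str):
    # One execution of the query: returns (rows, elapsed seconds).
    with sqlite3.connect(db_path) as conn:
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        return rows, time.perf_counter() - start

def valid_efficiency_score(examples, db_path: str) -> float:
    # Follows the formula above: an example contributes T_gold / T_gen
    # only when its result matches the ground truth, and 0 otherwise.
    total = 0.0
    for pred_sql, gold_sql in examples:
        try:
            pred_rows, t_gen = timed_run(pred_sql, db_path)
        except sqlite3.Error:
            continue  # invalid predictions score 0
        gold_rows, t_gold = timed_run(gold_sql, db_path)
        if Counter(pred_rows) == Counter(gold_rows):
            total += t_gold / max(t_gen, 1e-9)
    return total / len(examples)
```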
B. Datasets

Datasets are a fundamental component of training text-to-SQL systems. Training on a large corpus allows a system to automatically acquire the mapping between natural language and SQL without relying on hand-written rules. A text-to-SQL dataset is usually manually labeled with natural language questions and corresponding SQL queries. A natural language question is restricted to the domain of the database, and its answer comes from that database; in essence, the question describes a SQL query, and executing the SQL query yields the answer to the question from the database [30]. Table I provides an overview of commonly used text-to-SQL datasets, summarizing key features such as dataset size, interaction type, and domain coverage, which are essential for evaluating the generalization capability of text-to-SQL models. Datasets in this domain are typically characterized along the following dimensions.

Single/Cross Domain: According to the number of data source scenarios involved, datasets can be divided into single-domain and cross-domain [31]; for example, catering data and tourist attractions constitute two different domains.

Number of dialogue rounds: According to the number of dialogue rounds required for complete SQL generation, datasets are divided into single-turn and multi-turn.

SQL Complexity: Based on the SQL complexity of the natural language problems, datasets are classified into simple and complex, where complexity is determined by the number of keywords, the nesting level, and the number of clauses.

1) Single-Domain Datasets: ATIS [32] is derived from the Air Travel Information System (ATIS); it maps user questions to SQL statements and is a single-domain, context-dependent dataset.

GeoQuery [33] is derived from U.S. geography, consists of 880 questions and SQL statements, and is a single-domain, context-independent dataset.

Scholar [34] provides a benchmark that reflects the query requirements of real academic databases. The dataset contains 816 annotated natural language queries and corresponding SQL queries, covering a wide range of information retrieval needs in the academic domain, including academic papers, authors, citations, journals, and keywords.

Advising [35] is a text-to-SQL evaluation dataset focused on student academic advising, drawn from the University of Michigan's course database. Questions were written by students to simulate real questions they might ask during academic advising, and each question is manually annotated with the corresponding SQL query and reviewed by multiple annotators to ensure accuracy and helpfulness.

TPC-DS [55] is a commonly used benchmark in the field of database systems. Compared to BIRD and Spider, its dataset has a significantly more complex structure and models realistic analytical workloads more faithfully. Even current state-of-the-art generative AI models fall short in generating accurate queries on it [56].

2) Cross-Domain Datasets: WikiSQL. The ATIS and GeoQuery datasets suffer from small data size (fewer than a thousand SQL statements) and simple annotation. In 2017, Victor Zhong and other researchers therefore annotated 80,654 training examples based on Wikipedia, covering 26,521 databases, to form WikiSQL [57]. It posed new challenges for model design, requiring models to better construct the mapping relationship, make better use of the attributes in the tables, and pay more attention to the decoding process.

Spider. WikiSQL, however, involves only one table per question and supports only simple SQL operations, which does not match daily-life scenarios well. In 2018, researchers at Yale University therefore introduced the Spider [12] dataset, one of the most complex text-to-SQL datasets to date. It has the following characteristics: 1) the domain coverage is richer, with more than 200 databases from 138 domains, each database containing 5.1 tables on average, and the databases in the training set and test set do not overlap; 2) the SQL statements are more complex, containing ORDER BY, UNION, EXCEPT, GROUP BY, INTERSECT, LIMIT, and HAVING keywords as well as nested queries. The authors divided the SQL statements into four difficulty levels based on their complexity (number of keywords, degree of nesting); under this division, WikiSQL contains only the EASY level.

KaggleDBQA [44] is a cross-domain evaluation dataset of real Web databases with domain-specific data types, original formats, and unrestricted questions. It includes 272 examples across 8 databases, with an average of 2.25 tables per database. The dataset is known for its real-world data sources, natural question-creation environment, and database documentation with rich domain knowledge.

DuSQL [46] is a large-scale Chinese dataset designed specifically for cross-domain text-to-SQL tasks, filling the gap caused by the lack of labeled data for Chinese. The dataset manually analyzes real-world problems in several representative applications and contains a large number of SQL queries involving row or column computations.

BIRD [45] focuses on aspects such as grammatical formulation, ambiguity, specificity, and alignment with the database schema. This benchmark aims to bridge the gap between academic research and real-world applications by focusing on the comprehension of database values and the efficiency of SQL queries over large databases. It introduces challenges such as dealing with dirty database contents, requiring external knowledge to link natural language questions with database contents, and ensuring the efficiency of SQL queries. The dataset includes questions of varying difficulty: simple, moderate, and challenging. Each question is annotated with an optional evidence value, providing context that helps in understanding the query.

BEAVER [47] is used to evaluate the performance of large language models on complex SQL generation tasks. Existing publicly available text-to-SQL datasets (e.g., Spider and BIRD) fall far short of real enterprise environments in terms of database structure and query complexity, resulting in large language models that perform well on these benchmarks but poorly in real-world enterprise environments. The BEAVER benchmark constructs a more representative dataset by anonymizing the data warehouses of two enterprises, which contain complex table joins and aggregations.
TABLE I
Dataset statistics. A summary of key Text-to-SQL datasets, highlighting attributes such as size, interaction type, domain coverage, language, release year, and link.

Name | Train | Valid | Test | Turn | Type | Language | Year | Link
ATIS [32] | 4473 | 497 | 448 | Single-Turn | Single-Domain | English | 1990 | https://github.com/howl-anderson/ATIS_dataset
GeoQuery [33] | 600 | - | 280 | Single-Turn | Single-Domain | English | 1996 | https://www.cs.utexas.edu/~ml/nldata/geoquery.html
Scholar [34] | 600 | - | 216 | Single-Turn | Single-Domain | English | 2017 | https://metatext.io/datasets/scholar
Advising [35] | 4791 | - | - | Single-Turn | Single-Domain | English | 2018 | https://github.com/jkkummerfeld/text2sql-data
EHRSQL [36] | 5124 | 1163 | 1167 | Multi-Turn | Single-Domain | English | 2023 | https://github.com/glee4810/EHRSQL
WikiSQL [37] | 56355 | 8421 | 15878 | Single-Turn | Cross-Domain | English | 2017 | https://github.com/salesforce/WikiSQL
Spider [12] | - | - | - | Single-Turn | Cross-Domain | English | 2018 | https://github.com/taoyds/spider
Spider-SYN [38] | - | - | - | Single-Turn | Cross-Domain | English | 2021 | https://github.com/ygan/Spider-Syn
Spider-DK [39] | - | - | - | Single-Turn | Cross-Domain | English | 2021 | https://github.com/ygan/Spider-DK
Spider-SS&CG [40] | 7000 | 1034 | 2147 | Single-Turn | Cross-Domain | English | 2022 | https://github.com/ygan/SpiderSS-SpiderCG
Spider-GEN [41] | - | - | - | Single-Turn | Cross-Domain | English | 2023 | https://github.com/ManasiPat/Spider-Gen
Spider-Realistic [42] | - | - | - | Single-Turn | Cross-Domain | English | 2024 | https://zenodo.org/records/5205322
Spider-SSP [43] | - | - | - | Single-Turn | Cross-Domain | English | 2021 | https://github.com/google-research/language/tree/master/language/nqg
KaggleDBQA [44] | 272 | - | - | Single-Turn | Cross-Domain | English | 2021 | https://www.microsoft.com/en-us/research/publication/kaggledbqa-realistic-evaluation-of-text-to-sql-parsers
BIRD [45] | 8659 | 1034 | 2147 | Single-Turn | Cross-Domain | English | 2023 | https://github.com/MohammadrezaPourreza/Few-shot-NL2SQL-with-prompting
BigTable-0.2k [42] | 200 | - | - | Single-Turn | Cross-Domain | English | 2024 | -
DuSQL [46] | 18602 | 2039 | 3156 | Single-Turn | Cross-Domain | Chinese | 2020 | https://paperswithcode.com/paper/dusql-a-large-scale-and-pragmatic-chinese
BEAVER [47] | 93 | - | - | Multi-Turn | Cross-Domain | English | 2024 | https://peterbaile.github.io/beaver/
CoSQL [48] | 2164 | 292 | 551 | Multi-Turn | Cross-Domain | English | 2019 | https://yale-lily.github.io/cosql
ADVETA [49] | - | - | - | Multi-Turn | Cross-Domain | English | 2022 | https://github.com/microsoft/ContextualSP
CSpider [50] | 6831 | 954 | 1906 | Single-Turn | Cross-Domain | Chinese | 2019 | https://github.com/taolusi/chisp
TrustSQL [51] | - | - | - | Single-Turn | Cross-Domain | English | 2024 | https://github.com/glee4810/TrustSQL
SParC [52] | 9025 | 1203 | 2498 | Multi-Turn | Cross-Domain | English | 2018 | https://github.com/taoyds/sparc
CHASE [53] | 3949 | 755 | 755 | Multi-Turn | Cross-Domain | Chinese | 2021 | https://github.com/xjtu-intsoft/chase
SQUALL [54] | 9030 | 2246 | 4344 | Single-Turn | Cross-Domain | English | 2020 | https://github.com/tzshi/squall
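For concreteness, the following sketch loads examples in the Spider-style JSON format (question, db_id, gold query, plus a tables.json schema file); the field names follow Spider's released files and are an assumption for the other datasets in Table I, which ship in their own formats.

```python
import json

# Minimal sketch: reading Spider-style examples (train_spider.json) and the
# schema file (tables.json) released with the dataset.
with open("train_spider.json", encoding="utf-8") as f:
    examples = json.load(f)
with open("tables.json", encoding="utf-8") as f:
    schemas = {db["db_id"]: db for db in json.load(f)}

ex = examples[0]
schema = schemas[ex["db_id"]]
print(ex["question"])                       # natural language question Q
print(ex["query"])                          # gold SQL query Y
print(schema["table_names_original"])       # tables of schema S
print(schema["column_names_original"][:5])  # (table index, column name) pairs
```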

CoSQL [48] is a cross-domain conversational text-to-SQL dataset designed for building general-purpose database query dialogue systems. The dataset contains more than 3,000 dialogues, more than 30,000 dialogue turns, and more than 10,000 annotated SQL queries covering 200 complex databases in 138 different domains. CoSQL collects dialogues using the Wizard-of-Oz (WOZ) method, in which a simulated user asks database query questions and a SQL expert on the other side builds the corresponding SQL queries and returns the results.

CHASE [53] is a large-scale and practical Chinese dataset with cross-database context dependencies. The dataset aims to close the gap left by existing datasets in terms of context dependency and SQL query complexity. CHASE contains 5,459 question sequences with a total of 17,940 questions annotated with SQL queries, distributed across 280 multi-table relational databases.

EHRSQL [36] is a text-to-SQL benchmark for Electronic Health Record (EHR) data, aiming to evaluate the real-world applicability of models in the healthcare domain. The dataset consists of real questions from 222 hospital staff (including doctors, nurses, insurance auditors, etc.), covering common retrieval needs in healthcare scenarios, such as patient information lookup and complex statistical computations.

3) Augmented Datasets: ADVETA [49] is the first benchmark specifically designed to evaluate the robustness of text-to-SQL models under table perturbation, whereas previous research focused on perturbations on the natural language side of the problem and ignored the diversity of the tables themselves.

Spider-DK [58] focuses on the ability of a model to work with data that requires domain-specific knowledge. It recasts the challenge into cases such as implicit query columns, simple reasoning, synonym replacement, and conditional generation. Spider-DK is a way to determine whether a model has a basic understanding of the data and uses existing information to process new content.

Spider-SS&CG [59] addresses the task through Schema Simplification and Complexity Generation on real databases. As training progresses, its databases are further simplified and complicated. Spider-SS&CG combines these two elements to check, respectively, performance on simplified and complexified database structures.

Spider-SYN [60] (Synonym Substitution) introduces synonym substitution to model the synonym variation found in real language.
The model is tested for robustness on a dataset in which schema-related words, such as table names and column names, are replaced with their synonyms. Schema linking becomes the main difficulty in this case, because the surface naming no longer matches the schema, as in cross-domain tests.

Spider-SSP [43] (Schema-Specific Parsing) concentrates on schema-oriented parsing to test whether schema generalization is feasible by changing the names of columns and tables in the schema. It stresses schema-dependent parsing capabilities, which are tested against unknown database structures or schema name changes.

Spider-Realistic [61] generates question-SQL pairs that are more relevant to real-world applications. The dataset targets performance on real-life enterprise databases, which requires the model to work under real-world conditions, particularly when dealing with complex, multi-level queries.

CSpider [50] addresses the status of Chinese as a low-resource language in this task area. Chinese text requires word segmentation, while SQL keywords and the column names of database tables are usually written in English. Word-based semantic parsers are susceptible to word-segmentation errors, and cross-language word embeddings are very helpful for text-to-SQL mapping.

TrustSQL [51] aims to evaluate models on their ability to either generate a correct SQL query or abstain when the question is unanswerable or the generated SQL is likely to be incorrect. The data is split in two ways: question-based, which assesses the model's ability to handle different phrasings, and query-based.

BigTable-0.2k [42] is built upon the BIRD dataset and includes various question complexities and dataset sizes. To comprehensively evaluate the performance of LLMs, the authors design five distinct tasks: Text-to-SQL, SQL Debugging, SQL Optimization, Schema Linking, and SQL-to-Text.

SParC [52] demonstrates complex contextual dependencies and greater semantic diversity, and it requires the model to generalize to unseen domains due to its cross-domain nature and the unseen databases used in testing.

IV. METHODOLOGY

In this section, we describe in detail traditional text-to-SQL methods and their evolution. Traditional methods mainly rely on encoder-decoder models: they typically use LSTM-based and Transformer-based architectures to generate SQL queries by learning a contextual representation between natural language questions and database tables. In this context, we also discuss a variety of existing frameworks and techniques, including models based on graph neural networks, table semantic understanding, and schema linking. These techniques further enhance the accuracy and efficiency of text-to-SQL by improving schema linking, reducing error propagation, and optimizing the use of pre-trained models.

In addition to the traditional LSTM and Transformer models, pre-trained models such as BERT, GPT, and T5 have dramatically changed the direction of the text-to-SQL field. Trained on large-scale text data, these models are able to capture rich semantic representations connecting natural language and SQL, not only by fine-tuning them for specific tasks but also by combining them with prompt engineering to build applications that do not require large amounts of labeled data. They show high scalability and flexibility in handling complex queries and cross-domain tasks and have become the core of modern text-to-SQL systems. Meanwhile, fine-tuning optimizes the performance of pre-trained models in specific domains through further supervised training on specific datasets.

Figure 2 provides a detailed taxonomy of the various text-to-SQL methods, comparing key attributes such as backbone models, optimization strategies, query generation strategies, and datasets used. It offers a comprehensive overview of the different approaches and highlights the role of pre-trained models in modern text-to-SQL systems.

In addition, the introduction of LLM agents represents a new direction in the development of text-to-SQL. These systems are not only capable of generating SQL queries but can also interact with users through multiple rounds of dialogue to dynamically adjust the generation strategy, showing higher flexibility and adaptability and pointing the way for further development of text-to-SQL technology. In what follows, we discuss the specific implementation of each approach in detail.

A. Traditional Text-to-SQL Methods

This part focuses on text-to-SQL models based on traditional deep neural networks. The discussion centers on model architectures and their application to specific tasks; for instance, the model needs to understand the natural language input and how it corresponds to the database structure of tables and columns [62], [63].

In the earliest text-to-SQL systems, SQL queries were typically generated through template- or rule-based approaches that mapped the natural language input to a set of predefined SQL templates. Performance is significantly limited when such systems face complex database structures and queries. As deep learning techniques matured, more sophisticated methods such as LSTM-based and Transformer-based models began to emerge, offering improved capabilities for handling more complex query scenarios.

A deep neural network system must identify and map out the target, i.e., the data matching the user input. Such a model generally has a dual structure, an encoder and a decoder, each performing its own duty: the encoder is responsible for capturing the semantics of the natural language (NL) question, and the decoder generates a SQL query based on the representation extracted by the encoder [64].

a) LSTM-based Methods: Traditional methods using LSTMs and Transformers generate SQL queries by learning contextual representations of the input natural language questions and database tables. LSTM-based models were among the first deep learning approaches applied to text-to-SQL, as they can effectively model the sequential dependencies between natural language questions and SQL queries.
Early models use Bi-LSTM encoders to learn the semantic representation of question-SQL pairs. While pre-trained models generally outperform these approaches, LSTM-based systems are still employed in some cases. These methods, especially LSTM and its variants, leverage sequential processing to capture context; however, they face challenges when dealing with long-range dependencies in complex queries. Traditional techniques work with LSTMs [8] and Transformers [9], [65] by learning contextual representations from the input natural language questions and database tables. Models like TypeSQL [66], Seq2SQL [67], SQLNet [68], and SyntaxSQLNet [69] use Bi-LSTM to learn the semantic representation of question-SQL pairs. Pre-trained models yield better accuracy than these methods, but LSTM-based systems are still employed in certain settings; for example, IRNet [70] and RAT-SQL [71] take advantage of LSTMs in their grammar decoders to produce abstract syntax trees. Although these LSTM-based methods laid the foundation for text-to-SQL research, their limitations in scalability and cross-domain generalization led researchers to explore more advanced architectures, such as Transformer models.

b) Transformer-based Methods: More recently, work has focused on Transformer-based models, complementing them with structures designed to improve text-to-SQL performance. These models introduced a new paradigm by employing self-attention mechanisms, which allows them to handle long-range dependencies more effectively than LSTM models. This shift in architecture significantly improved the ability to generate accurate SQL queries, even for complex database schemas. For instance, GraPPa [72] introduces grammar-augmented pre-training, which enriches schema understanding. GAP [73] focuses on contextual representations for semantic parsing tasks. StruG [61] presents a structure-grounded objective for text-to-table encoding that emphasizes table structure. SCoRe [74] proposes schema-compound embeddings for structured-data tasks. TaBERT [75] is trained to jointly understand the semantics of tabular and textual data, resulting in a better grasp of semantic parsing. TAPAS [76] pre-trains on tabular data for table-based question answering. MATE [77] applies multi-view attention to table learning tasks. TableFormer [78] proposes a robust Transformer model for table comprehension. TAPEX [79] performs table pre-training via logical programs. S²SQL [80] injects syntax into question-schema interactions to improve SQL generation. IST-SQL [81] tracks interaction states in multi-turn SQL tasks, while IGSQL [82] aids schema linking in intricate SQL queries. GAZP [83] is a neural approach to zero-shot parsing designed for table question answering. EditSQL [84] adopts an editor-like generation method for SQL query formulation. These Transformer-based models introduced innovative techniques such as attention mechanisms and pre-training, which greatly enhanced performance, particularly in schema linking and complex query generation. These efforts illustrate the three major directions for improving text-to-SQL performance: schema linking, error propagation, and query generation based on pre-trained language models.

TRANX [85] is a neural abstract syntax parser that transfers natural language into an internal representation by first dividing the text into smaller portions and then identifying the relationships between the individual components. It also exploits in-context learning by feeding the model examples that frequently occur in the contexts it has seen, and its architecture includes multiple prompt design strategies for improving performance, such as selecting the examples most relevant to the input query and adding extra schema information during task execution. A number of frameworks and techniques have since been proposed to improve this area. Bridge [86] links and integrates textual and tabular data to provide more efficient schema linking. SDSQL [87] improves text-to-SQL by leveraging schema dependency structures. SLSQL [88] analyzes different techniques for schema linking in text-to-SQL models. IESQL [89] increases the efficiency of mention extraction and schema grounding. SEAD [90] is an end-to-end text-to-SQL generation method that employs more sophisticated schema linking. SmBoP [91] proposes a bottom-up, non-autoregressive method for text-to-SQL parsing. DT-Fixup [92] fine-tunes small Transformer-based models to enhance their capability on small text-to-SQL datasets. RaSaP [93] performs semi-autoregressive parsing with hybrid relation-aware models. GNN [94] utilizes graph neural networks for schema representation to improve schema linking. ShadowGNN [95] uses a graph projection neural network to boost schema referencing and text-to-SQL accuracy. SADGA [96] employs a structure-aware dual-graph architecture that provides better schema alignment. LGESQL [97] uses graph representation learning for schema linking in text-to-SQL query generation. UnifiedSKG [98] unifies multitasking across structured knowledge grounding tasks, including text-to-SQL. These frameworks and techniques represent significant advances in the text-to-SQL domain, particularly in addressing schema linking, query parsing, and overall model efficiency. By incorporating methods such as schema dependency structures, graph neural networks, and semi-autoregressive parsing, modern text-to-SQL models are better equipped to handle complex databases and intricate SQL queries. As text-to-SQL models continue to evolve, the combination of robust schema linking mechanisms, powerful pre-trained models, and advanced parsing techniques promises to bring about more accurate, scalable, and efficient systems capable of handling real-world tasks.
[Fig. 3 shows three panels: (a) zero-shot, (b) few-shot, and (c) reasoning. In each panel, the NL question and the database schema are combined into a prompt that is processed by the model. The zero-shot panel shows an example prompt: "### Given a natural language question and a database schema, generate the corresponding SQL query to retrieve the required information. ### You are provided with a table named Employees containing three columns: ID, Name, and Salary. Your task is to generate a SQL query that answers the following question based on this table." The few-shot panel combines a structure prompt (SQL structure) with a content prompt (SQL content), as in SC-prompt; the reasoning panel inserts intermediate thoughts before the final SQL is produced.]
Fig. 3. Prompt Engineering Methods. The figure illustrates three key prompt engineering approaches for Text-to-SQL: (a) zero-shot, where the model generates
SQL without prior examples; (b) few-shot, which provides a few examples to guide query generation; (c) Reasoning, breaking down the reasoning process
step-by-step for complex queries.

B. Text-to-SQL with Prompt

Prompt engineering is a technique that improves the performance and reliability of language models by carefully crafting the input prompts. While traditional methods such as LSTM-based and Transformer-based models learn contextual representations through supervised training, prompt engineering offers an alternative that leverages the capabilities of large pre-trained language models (LLMs) without requiring additional fine-tuning. The method relies on carefully chosen prompts or sentences that guide the model toward outputs relevant to the user's intention (see Fig. 3). Unlike earlier methods that required extensive training on labeled datasets to improve accuracy and generalization, prompt engineering usually requires no training at all: it works directly on top of the pre-trained model, which generates SQL from the prompt. The merit of such a technique is that results are obtained quickly and without additional computational cost for model tuning.

Well-designed prompts are also important for mitigating problems such as hallucination, since grounding the model in the schema strengthens the reliability of its answers. The basic principle of operation is to generate, step by step, the next output token with the highest probability given the input prompt. Therefore, in contrast to the training-dependent approaches discussed earlier, the core of tackling the text-to-SQL task with LLMs [99], [100] lies in finding the optimal prompt.

1) Zero-shot Prompt: In the zero-shot approach, the model receives no task-specific training data. The problem posed to the model usually consists of a task description, a test question, and the corresponding database, without any examples. [13], [14], [42], [101], [102] examine the effect of prompt structure on the zero-shot performance of large language models. The model can answer questions directly, making initial judgments based on its large-scale pre-training data. However, this method requires both a large-scale pre-trained language model and a considerable amount of pre-training data to ensure high quality. When broader database contents are involved, accuracy can be compromised and the generated SQL can be insufficiently accurate. Nevertheless, zero-shot prompting can be quickly applied to new tasks and domains without spending time on additional training, although its outputs can show some variation or unexpected changes. Prompting techniques for zero-shot text-to-SQL tasks are explored in [103], [20], [104], [105], [14], [106], [107].
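As a minimal illustration of the zero-shot setting of Fig. 3(a), the sketch below serializes a schema and a question into a single prompt; the wording of the instruction and the `complete` placeholder standing in for an arbitrary LLM completion API are assumptions, not a fixed standard.

```python
def serialize_schema(tables: dict[str, list[str]]) -> str:
    # tables maps table name -> column names, e.g. {"Employees": ["ID", "Name", "Salary"]}
    return "\n".join(f"Table {name}({', '.join(cols)})" for name, cols in tables.items())

def zero_shot_prompt(question: str, tables: dict[str, list[str]]) -> str:
    return (
        "Given a natural language question and a database schema, "
        "generate the corresponding SQL query.\n\n"
        f"{serialize_schema(tables)}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = zero_shot_prompt(
    "What is the name of the employee with the highest salary?",
    {"Employees": ["ID", "Name", "Salary"]},
)
# sql = complete(prompt)  # `complete` is a placeholder for any LLM completion API
print(prompt)
```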
text-to-SQL tasks are dedicated to the work of [103] [20] [104] was found that the model’s reasoning performance could be
[105] [14] [106] [107]. significantly improved by adding key sentences to the prompts,
2) Few-shot Prompt: A few-shot is a question posed to such as ”Let’s think step by step”. The experiments show that
the model with only a small number of cases given, and in such a method helps the rationalization of the GPT even when
response, the model generates an answer formulated on the the relevant samples are not provided. Among these methods
basis of the initiated cases, which results in a better compre- are those that provide deep and true insights into the model’s
hension of the task. However, here we consider only the cases reasoning in the case of Text-to-SQL tasks.
of good quality. Unlike Zero-shot, Few-shot can significantly Chat2Query [114] deploys its zero-shot SQL generation
yield improved performance of the model, particularly for capability, which allows users to input natural language queries
those tasks characterized by their complexity [108]. and receive SQL outputs without the need for prior model
SC-prompt [109] employs the divide-and-conquer method training or domain-specific fine-tuning. The system integrates
as a way of tackling the problem of translating the text into two a text-to-SQL generator, SQL rewriter, SQL formatter, and
more manageable phases: the structure stage and the content data-to-chart generator, streamlining the process from query to
stage. The first stage results in the generation of a basic SQL data visualization. Utilizing the Chain-of-Thought prompt en-
structure, which contains, but does not limit itself to, the table ables step-by-step SQL generation, which improves the accu-
and column names as their placeholders. It should initially racy of generated queries, especially in complex or ambiguous
be given a structure where the specific details are expected. situations. The system is built on TiDB Serverless, ensuring
The next step consists of replacing the tokens of general scalability and adaptability to different data workloads.
description with concrete values. For this purpose, it applies ACT-SQL [104] delves into the ability of LLM to solve
the hybrid approach of text pre-processing that links the static some few-shot learning tasks by examining how the few-shot
word embeddings to the learnable vectors. learning strategy affects the model performance. A hybrid
MCS-SQL [16] is the method proposed to enhance the pre- method, based on both the static and the dynamic examples,
cision. Its operation relies on three main activities: schematic has been proposed, for example, a selection that is current to
linking, parallel generation of SQL, and picking one that meets test the sample given, and the experimental results showed the
the purpose. The first stage is schema linking, which later efficiency of the chosen strategy.
will be performed in two stages: the tables joining and the
columns joining. The interaction is repetitive, as the sentence
C. Text-to-SQL with Fine-Tuning
is exchanged in both phases. Afterwards, it links with Schema
and generates more than one candidate SQL query. It is Fine-tuning is still an important method within LLMs and
through multiple prompts, which aid the exploration of a much still offers a high level of improvement over the use of low-
wider parameter space, that one is able to accomplish this task. cost prompt methods. Fine-tuning methods rely on models that
It is the LLM (Language Model) that produces different SQL are already pre-trained but are further fine-tuned to fit specific
queries as output, based on those prompts. The primary goal tasks and domains. Depending on the scope of the fine-tuning,
now is to pick the most precise SQL query that matches the it can be categorized in two ways.
input. LLM takes into its consideration the reasoning steps 1) Full-Parameter Fine-Tuning: Full fine-tuning means
and the scores to choose as the best answer. fine-tuning all the parameters of the model. In this approach,
A range of sampling techniques investigating the inspiration the entire model is trained with domain-specific data to op-
demos [110] has been carried out in a selection of declara- timize its performance in a specific task. Full fine-tuning is
tive programming tasks. It seeks to enhance performance by usually applied in those cases where high accuracy or specific
ensuring that there is an equilibrium between likeness and tasks are required. For example, the Text-to-SQL task on
diversity between demos. The random sampling is then used the Spider dataset requires the model to generate extremely
as a standard to check the effectiveness of the given strategies. accurate SQL, in which case all model parameters need to be
SQL-PaLM [111] likewise selects few-shot examples by considering both their similarity and their diversity with respect to the test question. Several sampling methods and their combinations are compared, with a random sample again taken as the baseline for performance evaluation.
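The following sketch shows one common way to balance similarity and diversity when choosing few-shot demonstrations: a greedy, maximal-marginal-relevance-style selection over embedding vectors. It is a generic illustration of the idea rather than the exact procedure of [110] or [111]; the random vectors stand in for the output of any sentence-embedding model.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_demos(query_vec, pool_vecs, k=5, lam=0.7):
    """Greedily pick k demonstrations, trading similarity to the test question
    (weight lam) against similarity to already chosen demos (weight 1 - lam)."""
    chosen, remaining = [], list(range(len(pool_vecs)))
    while remaining and len(chosen) < k:
        def score(i):
            sim_q = cosine(query_vec, pool_vecs[i])
            sim_chosen = max((cosine(pool_vecs[i], pool_vecs[j]) for j in chosen),
                             default=0.0)
            return lam * sim_q - (1.0 - lam) * sim_chosen
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen   # indices of the demonstrations to place in the prompt

# Random vectors stand in for sentence embeddings of the demo pool and the question.
rng = np.random.default_rng(0)
pool_vecs = rng.normal(size=(20, 64))
query_vec = rng.normal(size=64)
print(select_demos(query_vec, pool_vecs, k=3))
```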
3) Chain of Thought (CoT): Chain of Thought (CoT) [112] prompting elicits more complex reasoning by inserting intermediate, reasoning-based steps before the answer. Its effect can be extended by combining it with zero-shot prompts for complex problem solving, which encourages synthesis before diagnosis. Several Text-to-SQL systems adopt the CoT principle with GPT-style models to strengthen their reasoning [18], [20], [104], [105], [113]: the model is prompted to lay out its intermediate reasoning and explain its computational approach rather than emit only a final answer. Chat2Query [114], for example, uses such prompting to enable step-by-step SQL generation, which improves the accuracy of the generated queries, especially in complex or ambiguous situations; the system is built on TiDB Serverless, ensuring scalability and adaptability to different data workloads.
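A minimal CoT-style prompt for Text-to-SQL could look like the sketch below. The wording of the template and the `build_cot_prompt` helper are illustrative assumptions, not the prompt of any particular system; the completion itself would come from an LLM call.

```python
COT_TEMPLATE = """You are a careful data analyst.

Database schema:
{schema}

Question: {question}

Let's think step by step:
1. Identify the tables and columns the question refers to.
2. Decide which joins, filters, grouping, and ordering are needed.
3. Write the final SQL query.

Reasoning:"""

def build_cot_prompt(schema: str, question: str) -> str:
    """Fill the chain-of-thought template; the model is expected to produce its
    intermediate reasoning first and the SQL query last."""
    return COT_TEMPLATE.format(schema=schema, question=question)

prompt = build_cot_prompt(
    schema="employee(id, name, salary, dept_id); department(id, name)",
    question="Which department has the most employees?",
)
print(prompt)
# The reasoning and SQL completion would come from the LLM, e.g. answer = call_llm(prompt),
# where call_llm is a placeholder for whatever model API is used.
```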
ACT-SQL [104] examines the ability of LLMs to solve few-shot learning tasks by studying how the few-shot learning strategy affects model performance. A hybrid selection method based on both static and dynamic examples, i.e., examples chosen with respect to the current test sample, is proposed, and the experimental results confirm the efficiency of the chosen strategy.

C. Text-to-SQL with Fine-Tuning

Fine-tuning remains an important technique for LLMs and still offers substantial improvement over low-cost prompting methods. Fine-tuning starts from models that are already pre-trained and further adapts them to specific tasks and domains. Depending on the scope of the fine-tuning, it can be categorized in two ways.
1) Full-Parameter Fine-Tuning: Full fine-tuning updates all the parameters of the model: the entire model is trained with domain-specific data to optimize its performance on a specific task. It is usually applied when high accuracy is required for a demanding task. For example, the Text-to-SQL task on the Spider dataset requires the model to generate highly accurate SQL, in which case all model parameters need to be fine-tuned to reach that accuracy.
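As a minimal sketch of what full-parameter fine-tuning looks like in practice, the code below trains every weight of a small causal language model on toy (prompt, SQL) pairs with the Hugging Face Trainer. The model name, the toy data, and the hyperparameters are placeholder assumptions; a real setup would use a code-oriented LLM, Spider/BIRD-style data, and loss masking over the prompt and padding.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy (question, schema, SQL) pairs; real training would use Spider/BIRD-style data.
pairs = [
    {"prompt": "-- schema: employee(id, name, salary)\n-- question: average salary?\nSELECT",
     "completion": " AVG(salary) FROM employee;"},
]

model_name = "gpt2"   # small stand-in backbone; a SQL/code LLM would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_features(example):
    text = example["prompt"] + example["completion"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=256, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()   # causal-LM objective over the whole sequence
    return enc                                # (a real setup would mask prompt and padding)

train_ds = Dataset.from_list(pairs).map(to_features,
                                        remove_columns=["prompt", "completion"])

trainer = Trainer(
    model=model,   # every parameter stays trainable: this is full fine-tuning
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=1,
                           num_train_epochs=1, logging_steps=1),
    train_dataset=train_ds,
)
trainer.train()
```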
Knowledge-to-SQL [17] aims to improve the ability of a Data Expert LLM (DELLM) to produce relevant knowledge quickly, thereby boosting the performance of downstream Text-to-SQL systems. Semantic techniques are adopted to find the tables that best match the query; supervised fine-tuning is performed first, and the model is then refined with the Direct Preference Optimization (DPO) algorithm.
SGU-SQL [18] builds on the structure of the question and of the schema. User queries and databases are first linked through an enriched, graph-based framework, and SGU-SQL decomposes the complicated, interrelated structures with grammar trees. The system also relies on a specially designed structure-based linking mechanism that connects the query structure at the node level. The process begins with composing node representations and then proceeds by message propagation, which lets the model capture the main relationships; finally, a structure-aware merging step combines the schema graph and the query graph, and the merged information is propagated to adjacent nodes.
DIN-SQL [115] breaks the complex task of converting text to SQL into smaller, manageable sub-tasks, which helps LLMs perform better. Its schema linking module first identifies references to the database schema and to condition values in the natural language query. Each query is then classified into one of three classes — easy, non-nested complex, and nested complex — and this classification determines which prompt is used for the query. For complex queries, an intermediate representation called NatSQL is used, which simplifies the transition from natural language to SQL by removing SQL operators that have no clear natural language counterpart. The SQL generation module then produces the final query from the solutions of the sub-tasks, and after the initial query is generated, a self-correction module reviews and repairs any errors. The method achieved state-of-the-art accuracy of 85.3% on the Spider dataset and 55.9% on the BIRD benchmark.
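The routing idea can be sketched as follows: each difficulty class gets its own prompt template, and the classifier's label selects among them. The templates and class names here are illustrative stand-ins, not DIN-SQL's actual prompts, and in the full pipeline the difficulty label is itself predicted by the LLM.

```python
# Prompt templates for three difficulty classes (illustrative wording only).
EASY_TEMPLATE = "Translate the question into a single-table SQL query.\n{schema}\nQ: {question}\nSQL:"
COMPLEX_TEMPLATE = ("First list the sub-questions, then write the SQL that joins the "
                    "needed tables.\n{schema}\nQ: {question}\nSub-questions and SQL:")
NESTED_TEMPLATE = ("Write an intermediate representation for each sub-query, then the "
                   "final nested SQL.\n{schema}\nQ: {question}\nSteps and SQL:")

TEMPLATES = {"easy": EASY_TEMPLATE, "non-nested": COMPLEX_TEMPLATE, "nested": NESTED_TEMPLATE}

def route_prompt(difficulty: str, schema: str, question: str) -> str:
    """Pick the prompt template matching the predicted difficulty class; in a
    full pipeline the difficulty label comes from an LLM classifier."""
    return TEMPLATES.get(difficulty, COMPLEX_TEMPLATE).format(schema=schema,
                                                              question=question)

print(route_prompt("nested",
                   schema="orders(id, user_id, total); users(id, name)",
                   question="Which users spent more than the average order total?"))
```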
MAC-SQL [24] makes SQL query generation more effective by coordinating multiple agents: a core Decomposer agent, a Selector agent, and a Refiner agent. Given a complex question, the Decomposer breaks it into sub-problems through chained reasoning and produces the final SQL query step by step. The Selector restricts the database to the problem-relevant part to remove interfering information, while the Refiner repairs faulty SQL by executing it with external tools. A multi-task supervised fine-tuning strategy is used during the development of SQL-Llama, an open-source model based on Code Llama that is trained on an instruction dataset generated from the MAC-SQL agent tasks.
Few-shot prompting can also be combined with instruction fine-tuning to further raise LLM performance on text-to-SQL. SQL-PaLM [111] proposes consistency decoding and execution-based filtering of erroneous outputs using a few example hints, and investigates instruction fine-tuning that widens the coverage of the training data, applies data augmentation, and integrates query-specific content. It additionally proposes test-time selection that combines output and execution feedback across different paradigms to further boost accuracy.
Symbol-LLM [102] employs a two-stage supervised fine-tuning framework to strengthen the symbolic processing capability of large models. In the first (injection) stage, the underlying model is fine-tuned (SFT) with a maximum-likelihood (MLE) objective on symbolic data so that it acquires the dependencies inherent in that data and can handle symbolic tasks. In the subsequent fusion stage, the model is trained on symbolic and natural language data together, so that the combined data reinforces both capabilities.
2) Parameter-Efficient Fine-Tuning: Parameter-efficient fine-tuning updates only some parameters of the model, usually specific layers or modules. This can markedly reduce training time and computational resource consumption while maintaining high performance. The approach usually targets domain-specific text or structure, preserving the general linguistic knowledge already learned by the pretrained model and optimizing only for the subtle differences of the Text-to-SQL task. For example, when adapting to the complexity of SQL statements or to a particular database schema, only the layers or parameters involved in schema understanding need to be retrained, which saves training cost. The approach therefore reduces training overhead while striking a good compromise between effectiveness and efficiency.
[116] proposes to accelerate LLM decoding with speculative sampling. The technique does not modify the target model; instead, a cheaper draft model proposes candidate token sequences, and a rejection-sampling step guarantees that the output distribution of the target model remains unchanged.
Similarly, in the inference procedure of [117], some computational steps are approximated by smaller, more efficient models: the approximation model generates many candidate tokens, and the trained target model verifies them in parallel, keeping those that fall within its own distribution. Several tokens can thus be accepted per target-model pass, which shortens the decoding process.
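The sketch below captures the control flow of this idea in a simplified, greedy form: a cheap draft model proposes a block of tokens, and the target model keeps the longest agreeing prefix plus one correction. The toy integer-token "models" are assumptions made only to keep the example runnable; the actual methods of [116], [117] use a rejection-sampling step so that the sampled distribution matches the target model exactly.

```python
from typing import Callable, List

def speculative_decode(draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       prompt: List[int], max_new: int = 32, block: int = 4) -> List[int]:
    """Greedy sketch of speculative decoding: the draft model proposes `block`
    tokens, the target model verifies them left to right and keeps the longest
    agreeing prefix, then contributes one corrected (or bonus) token itself."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        proposal, ctx = [], list(out)
        for _ in range(block):                 # cheap draft proposals
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        for tok in proposal:                   # verification (parallelisable in practice)
            expected = target_next(out)
            if expected == tok:
                out.append(tok)                # accepted draft token
            else:
                out.append(expected)           # target model's correction
                break
        else:
            out.append(target_next(out))       # all accepted: one bonus token
    return out

# Toy "models" over integer tokens: the draft mostly agrees with the target.
target = lambda seq: (seq[-1] + 1) % 50
draft = lambda seq: (seq[-1] + 1) % 50 if seq[-1] % 7 else 0
print(speculative_decode(draft, target, prompt=[1], max_new=10))
```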
[19] reviews techniques for improving model robustness to noisy data, such as fine-tuning carefully selected pre-trained models and applying robust regularization. Fine-tuning a model that was pre-trained on clean data and then adapting it on noisy data makes it less likely to overfit the noisy labels; adding explicit regularization on top of such pre-training improves noise robustness further, for instance in the case of PHuber.
DAIL-SQL [20] explores the effect of supervised fine-tuning (SFT): LLMs are fine-tuned for specific Text-to-SQL tasks with task-specific training data, and the study systematically compares the question and schema representation strategies used for SFT, offering insights into how each strategy affects its effectiveness and efficiency.
The survey in [118] covers several ways of employing and customizing pre-trained models. For example, with the pre-trained language model BERT [119], natural language questions and database schemas are encoded together so as to obtain deeper representations of their syntactic and semantic structure.
TABLE II
TAXONOMY OF TEXT-TO-SQL METHODS.

| Method | Released | Backbone | Access | Optimization Strategy | Query Generation Strategy | Robustness & Error Handling | Datasets | Metrics | Schema Linking |
| SC-prompt | Jun-23 | T5 | ✓ | Task Decomposition | Guided Decoding | - | Spider, CoSQL, GenQuery | EM, EX | ✗ |
| MCS-SQL | May-24 | GPT-4 | ✗ | Prompt Tuning | Guided Decoding | Self-Consistency | Spider, Bird | EX, VES | ✓ |
| SQL-PaLM | May-23 | PaLM-2 | ✗ | Prompt Tuning | Consistency Decoding | Self-Correction, Self-Debugging | Spider, Spider-SYN, Spider-DK, Spider-Realistic, Bird | EX, TS | ✓ |
| ACT-SQL | Oct-23 | - | - | Chain of Thought (CoT) | Greedy Search | Self-Correction | Spider, SParC, CoSQL | EM, EX, TS | ✓ |
| Chat2Query | May-24 | - | - | Chain of Thought (CoT) | - | - | Spider | EM, EX, VES | ✗ |
| Knowledge-to-SQL | Feb-24 | LLaMA-2-13b | ✓ | Expert Fine-Tuning | Framework-Based | - | Spider, Bird | EX, VES | ✗ |
| SGU-SQL | Feb-24 | GPT-4 | ✗ | Expert Fine-Tuning | Guided Decoding | - | Spider, Bird | EX, EM | ✓ |
| DIN-SQL | Apr-23 | GPT-4 | ✗ | Task Decomposition | Greedy Search | Self-Correction | Spider, Bird | EX, EM | ✓ |
| MAC-SQL | Dec-23 | GPT-4 | ✗ | Task Decomposition | Greedy Search | Refiner | Bird | EX, EM, VES | ✓ |
| Symbol-LLM | Nov-23 | LLaMA-2-Chat | ✓ | Expert Fine-Tuning | Greedy Search | - | Spider, SParC, CoSQL | EM | ✗ |
| DAIL-SQL | Aug-23 | GPT-4 | ✗ | Supervised Fine-Tuning | Greedy Search | Self-Consistency | Spider, Spider-Realistic | EX, EM | ✗ |
| CLLMs | Feb-24 | Deepseek-coder-7B-instruct | ✓ | Performance & Efficiency Enhancements | Greedy Search | - | Spider | EX | ✗ |
| StructLM | Feb-24 | CodeLlama-Instruct | ✓ | Expert Fine-Tuning | - | - | Bird, CoSQL, SParC | EX, EM | ✓ |
| SQL-GEN | Aug-24 | MoE | - | Expert Fine-Tuning | - | - | Bird | EX | ✓ |
| CodeS | Feb-24 | StarCoder | ✓ | - | Beam Search | Execution-Guided SQL Selector | Spider, Bird | EX, TS | ✓ |
| RESDSQL | Feb-23 | T5 | ✓ | Skeleton Parsing | Beam Search | Execution-Guided SQL Selector | Spider-DK, Spider-Syn, Spider-Realistic | EM, EX | ✓ |
| Tool-SQL | Aug-24 | GPT-4 | ✗ | Query Error Handling | Python Interpreter | - | Spider, Spider-Realistic | EX, EM | ✓ |
| SQLFixAgent | Jun-24 | GPT-3.5-turbo | ✗ | Query Error Handling | Perturbation-Based Query Generation | Refiner | Spider, Bird, Spider-SYN, Spider-DK, Spider-Realistic | EX, EM, VES | ✓ |
| MAG-SQL | Aug-24 | - | ✗ | Query Error Handling | - | Refiner | Spider, Bird | EX, VES | ✓ |
| MAGIC | Jun-24 | GPT-4 | ✗ | Expert Fine-Tuning | - | Self-Correction | Spider, Bird | EX, VES | ✗ |
| SuperSQL | Jun-24 | GPT-4 | ✗ | - | Greedy Search | Self-Consistency | Spider, Bird | EX, EM | ✓ |
Moreover, the pre-trained embeddings can be customized with adaptive fine-tuning so that they respond to the schema of the target database or to the specific language problem. For instance, SQLova [120] and X-SQL [121] apply additional post-training and semi-supervised learning to improve performance, while models such as TaBERT [122] and Grappa [123] gain accuracy and robustness by pre-training on tabular data and then fine-tuning on the Text-to-SQL task.
CLLMs [124] are additionally optimized by updating with two losses: a consistency loss and an autoregressive (AR) loss. The consistency loss pushes the model to settle on the same final state from any point of the input, which increases the rate of convergence; by learning the Jacobi trajectories of the model, it can generate multiple tokens in a single iteration step and therefore needs fewer computations to produce its output.
StructLM [125] optimizes pretrained models through instruction fine-tuning, combining structured data with generic instruction-tuning data to enhance the model's generalization on structured knowledge tasks. The paper also explores fine-tuning on top of code pre-training and finds that code pre-training has a significant positive effect on processing structured knowledge tasks.
[126], [127] use LoRA [128] and QLoRA [129] fine-tuning to reduce memory requirements and adapt models to SQL generation tasks, and provide a standardized set of evaluation pipelines. LoRA freezes the weights of the pre-trained model and injects trainable low-rank layers into each Transformer block, enabling efficient fine-tuning with a reduced memory footprint.
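A minimal sketch of LoRA-style parameter-efficient fine-tuning with the `peft` library is shown below, assuming a GPT-2 stand-in backbone and its `c_attn` attention projection as the injected module; the rank, scaling, and dropout values are illustrative choices, not settings prescribed by the cited works.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in backbone

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                        # rank of the low-rank update matrices
    lora_alpha=32,               # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projection inside each GPT-2 block
)

model = get_peft_model(base, lora_cfg)   # base weights frozen, adapters trainable
model.print_trainable_parameters()       # typically well under 1% of all weights
# `model` can now be passed to the same Trainer loop used for full fine-tuning.
```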
[130] investigates the task generalization of language models and finds that expert language models trained on a single task can outperform multitask instruction-tuned (MT) language models on unseen tasks: the expert models avoid negative task transfer and catastrophic forgetting and learn new tasks well compared with the MT model. Through independent training and a Retrieval of Experts (RoE) mechanism, the study shows how appropriate expert models can be selected in multi-task scenarios. It relies on parameter-efficient fine-tuning, which reduces training cost and improves efficiency by freezing the underlying pretrained language model and training only the added adapters.

D. Text-to-SQL with Task-Training

This strategy fine-tunes or trains complete models from scratch for the specifics of the task. Differing from the two approaches above, fully pre-trained models range from off-the-shelf pre-trained language models that are selectively specialized for SQL generation to models built entirely for the task; typical examples include code-oriented language models such as CodeS, Mixture of Experts (MoE) models, and other Transformer-based models [131].
1) Mixture of Experts Models: The MoE architecture introduces several expert modules, each responsible for a specific type of task or input. For text-to-SQL, an MoE model lets different experts handle different parts of the task, such as natural language understanding, database schema parsing, and SQL generation, which makes the overall system easier to learn.
SQL-GEN [23] uses LLMs to expand existing SQL templates into tutorial-driven synthetic training data for different SQL dialects, which markedly boosts performance on the BigQuery and PostgreSQL dialects. The paper further recommends a Mixture of Experts (MoE) approach that merges the dialect-specific models into a single system with the help of dialect keyword injection, improving performance further.
2) Transformer-based Models: The Transformer has been the mainstream deep learning architecture of recent years and is particularly well suited to generating complex SQL queries, since the attention mechanism models long-distance dependencies. Fully trained Transformer models are built from scratch on large amounts of text and SQL data and handle cross-domain and complex SQL generation well, showing strong generalization in database query scenarios, especially in cross-domain and multi-language environments.
CodeS [21] is designed specifically for Text-to-SQL. The paper points out that although existing large language models (e.g., GPT-4, ChatGPT) generate SQL well, their closed-source nature raises data privacy risks and inference costs. CodeS is therefore offered as an open-source alternative that achieves effective and accurate SQL generation with far fewer parameters (1B to 15B), building on a SQL-oriented pre-trained architecture that is further fine-tuned for SQL generation. It filters the database prompt down to the relevant tables, columns, and values with retrieval techniques such as BM25, and for adaptation to new domains it automatically generates a large number of (question, SQL) pairs through bi-directional data augmentation.
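A lightweight schema filter in the spirit of such retrieval-based pruning can be sketched with BM25, as below. The flattened schema items and the `rank_bm25` usage are illustrative assumptions, not CodeS's actual implementation.

```python
from rank_bm25 import BM25Okapi

# Flatten the schema into one text snippet per column.
schema_items = [
    "employee id", "employee name", "employee salary", "employee dept_id",
    "department id", "department name", "department budget",
]
bm25 = BM25Okapi([item.split() for item in schema_items])

def filter_schema(question: str, top_k: int = 4) -> list:
    """Keep only the schema items most relevant to the question, so the prompt
    given to the SQL generator stays short and focused."""
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(schema_items, scores), key=lambda p: p[1], reverse=True)
    return [item for item, _ in ranked[:top_k]]

print(filter_schema("What is the total salary paid by each department?"))
```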
MIGA [22] builds on PLMs such as T5 and is highly effective at Text-to-SQL conversion: it breaks the main goal into smaller, interconnected sub-goals and exploits the relationships learned during pre-training so that Text-to-SQL can be generated in a Seq2Seq format. Three sub-tasks — Related Schema Prediction (RSP), Turn Switch Prediction (TWP), and Final Utterance Prediction (FUP) — expose the model to specific aspects of SQL generation and improve its accuracy. MIGA moreover introduces four types of SQL perturbations that reduce the dependence on previously generated SQL and alleviate error propagation, so the model processes diverse conversational inputs more robustly. During training, the model is guided by task-specific prompts in the T5 style, where a task prompt is combined with each training sample of the given target task.
SQLova [120] shows significant gains in both execution accuracy and logical-form accuracy. It uses BERT to embed the question together with the table headers and generates the SQL query with multi-layer LSTMs, so the table context is taken into account; the combination of table-aware BERT encoding with execution-guided decoding (EG) achieves excellent results.
RESDSQL [132] improves Text-to-SQL by decoupling schema linking from skeleton parsing: the former identifies the relevant database tables and columns, while the latter establishes the structure of the SQL query, which simplifies the construction of accurate SQL. The model features an enhanced, ranking-based encoder that prioritizes the most relevant schema items.
[133] introduces two approaches to improve the generalization of pre-trained-language-model-based Text-to-SQL semantic parsers. Token preprocessing helps the tokenizer produce semantically meaningful tokens by normalizing the naming conventions of database schemas and SQL queries: converting snake_case names into natural phrasing, handling dot-notation column references, and expanding the spelling of some SQL keywords. Component boundary marking inserts special markers at aligned component boundaries between the natural language input and the SQL output, helping the model identify semantic components and generalize compositionally.
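The token-preprocessing idea can be illustrated with a small sketch that naturalizes snake_case and dotted identifiers and expands a few SQL keywords; the keyword map and helper names are assumptions for illustration, not the exact rules of [133].

```python
import re

SQL_KEYWORD_SPELLING = {"avg": "average", "desc": "descending", "asc": "ascending"}

def naturalize(identifier: str) -> str:
    """Turn snake_case / dotted schema names into space-separated words that a
    subword tokenizer can split into meaningful tokens."""
    table, _, column = identifier.partition(".")
    words = []
    for part in filter(None, [table, column]):
        words.extend(re.split(r"[_\s]+", part))
    return " ".join(w.lower() for w in words)

def preprocess_sql(sql: str) -> str:
    """Expand terse SQL keywords so their surface form matches natural language."""
    return " ".join(SQL_KEYWORD_SPELLING.get(tok.lower(), tok) for tok in sql.split())

print(naturalize("department_store.store_name"))   # -> "department store store name"
print(preprocess_sql("SELECT avg ( salary ) FROM staff ORDER BY age DESC"))
```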
[134] evaluates the data model robustness of Text-to-SQL systems under real user queries, comparing two representative approaches. ValueNet is a BART-based model that uses an intermediate representation (IR) to transform natural language into SQL, reusing the architecture and mechanisms of IRNet; in particular, schema linking connects the entities in a natural language question to the tables, columns, and data values of the database. T5-Picard is a variant of the T5 model that incorporates the PICARD method [135], an incremental parsing technique that constrains the decoded output of the language model to valid SQL statements.
[136] introduces multi-task training in which different Text-to-SQL tasks (CoSQL, Spider, SParC) are combined and discrete task-specific prompts are used. Two re-ranking methods are employed: a Query Plan (QP) model, which generates SQL queries together with predictions of whether the query should contain specific SQL clauses, and Schema Linking (SL), which performs schema linking over the dialogue context so that the column and table names in the SQL query stay consistent with the natural language input.
[137] enhances prompts primarily through de-semanticization and skeleton retrieval. Galois [138] extracts structured data from LLMs by decomposing SQL queries into multiple steps and converting each step into a natural language prompt; with this technique, users can query with SQL commands not only the information contained in LLMs but also traditional databases.

E. Text-to-SQL with LLM Agent

Agent-based Text-to-SQL systems have become a cutting-edge solution for complex SQL query generation. Through the collaboration of multiple agents, an LLM agent framework not only generates SQL queries automatically but also dynamically adapts and corrects SQL statements, handles database-matching issues, and improves query accuracy and execution with external tools. The multi-agent collaboration mechanism gives these systems the flexibility and adaptability to decompose complex tasks, detect and repair errors, and optimize query conditions. This section discusses several LLM agent systems that markedly improve the performance and reliability of Text-to-SQL through stepwise reasoning, external supervision, and query optimization.
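A generic generate–execute–refine loop of this kind can be sketched as follows; the `generate` callable stands in for any LLM-backed agent, a SQLite database acts as the external tool, and the stub and database path in the comments are hypothetical. This is an illustration of the general agent pattern, not the implementation of any specific system surveyed here.

```python
import sqlite3

def try_execute(db_path: str, sql: str):
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

def agent_loop(generate, db_path: str, question: str, schema: str, max_rounds: int = 3):
    """Generate-execute-refine loop: the LLM (the `generate` callable) drafts SQL,
    the database acts as the external tool, and any error message is fed back
    into the next round as a repair hint."""
    feedback = ""
    for _ in range(max_rounds):
        sql = generate(question=question, schema=schema, feedback=feedback)
        rows, error = try_execute(db_path, sql)
        if error is None:
            return sql, rows
        feedback = f"The previous query failed with: {error}. Please fix it."
    return sql, None   # give up after max_rounds attempts

# `generate` would wrap an LLM call; a trivial stub keeps the sketch self-contained.
stub = lambda **kw: "SELECT name FROM sqlite_master WHERE type = 'table';"
# agent_loop(stub, "company.db", "List all tables.", "sqlite_master(name, type)")
```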
MAC-SQL [24], introduced above, is itself such a multi-agent framework: the Decomposer agent breaks a complex question into sub-problems through chained reasoning and builds the final SQL query step by step, the Selector agent restricts the database to the problem-relevant part to remove interference, and the Refiner agent repairs faulty SQL by executing it with external tools. The accompanying multi-task fine-tuning strategy yields SQL-Llama, an open-source model based on Code Llama that is trained on instruction data generated from the MAC-SQL agent tasks.
and more cost-effective and generation smarter and more cost-
Tool-SQL [25] proposes a helper-agent framework that
effective SQL query generation that shows stronger gener-
equips LLM-based agents with two specialized tools, a re-
alization ability in cross-domain and cross-language applica-
triever and a detector. They are responsible for diagnosing and
tion scenarios. Continuous optimization and innovation in the
correcting SQL queries that suffer from database mismatch
Text-to-SQL field imply that more efficient, high-performance
problems. The Retriever helps LLM-based agents verify the
decision-making and data information retrieval procedures via
correctness of SQL conditional clauses by aligning the values
formal language will be realized, marked by further unprece-
in the SQL query to the corresponding cells in the database.
dented developments in the relevant fields.
SQLFixAgent [139] adopts a multi-agent collaborative ap-
proach and consists of three main agents, first SQLRefiner is
responsible for generating the core agent for the final repaired
SQL query. Secondly, SQLReviewer detects syntactic and
semantic errors in SQL queries. Generate multiple candidate
SQL statements using a fine-tuned SQLTool.
MAG-SQL [140] consists of four parts. It first performs soft column selection over the database schema to filter out redundant information; the next module decomposes the question into a series of smaller problems and feeds them to an iterative generation module, which produces a sub-SQL for each problem; a refiner then executes these queries with external tools to fix any incorrect SQL. External supervision is applied throughout the process to keep the generated queries consistent with the expected ones.
MAGIC [141] proposes a novel agent approach that automatically creates self-correction guidelines by iterating over the incorrect queries produced in the text-to-SQL process, providing an automated, data-driven way to mimic the correction process of human experts.
Distyl AI's Analytics Insight Engine [142] generates complex queries in arbitrary environments and can improve the generated results through user feedback and runtime dynamics. It also performs external knowledge retrieval: the user's question is paraphrased into an authoritative form that captures the user's intent, and relevant examples are then retrieved on the basis of that intent.
SuperSQL [143] combines components from several architectures. In the preprocessing phase it uses the schema linking of RESDSQL and the database content handling of BRIDGE v2 to strengthen the connection to the underlying database; it then applies DAIL-SQL's few-shot prompt engineering module to select contextual examples by similarity, uses self-consistency to ensure the reliability of the generated content, and finally generates the SQL query with a greedy decoding strategy.

V. CONCLUSION

In this paper, we thoroughly review and analyze the implications of large language models (LLMs) for the text-to-SQL task and point to further research from several perspectives. A first contribution is to classify the application of LLMs to text-to-SQL into two broad groups: prompt-engineering approaches on the one hand and fine-tuning-based approaches on the other. We examine in depth important aspects such as the general structure of prompts, ways of supplementing knowledge, example selection, and reasoning.
We look forward to future research that extends these methods toward smarter and more cost-effective SQL query generation with stronger generalization in cross-domain and cross-language application scenarios. Continued optimization and innovation in the Text-to-SQL field should lead to more efficient, high-performance decision-making and data retrieval through formal language, accompanied by further developments in the related fields.
R EFERENCES [20] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, and J. Zhou, “Text-
to-sql empowered by large language models: A benchmark evaluation,”
[1] X. Xu, “Research prospect: data factor of production,” Journal of 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308.15363
Internet and Digital Economics, vol. 1, no. 1, pp. 64–71, 2021. [21] H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zhu, R. Wei, H. Pan,
[2] M.-R. Zhong, J.-Y. Fu, and H. Zou, “The data as a production factor: C. Li, and H. Chen, “CodeS: Towards Building Open-source Language
nonlinear effects of factor efficiency on haze pollution,” Environment, Models for Text-to-SQL,” in Proceedings of the ACM SIGMOD
Development and Sustainability, 2023. [Online]. Available: https: International Conference on Management of Data, 2024. [Online].
//doi.org/10.1007/s10668-023-04264-z Available: https://doi.org/10.48550/arXiv.2402.16347
[3] Y. Carriere-Swallow and V. Haksar, “The economics and implications [22] Y. Fu, W. Ou, Z. Yu, and Y. Lin, “MIGA: A Unified Multi-
of data: An integrated perspective,” Departmental Papers, vol. 2019, task Generation Framework for Conversational Text-to-SQL,” in
no. 013, p. A001, 2019. [Online]. Available: https://www.elibrary.imf. Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
org/view/journals/087/2019/013/article-A001-en.xml [Online]. Available: https://doi.org/10.48550/arXiv.2212.09278
[4] P. Qi, D. Sun, C. Xu, Q. Li, and Q. Wang, “Can data [23] M. Pourreza, R. Sun, H. Li, L. Miculicich, T. Pfister, and
elements promote the high-quality development of china’s economy?” S. O. Arik, “Sql-gen: Bridging the dialect gap for text-to-sql
Sustainability, vol. 15, no. 9, 2023. [Online]. Available: https: via synthetic data and model merging,” 2024. [Online]. Available:
//www.mdpi.com/2071-1050/15/9/7287 https://arxiv.org/abs/2408.12733
[5] Q. Liu, B. Chen, J. Guo, J.-G. Lou, B. Zhou, and D. Zhang, [24] B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan,
“How far are we from effective context modeling? an exploratory Q.-W. Zhang, D. Yin, X. Sun, and Z. Li, “Mac-sql: A multi-agent
study on semantic parsing in context,” 2020. [Online]. Available: collaborative framework for text-to-sql,” 2024. [Online]. Available:
https://arxiv.org/abs/2002.00652 https://arxiv.org/abs/2312.11242
[6] W. Woods, R. Kaplan, and B. Webber, “The lunar sciences natural [25] Z. Wang, R. Zhang, Z. Nie, and J. Kim, “Tool-assisted agent on
language information system,” Jul 1972. sql inspection and refinement in real-world scenarios,” arXiv preprint
[7] F. Li and H. V. Jagadish, “Constructing an interactive natural language arXiv:2408.16991, 2024.
interface for relational databases,” Proceedings of the VLDB Endow- [26] A. See, P. J. Liu, and C. D. Manning, “Get to the point:
ment, vol. 8, no. 1, pp. 73–84, 2014. Summarization with pointer-generator networks,” 2017. [Online].
[8] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Available: https://arxiv.org/abs/1704.04368
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 11 1997. [Online]. [27] G. Katsogiannis-Meimarakis and G. Koutrika, “A survey on
Available: https://doi.org/10.1162/neco.1997.9.8.1735 deep learning approaches for text-to-sql,” The VLDB Journal,
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. vol. 32, no. 4, p. 905–936, jan 2023. [Online]. Available:
Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” https://doi.org/10.1007/s00778-022-00776-8
in Advances in Neural Information Processing Systems, I. Guyon, [28] Z. Hong and J. Liu, “Towards better question generation in qa-based
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, event extraction,” 2024. [Online]. Available: https://arxiv.org/abs/2405.
and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. 10517
[Online]. Available: https://proceedings.neurips.cc/paper files/paper/ [29] C. Zheng, L. Li, Q. Dong, Y. Fan, Z. Wu, J. Xu, and B. Chang, “Can
2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf we edit factual knowledge by in-context learning?” 2023. [Online].
[10] Z. Hong, Z. Yuan, Q. Zhang, H. Chen, J. Dong, F. Huang, and Available: https://arxiv.org/abs/2305.12740
X. Huang, “Next-generation database interfaces: A survey of llm-based [30] C. Li, Y. Wang, Z. Wu, Z. Yu, F. Zhao, S. Huang, and X. Dai,
text-to-sql,” 2024. [Online]. Available: https://arxiv.org/abs/2406.08426 “MultiSQL: A schema-integrated context-dependent Text2SQL dataset
[11] OpenAI, “Gpt-4 technical report,” ArXiv, vol. abs/2303.08774, 2023. with diverse SQL operations,” in Findings of the Association for
[Online]. Available: https://arxiv.org/abs/2303.08774 Computational Linguistics ACL 2024, L.-W. Ku, A. Martins, and
[12] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, V. Srikumar, Eds. Bangkok, Thailand and virtual meeting: Association
I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev, “Spider: A for Computational Linguistics, Aug. 2024, pp. 13 857–13 867. [Online].
large-scale human-labeled dataset for complex and cross-domain Available: https://aclanthology.org/2024.findings-acl.823
semantic parsing and text-to-SQL task,” in Proceedings of the 2018 [31] J. Li, Z. Chen, L. Chen, Z. Zhu, H. Li, R. Cao, and K. Yu, “Dir:
Conference on Empirical Methods in Natural Language Processing, A large-scale dialogue rewrite dataset for cross-domain conversational
E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds. Brussels, text-to-sql,” Applied Sciences, vol. 13, no. 4, 2023. [Online]. Available:
Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, https://www.mdpi.com/2076-3417/13/4/2262
pp. 3911–3921. [Online]. Available: https://aclanthology.org/D18-1425 [32] P. J. Price, “Evaluation of spoken language systems: the ATIS domain,”
[13] N. Rajkumar, R. Li, and D. Bahdanau, “Evaluating the text-to-sql in Speech and Natural Language: Proceedings of a Workshop Held
capabilities of large language models,” ArXiv, vol. abs/2204.00498, at Hidden Valley, Pennsylvania, June 24-27,1990, 1990. [Online].
2022. [Online]. Available: https://api.semanticscholar.org/CorpusID: Available: https://aclanthology.org/H90-1020
247922681 [33] J. M. Zelle and R. J. Mooney, “Learning to parse database queries using
[14] A. Liu, X. Hu, L. Wen, and P. S. Yu, “A comprehensive evaluation of inductive logic programming,” in Proceedings of the Thirteenth Na-
chatgpt’s zero-shot text-to-sql capability,” 2023. [Online]. Available: tional Conference on Artificial Intelligence - Volume 2, ser. AAAI’96.
https://arxiv.org/abs/2303.13547 AAAI Press, 1996, p. 1050–1055.
[15] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, [34] S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, “Learning a neural semantic parser from user feedback,” 2017.
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. [Online]. Available: https://arxiv.org/abs/1704.08760
Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, [35] C. Finegan-Dollak, J. K. Kummerfeld, L. Zhang, K. Ramanathan,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, S. Sadasivam, R. Zhang, and D. Radev, “Improving text-to-SQL
I. Sutskever, and D. Amodei, “Language models are few-shot evaluation methodology,” in Proceedings of the 56th Annual Meeting
learners,” 2020. [Online]. Available: https://arxiv.org/abs/2005.14165 of the Association for Computational Linguistics (Volume 1: Long
[16] D. Lee, C. Park, J. Kim, and H. Park, “Mcs-sql: Leveraging multiple Papers), I. Gurevych and Y. Miyao, Eds. Melbourne, Australia:
prompts and multiple-choice selection for text-to-sql generation,” Association for Computational Linguistics, Jul. 2018, pp. 351–360.
2024. [Online]. Available: https://doi.org/10.48550/arXiv.2405.07467 [Online]. Available: https://aclanthology.org/P18-1033
[17] Z. Hong, Z. Yuan, H. Chen, Q. Zhang, F. Huang, and X. Huang, [36] G. Lee, H. Hwang, S. Bae, Y. Kwon, W. Shin, S. Yang,
“Knowledge-to-SQL: Enhancing SQL Generation with Data Expert M. Seo, J.-Y. Kim, and E. Choi, “Ehrsql: A practical text-to-sql
LLM,” in Findings of the Association for Computational Linguistics: benchmark for electronic health records,” 2023. [Online]. Available:
ACL 2024, 2024. [Online]. Available: https://doi.org/10.48550/arXiv. https://arxiv.org/abs/2301.07695
2402.11517 [37] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured
[18] Q. Zhang, J. Dong, H. Chen, W. Li, F. Huang, and X. Huang, queries from natural language using reinforcement learning,” 2017.
“Structure guided large language model for sql generation,” mar 2024. [38] Y. Gan, X. Chen, Q. Huang, M. Purver, J. R. Woodward, J. Xie, and
[Online]. Available: https://doi.org/10.48550/arXiv.2402.13284 P. Huang, “Towards robustness of text-to-sql models against synonym
[19] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from substitution,” 2021.
noisy labels with deep neural networks: A survey,” 2022. [Online]. [39] Y. Gan, X. Chen, and M. Purver, “Exploring underexplored limitations
Available: https://arxiv.org/abs/2007.08199 of cross-domain text-to-sql generalization,” 2021.
16

[40] Y. Gan, X. Chen, Q. Huang, and M. Purver, “Measuring and improving Meeting of the Association for Computational Linguistics. Florence,
compositional generalization in text-to-sql via component alignment,” Italy: Association for Computational Linguistics, 2019.
2022. [53] J. Guo, Z. Si, Y. Wang, Q. Liu, M. Fan, J.-G. Lou, Z. Yang,
[41] R. Patil, M. Patwardhan, S. Karande, L. Vig, and G. Shroff, “Exploring and T. Liu, “Chase: A large-scale and pragmatic Chinese dataset
dimensions of generalizability and few-shot transfer for text-to-sql for cross-database context-dependent text-to-SQL,” in Proceedings
semantic parsing,” in Proceedings of The 1st Transfer Learning for of the 59th Annual Meeting of the Association for Computational
Natural Language Processing Workshop, ser. PMLR, vol. 203, 2022, Linguistics and the 11th International Joint Conference on Natural
pp. 103–114. Language Processing (Volume 1: Long Papers), C. Zong, F. Xia,
[42] B. Zhang, Y. Ye, G. Du, X. Hu, Z. Li, S. Yang, C. H. Liu, R. Zhao, W. Li, and R. Navigli, Eds. Online: Association for Computational
Z. Li, and H. Mao, “Benchmarking the text-to-sql capability of Linguistics, Aug. 2021, pp. 2316–2331. [Online]. Available: https:
large language models: A comprehensive evaluation,” 2024. [Online]. //aclanthology.org/2021.acl-long.180
Available: https://doi.org/10.48550/arXiv.2403.02951 [54] T. Shi, C. Zhao, J. Boyd-Graber, H. Daumé III, and L. Lee, “On
[43] P. Shaw, M.-W. Chang, P. Pasupat, and K. Toutanova, “Compositional the potential of lexico-logical alignments for semantic parsing to SQL
generalization and natural language variation: Can a semantic parsing queries,” in Findings of EMNLP, 2020.
approach handle both?” in Proceedings of the 59th Annual Meeting [55] M. Poess and C. Floyd, “New tpc benchmarks for decision support
of the Association for Computational Linguistics and the 11th and web commerce,” SIGMOD Rec., vol. 29, no. 4, p. 64–71, Dec.
International Joint Conference on Natural Language Processing 2000. [Online]. Available: https://doi.org/10.1145/369275.369291
(Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds. [56] L. Ma, K. Pu, and Y. Zhu, “Evaluating llms for text-to-sql generation
Online: Association for Computational Linguistics, Aug. 2021, pp. with complex sql workload,” arXiv preprint arXiv:2407.19517, 2024.
922–938. [Online]. Available: https://aclanthology.org/2021.acl-long.75 [57] V. Zhong, C. Xiong, and R. Socher, “Seq2SQL: Generating structured
[44] C.-H. Lee, O. Polozov, and M. Richardson, “KaggleDBQA: Realistic queries from natural language using reinforcement learning,” 2018.
evaluation of text-to-SQL parsers,” in Proceedings of the 59th [Online]. Available: https://openreview.net/forum?id=Syx6bz-Ab
Annual Meeting of the Association for Computational Linguistics [58] Y. Gan, X. Chen, and M. Purver, “Exploring underexplored limitations
and the 11th International Joint Conference on Natural Language of cross-domain text-to-sql generalization,” 2021. [Online]. Available:
Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, https://arxiv.org/abs/2109.05157
and R. Navigli, Eds. Online: Association for Computational [59] Y. Gan, X. Chen, Q. Huang, and M. Purver, “Measuring and
Linguistics, Aug. 2021, pp. 2261–2273. [Online]. Available: https: improving compositional generalization in text-to-sql via component
//aclanthology.org/2021.acl-long.176 alignment,” 2022. [Online]. Available: https://arxiv.org/abs/2205.02054
[45] N. Wretblad and F. G. Riseby, “Bridging language & data: [60] Y. Gan, X. Chen, Q. Huang, M. Purver, J. R. Woodward,
Optimizing text-to-sql generation in large language models,” Ph.D. J. Xie, and P. Huang, “Towards robustness of text-to-sql models
dissertation, Linköping University, 2024, dissertation. [Online]. against synonym substitution,” 2021. [Online]. Available: https:
Available: https://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-200605 //arxiv.org/abs/2106.01065
[46] L. Wang, A. Zhang, K. Wu, K. Sun, Z. Li, H. Wu, M. Zhang, [61] X. Deng, A. H. Awadallah, C. Meek, O. Polozov, H. Sun,
and H. Wang, “DuSQL: A large-scale and pragmatic Chinese and M. Richardson, “Structure-grounded pretraining for text-to-
text-to-SQL dataset,” in Proceedings of the 2020 Conference on sql,” in Proceedings of the 2021 Conference of the North
Empirical Methods in Natural Language Processing (EMNLP), American Chapter of the Association for Computational Linguistics:
B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association Human Language Technologies. Association for Computational
for Computational Linguistics, Nov. 2020, pp. 6923–6935. [Online]. Linguistics, 2021. [Online]. Available: http://dx.doi.org/10.18653/v1/
Available: https://aclanthology.org/2020.emnlp-main.562 2021.naacl-main.105
[47] P. B. Chen, F. Wenz, Y. Zhang, M. Kayali, N. Tatbul, M. Cafarella, [62] A. Kumar, P. Nagarkar, P. Nalhe, and S. Vijayakumar, “Deep learning
Çağatay Demiralp, and M. Stonebraker, “Beaver: An enterprise driven natural languages text to sql query conversion: A survey,”
benchmark for text-to-sql,” 2024. [Online]. Available: https://arxiv.org/ 2022. [Online]. Available: https://arxiv.org/abs/2208.04415
abs/2409.02038 [63] G. Katsogiannis-Meimarakis and G. Koutrika, “A deep dive into
[48] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. deep learning approaches for text-to-sql systems,” in Proceedings
Tan, T. Shi, Z. Li, Y. Jiang, M. Yasunaga, S. Shim, T. Chen, of the 2021 International Conference on Management of Data,
A. Fabbri, Z. Li, L. Chen, Y. Zhang, S. Dixit, V. Zhang, C. Xiong, ser. SIGMOD ’21. New York, NY, USA: Association for
R. Socher, W. Lasecki, and D. Radev, “CoSQL: A conversational text- Computing Machinery, 2021, p. 2846–2851. [Online]. Available:
to-SQL challenge towards cross-domain natural language interfaces https://doi.org/10.1145/3448016.3457543
to databases,” in Proceedings of the 2019 Conference on Empirical [64] M. Schuster and K. Paliwal, “Bidirectional recurrent neural networks,”
Methods in Natural Language Processing and the 9th International IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2681, 1997.
K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Hong Kong, China: [65] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Association for Computational Linguistics, Nov. 2019, pp. 1962–1979. Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning
[Online]. Available: https://aclanthology.org/D19-1204 with a unified text-to-text transformer,” 2023. [Online]. Available:
[49] X. Pi, B. Wang, Y. Gao, J. Guo, Z. Li, and J.-G. Lou, “Towards https://arxiv.org/abs/1910.10683
robustness of text-to-SQL models against natural and realistic [66] T. Yu, Z. Li, Z. Zhang, R. Zhang, and D. Radev, “Typesql:
adversarial table perturbation,” in Proceedings of the 60th Annual Knowledge-based type-aware neural text-to-sql generation,” 2018.
Meeting of the Association for Computational Linguistics (Volume [Online]. Available: https://arxiv.org/abs/1804.09769
1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. [67] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured
Dublin, Ireland: Association for Computational Linguistics, May queries from natural language using reinforcement learning,” 2017.
2022, pp. 2007–2022. [Online]. Available: https://aclanthology.org/ [Online]. Available: https://arxiv.org/abs/1709.00103
2022.acl-long.142 [68] X. Xu, C. Liu, and D. Song, “Sqlnet: Generating structured queries
[50] Q. Min, Y. Shi, and Y. Zhang, “A pilot study for Chinese from natural language without reinforcement learning,” 2017. [Online].
SQL semantic parsing,” in Proceedings of the 2019 Conference Available: https://arxiv.org/abs/1711.04436
on Empirical Methods in Natural Language Processing and the [69] T. Yu, M. Yasunaga, K. Yang, R. Zhang, D. Wang, Z. Li, and D. Radev,
9th International Joint Conference on Natural Language Processing “Syntaxsqlnet: Syntax tree networks for complex and cross-domaintext-
(EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Hong to-sql task,” 2018. [Online]. Available: https://arxiv.org/abs/1810.05237
Kong, China: Association for Computational Linguistics, Nov. 2019, [70] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, and
pp. 3652–3658. [Online]. Available: https://aclanthology.org/D19-1377 D. Zhang, “Towards complex text-to-sql in cross-domain database
[51] G. Lee, W. Chay, S. Cho, and E. Choi, “TrustSQL: Benchmarking with intermediate representation,” 2019. [Online]. Available: https:
Text-to-SQL Reliability with Penalty-Based Scoring,” jun 2024. //arxiv.org/abs/1905.08205
[Online]. Available: https://doi.org/10.48550/arXiv.2403.15879 [71] B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson, “Rat-sql:
[52] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, I. L. Relation-aware schema encoding and linking for text-to-sql parsers,”
Heyang Er, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, 2021. [Online]. Available: https://arxiv.org/abs/1911.04942
V. Z. Jonathan Kraft, C. Xiong, R. Socher, and D. Radev, “Sparc: Cross- [72] T. Yu, C.-S. Wu, X. V. Lin, B. Wang, Y. C. Tan, X. Yang,
domain semantic parsing in context,” in Proceedings of the 57th Annual D. Radev, R. Socher, and C. Xiong, “Grappa: Grammar-augmented
17

pre-training for table semantic parsing,” 2021. [Online]. Available: [90] K. Xu, Y. Wang, Y. Wang, Z. Wen, and Y. Dong, “Sead: End-to-end
https://arxiv.org/abs/2009.13845 text-to-sql generation with schema-aware denoising,” 2023. [Online].
[73] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li, J. Wang, C. N. Available: https://arxiv.org/abs/2105.07911
dos Santos, and B. Xiang, “Learning contextual representations [91] O. Rubin and J. Berant, “SmBoP: Semi-autoregressive bottom-up
for semantic parsing with generation-augmented pre-training,” 2020. semantic parsing,” in Proceedings of the 2021 Conference of the North
[Online]. Available: https://arxiv.org/abs/2012.10309 American Chapter of the Association for Computational Linguistics:
[74] T. Yu, R. Zhang, A. Polozov, C. Meek, and A. Awadallah, “Score: Pre- Human Language Technologies, K. Toutanova, A. Rumshisky,
training for context representation in conversational semantic parsing,” L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell,
in ICLR, May 2021. T. Chakraborty, and Y. Zhou, Eds. Online: Association for
[75] P. Yin, G. Neubig, W. tau Yih, and S. Riedel, “Tabert: Pretraining Computational Linguistics, Jun. 2021, pp. 311–324. [Online].
for joint understanding of textual and tabular data,” 2020. [Online]. Available: https://aclanthology.org/2021.naacl-main.29
Available: https://arxiv.org/abs/2005.08314 [92] P. Xu, D. Kumar, W. Yang, W. Zi, K. Tang, C. Huang, J. C. K. Cheung,
[76] J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, and J. Eisenschlos, S. J. D. Prince, and Y. Cao, “Optimizing deeper transformers on small
“Tapas: Weakly supervised table parsing via pre-training,” in datasets,” 2021. [Online]. Available: https://arxiv.org/abs/2012.15355
Proceedings of the 58th Annual Meeting of the Association [93] J. Huang, Y. Wang, Y. Wang, Y. Dong, and Y. Xiao, “Relation aware
for Computational Linguistics. Association for Computational semi-autoregressive semantic parsing for nl2sql,” 2021. [Online].
Linguistics, 2020. [Online]. Available: http://dx.doi.org/10.18653/v1/ Available: https://arxiv.org/abs/2108.00804
2020.acl-main.398 [94] B. Bogin, M. Gardner, and J. Berant, “Representing schema structure
[77] J. M. Eisenschlos, M. Gor, T. Müller, and W. W. Cohen, “Mate: with graph neural networks for text-to-sql parsing,” 2019. [Online].
Multi-view attention for table transformer efficiency,” 2021. [Online]. Available: https://arxiv.org/abs/1905.06241
Available: https://arxiv.org/abs/2109.04312 [95] Z. Chen, L. Chen, Y. Zhao, R. Cao, Z. Xu, S. Zhu, and K. Yu,
[78] J. Yang, A. Gupta, S. Upadhyay, L. He, R. Goel, and S. Paul, “Shadowgnn: Graph projection neural network for text-to-sql parser,”
“Tableformer: Robust transformer modeling for table-text encoding,” 2021. [Online]. Available: https://arxiv.org/abs/2104.04689
2022. [Online]. Available: https://arxiv.org/abs/2203.00274 [96] R. Cai, J. Yuan, B. Xu, and Z. Hao, “Sadga: Structure-aware dual
[79] Q. Liu, B. Chen, J. Guo, M. Ziyadi, Z. Lin, W. Chen, and J.-G. Lou, graph aggregation network for text-to-sql,” 2022. [Online]. Available:
“Tapex: Table pre-training via learning a neural sql executor,” 2022. https://arxiv.org/abs/2111.00653
[Online]. Available: https://arxiv.org/abs/2107.07653 [97] R. Cao, L. Chen, Z. Chen, Y. Zhao, S. Zhu, and K. Yu, “LGESQL:
Line graph enhanced text-to-SQL model with mixed local and
[80] B. Hui, R. Geng, L. Wang, B. Qin, B. Li, J. Sun, and
non-local relations,” in Proceedings of the 59th Annual Meeting of the
Y. Li, “S2 sql: Injecting syntax to question-schema interaction
Association for Computational Linguistics and the 11th International
graph encoder for text-to-sql parsers,” 2022. [Online]. Available:
Joint Conference on Natural Language Processing (Volume 1: Long
https://arxiv.org/abs/2203.06958
Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online:
[81] R.-Z. Wang, Z.-H. Ling, J.-B. Zhou, and Y. Hu, “Tracking interaction
Association for Computational Linguistics, Aug. 2021, pp. 2541–2555.
states for multi-turn text-to-sql semantic parsing,” 2020. [Online].
[Online]. Available: https://aclanthology.org/2021.acl-long.198
Available: https://arxiv.org/abs/2012.04995
[98] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S.
[82] Y. Cai and X. Wan, “IGSQL: Database schema interaction graph Wu, M. Zhong, P. Yin, S. I. Wang, V. Zhong, B. Wang, C. Li, C. Boyle,
based neural model for context-dependent text-to-SQL generation,” A. Ni, Z. Yao, D. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
in Proceedings of the 2020 Conference on Empirical Methods L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and multi-tasking
in Natural Language Processing (EMNLP), B. Webber, T. Cohn, structured knowledge grounding with text-to-text language models,”
Y. He, and Y. Liu, Eds. Online: Association for Computational 2022. [Online]. Available: https://arxiv.org/abs/2201.05966
Linguistics, Nov. 2020, pp. 6903–6912. [Online]. Available: https: [99] W. Huang, X. Zheng, X. Ma, H. Qin, C. Lv, H. Chen, J. Luo,
//aclanthology.org/2020.emnlp-main.560 X. Qi, X. Liu, and M. Magno, “An empirical study of llama3
[83] V. Zhong, M. Lewis, S. I. Wang, and L. Zettlemoyer, “Grounded quantization: From llms to mllms,” 2024. [Online]. Available:
adaptation for zero-shot executable semantic parsing,” in Proceedings https://arxiv.org/abs/2404.14047
of the 2020 Conference on Empirical Methods in Natural Language [100] F. Huang, Z. Yang, J. Jiang, Y. Bei, Y. Zhang, and H. Chen,
Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, “Large language model interaction simulator for cold-start item
Eds. Online: Association for Computational Linguistics, Nov. 2020, recommendation,” 2024. [Online]. Available: https://arxiv.org/abs/
pp. 6869–6882. [Online]. Available: https://aclanthology.org/2020. 2402.09176
emnlp-main.558 [101] S. Xue, C. Jiang, W. Shi, F. Cheng, K. Chen, H. Yang, Z. Zhang,
[84] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, J. He, H. Zhang, G. Wei, W. Zhao, F. Zhou, D. Qi, H. Yi,
R. Socher, and D. Radev, “Editing-based sql query generation for S. Liu, and F. Chen, “Db-gpt: Empowering database interactions
cross-domain context-dependent questions,” 2019. [Online]. Available: with private large language models,” 2024. [Online]. Available:
https://arxiv.org/abs/1909.00786 https://arxiv.org/abs/2312.17449
[85] L. Nan, Y. Zhao, W. Zou, N. Ri, J. Tae, E. Zhang, A. Cohan, and D. R. [102] F. Xu, Z. Wu, Q. Sun, S. Ren, F. Yuan, S. Yuan, Q. Lin,
Radev, “Enhancing few-shot text-to-sql capabilities of large language Y. Qiao, and J. Liu, “Symbol-llm: Towards foundational symbol-
models: A study on prompt design strategies,” vol. abs/2305.12586, centric interface for large language models,” 2024. [Online]. Available:
2023. [Online]. Available: https://api.semanticscholar.org/CorpusID: https://arxiv.org/abs/2311.09278
258833511 [103] S. Chang and E. Fosler-Lussier, “How to prompt llms for text-to-sql:
[86] X. V. Lin, R. Socher, and C. Xiong, “Bridging textual and tabular A study in zero-shot, single-domain, and cross-domain settings,” 2023.
data for cross-domain text-to-sql semantic parsing,” 2020. [Online]. [Online]. Available: https://arxiv.org/abs/2305.11853
Available: https://arxiv.org/abs/2012.12627 [104] H. Zhang, R. Cao, L. Chen, H. Xu, and K. Yu, “Act-sql: In-context
[87] B. Hui, X. Shi, R. Geng, B. Li, Y. Li, J. Sun, and X. Zhu, learning for text-to-sql with automatically-generated chain-of-thought,”
“Improving text-to-sql with schema dependency learning,” 2021. 2023. [Online]. Available: https://arxiv.org/abs/2310.17342
[Online]. Available: https://arxiv.org/abs/2103.04399 [105] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, and H. Zhao, “Chatdb:
[88] W. Lei, W. Wang, Z. Ma, T. Gan, W. Lu, M.-Y. Kan, and Augmenting llms with databases as their symbolic memory,” 2023.
T.-S. Chua, “Re-examining the role of schema linking in text- [Online]. Available: https://arxiv.org/abs/2306.03901
to-SQL,” in Proceedings of the 2020 Conference on Empirical [106] X. Liu and Z. Tan, “Divide and prompt: Chain of thought prompting for
Methods in Natural Language Processing (EMNLP), B. Webber, text-to-sql,” 2023. [Online]. Available: https://arxiv.org/abs/2304.11556
T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for [107] X. Dong, C. Zhang, Y. Ge, Y. Mao, Y. Gao, lu Chen, J. Lin, and
Computational Linguistics, Nov. 2020, pp. 6943–6954. [Online]. D. Lou, “C3: Zero-shot text-to-sql with chatgpt,” 2023. [Online].
Available: https://aclanthology.org/2020.emnlp-main.564 Available: https://arxiv.org/abs/2307.07306
[89] J. Ma, Z. Yan, S. Pang, Y. Zhang, and J. Shen, “Mention extraction [108] Z. Li, S. Fan, Y. Gu, X. Li, Z. Duan, B. Dong, N. Liu, and
and linking for SQL query generation,” in Proceedings of the 2020 J. Wang, “Flexkbqa: A flexible llm-powered framework for few-shot
Conference on Empirical Methods in Natural Language Processing knowledge base question answering,” 2024. [Online]. Available:
(EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Online: https://arxiv.org/abs/2308.12060
Association for Computational Linguistics, Nov. 2020, pp. 6936–6942. [109] Z. Gu, J. Fan, N. Tang, L. Cao, B. Jia, S. Madden, and X. Du,
[Online]. Available: https://aclanthology.org/2020.emnlp-main.563 “Few-shot text-to-sql translation using structure and content prompt
18

[109] Z. Gu, J. Fan, N. Tang, L. Cao, B. Jia, S. Madden, and X. Du, “Few-shot text-to-sql translation using structure and content prompt learning,” Proc. ACM Manag. Data, vol. 1, no. 2, Jun. 2023. [Online]. Available: https://doi.org/10.1145/3589292
[110] L. Nan, Y. Zhao, W. Zou, N. Ri, J. Tae, E. Zhang, A. Cohan, and D. Radev, “Enhancing few-shot text-to-sql capabilities of large language models: A study on prompt design strategies,” 2023. [Online]. Available: https://arxiv.org/abs/2305.12586
[111] R. Sun, S. Ö. Arik, A. Muzio, L. Miculicich, S. Gundabathula, P. Yin, H. Dai, H. Nakhost, R. Sinha, Z. Wang, and T. Pfister, “Sql-palm: Improved large language model adaptation for text-to-sql (extended),” 2024. [Online]. Available: https://arxiv.org/abs/2306.00739
[112] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2024.
[113] T. Wang, H. Lin, X. Han, L. Sun, X. Chen, H. Wang, and Z. Zeng, “Dbcopilot: Scaling natural language querying to massive databases,” 2024. [Online]. Available: https://arxiv.org/abs/2312.03463
[114] J.-P. Zhu, P. Cai, B. Niu, Z. Ni, K. Xu, J. Huang, J. Wan, S. Ma, B. Wang, D. Zhang, L. Tang, and Q. Liu, “Chat2query: A zero-shot automatic exploratory data analysis system with large language models,” in 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2024, pp. 5429–5432.
[115] M. Pourreza and D. Rafiei, “DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.11015
[116] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” 2023. [Online]. Available: https://arxiv.org/abs/2302.01318
[117] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” 2023. [Online]. Available: https://arxiv.org/abs/2211.17192
[118] N. Deng, Y. Chen, and Y. Zhang, “Recent advances in text-to-SQL: A survey of what we have and what we expect,” in Proceedings of the 29th International Conference on Computational Linguistics, N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S.-H. Na, Eds. Gyeongju, Republic of Korea: International Committee on Computational Linguistics, Oct. 2022, pp. 2166–2187. [Online]. Available: https://aclanthology.org/2022.coling-1.190
[119] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
[120] W. Hwang, J. Yim, S. Park, and M. Seo, “A comprehensive exploration on wikisql with table-aware word contextualization,” 2019. [Online]. Available: https://arxiv.org/abs/1902.01069
[121] P. He, Y. Mao, K. Chakrabarti, and W. Chen, “X-sql: Reinforce schema representation with context,” 2019. [Online]. Available: https://arxiv.org/abs/1908.08113
[122] P. Yin, G. Neubig, W.-t. Yih, and S. Riedel, “TaBERT: Pretraining for joint understanding of textual and tabular data,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds. Online: Association for Computational Linguistics, Jul. 2020, pp. 8413–8426. [Online]. Available: https://aclanthology.org/2020.acl-main.745
[123] T. Yu, C.-S. Wu, X. V. Lin, B. Wang, Y. C. Tan, X. Yang, D. Radev, R. Socher, and C. Xiong, “GraPPa: Grammar-augmented pre-training for table semantic parsing,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=kyaIeYj4zZ
[124] S. Kou, L. Hu, Z. He, Z. Deng, and H. Zhang, “Cllms: Consistency large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2403.00835
[125] A. Zhuang, G. Zhang, T. Zheng, X. Du, J. Wang, W. Ren, S. W. Huang, J. Fu, X. Yue, and W. Chen, “Structlm: Towards building generalist models for structured knowledge grounding,” 2024. [Online]. Available: https://arxiv.org/abs/2402.16671
[126] R. Roberson, G. Kaki, and A. Trivedi, “Analyzing the effectiveness of large language models on text-to-sql synthesis,” Jan. 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.12379
[127] F. Zhou, S. Xue, D. Qi, W. Shi, W. Zhao, G. Wei, H. Zhang, C. Jiang, G. Jiang, Z. Chu, and F. Chen, “Db-gpt-hub: Towards open benchmarking text-to-sql empowered by large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.11434
[128] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2106.09685
[129] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” 2023. [Online]. Available: https://arxiv.org/abs/2305.14314
[130] J. Jang, S. Kim, S. Ye, D. Kim, L. Logeswaran, M. Lee, K. Lee, and M. Seo, “Exploring the benefits of training expert language models over instruction tuning,” 2023. [Online]. Available: https://arxiv.org/abs/2302.03202
[131] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau, “Mass-editing memory in a transformer,” 2023. [Online]. Available: https://arxiv.org/abs/2210.07229
[132] H. Li, J. Zhang, C. Li, and H. Chen, “RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023, oral presentation. [Online]. Available: https://doi.org/10.48550/arXiv.2302.05965
[133] D. Rai, B. Wang, Y. Zhou, and Z. Yao, “Improving generalization in language model-based text-to-sql semantic parsing: Two simple semantic boundary-based techniques,” 2023. [Online]. Available: https://arxiv.org/abs/2305.17378
[134] J. Fürst, C. Kosten, F. Nooralahzadeh, Y. Zhang, and K. Stockinger, “Evaluating the data model robustness of text-to-sql systems based on real user queries,” 2024. [Online]. Available: https://arxiv.org/abs/2402.08349
[135] T. Scholak, N. Schucher, and D. Bahdanau, “Picard: Parsing incrementally for constrained auto-regressive decoding from language models,” 2021. [Online]. Available: https://arxiv.org/abs/2109.05093
[136] S. H. K. Parthasarathi, L. Zeng, and D. Hakkani-Tur, “Conversational text-to-sql: An odyssey into state-of-the-art and challenges ahead,” 2023. [Online]. Available: https://arxiv.org/abs/2302.11054
[137] C. Guo, Z. Tian, J. Tang, P. Wang, Z. Wen, K. Yang, and T. Wang, “Prompting gpt-3.5 for text-to-sql with de-semanticization and skeleton retrieval,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13301
[138] M. Saeed, N. D. Cao, and P. Papotti, “Querying large language models with sql,” 2023. [Online]. Available: https://arxiv.org/abs/2304.00472
[139] J. Cen, J. Liu, Z. Li, and J. Wang, “Sqlfixagent: Towards semantic-accurate sql generation via multi-agent collaboration,” arXiv preprint arXiv:2406.13408, 2024.
[140] W. Xie, G. Wu, and B. Zhou, “Mag-sql: Multi-agent generative approach with soft schema linking and iterative sub-sql refinement for text-to-sql,” arXiv preprint arXiv:2408.07930, 2024.
[141] A. Askari, C. Poelitz, and X. Tang, “Magic: Generating self-correction guideline for in-context text-to-sql,” arXiv preprint arXiv:2406.12692, 2024.
[142] K. Maamari and A. Mhedhbi, “End-to-end text-to-sql generation within an analytics insight engine,” arXiv preprint arXiv:2406.12104, 2024.
[143] B. Li, Y. Luo, C. Chai, G. Li, and N. Tang, “The dawn of natural language to sql: Are we fully ready?” arXiv preprint arXiv:2406.01265, 2024.