In this paper, we perform extensive experiments using 14 benchmarks including a new benchmark (WTQ) and TPC-H for measuring the accuracy of NL2SQL methods. We conduct an experiment to identify the problems of the measures introduced above. Specifically, we measure the performance of the existing methods using three existing measures, omitting the fourth one, which involves manual inspection. We use a precise metric considering semantic equivalence and propose a practical tool to measure it automatically. In order to accurately measure the translation accuracy of NL2SQL, we need to judge the semantic equivalence between two SQL queries. However, the existing technique for determining semantic equivalence [11] is not comprehensive enough for our experiments, since it supports only restricted forms of SQL queries. We propose a practical tool that judges semantic equivalence by using database technologies such as query rewriting and database testing, and we measure accuracy based on it. Note that semantic equivalence of SQL queries has long been an important research topic in our field. Nevertheless, there is still no practical tool that supports semantic equivalence checking for complex SQL queries.

The main contributions of this paper are as follows: 1) We provide a survey and a taxonomy of existing NL2SQL methods, and classify the latest methods. 2) We fairly and empirically compare eleven state-of-the-art methods using 14 benchmarks. 3) We show that all the previous studies use misleading measures in their performance evaluation. To solve this problem, we propose a practical tool for validating translation results by taking into account the semantic equivalence of SQL queries. 4) We analyze the experimental results in depth to understand why one method is superior to another for a particular benchmark. 5) We report several surprising and important findings.

The remainder of the paper is organized as follows. Section 2 introduces fundamental natural language processing and deep learning concepts. We introduce existing NL2SQL benchmarks used in recent studies in Section 3. In Section 4, we show a brief history and review the state-of-the-art NL2SQL methods. We explain our validation methodology and propose a practical tool for measuring accuracy in Section 5. Section 6 presents experimental results and in-depth analysis of them. We discuss insights and questions for future research in Section 7. We conclude in Section 8.

2. BACKGROUND

Dependency parsing: A syntactic dependency [15] is a binary asymmetric relation between two words in a sentence. Each dependency bears the grammatical function (e.g., subject, object, determiner, modifier) of one word with respect to another word. It can be represented as a typed and directed arrow from one word to another. All arrows in a sentence usually form a rooted tree, called a syntactic dependency tree, where the words of the given sentence are nodes and h is the parent of d for each arrow h → d. Dependency parsing is the task of finding the syntactic dependencies of a given sentence [23].

Recurrent neural networks (RNNs): A basic RNN takes an input sequence of vectors [x_1, x_2, ..., x_τ] of length τ as well as an initial hidden state h_0, and generates a sequence of hidden states [h_1, h_2, ..., h_τ] and a sequence of output vectors [y_1, y_2, ..., y_τ]. Specifically, h_t at time step t is calculated by h_t = f_θ(x_t, h_{t-1}), where f_θ is a function with a parameter θ, which is commonly referred to as an RNN cell.

If an RNN cell is implemented as just a fully connected layer with an activation function, it does not effectively accumulate information from previous time steps in its hidden state. Such a basic RNN cannot handle long sequences effectively and faces the notorious vanishing and exploding gradient problems [5]. In order to avoid these problems, long short-term memory (LSTM), gated recurrent units (GRUs), and residual networks (ResNets) have been proposed. For example, an LSTM cell maintains an additional cell state c_t which remembers information over time and three gates that regulate the flow of information into and out of the cell. That is, h_t and c_t are computed using the gates from c_{t-1}, h_{t-1}, and x_t.

Sequence-to-sequence models: A sequence-to-sequence model (Seq2Seq) translates a source sentence into a target sentence. It consists of an encoder and a decoder, each of which is implemented by an RNN, usually an LSTM. The encoder takes a source sentence and generates a fixed-size context vector, while the decoder takes the context vector C and generates a target sentence.

Attention mechanism: One fundamental problem in the basic Seq2Seq is that the final RNN hidden state h_τ of the encoder is used as the single context vector for the decoder. Encoding a long sentence into a single context vector leads to information loss and inadequate translation, which is called the hidden state bottleneck problem [2]. To avoid this problem, the attention mechanism has been actively used.

Seq2Seq with attention: At each step of the output word generation in the decoder, we need to examine all the information that the source sentence holds. That is, we want the decoder to attend to different parts of the source sentence at each word generation. The attention mechanism makes Seq2Seq learn what to attend to, based on the source sentence and the target sequence generated so far. The attention encoder passes all the hidden states to the decoder instead of only the last hidden state [1, 27]. In order to focus on the parts of the input that are relevant to the current decoding time step, the attention decoder 1) considers all encoder hidden states, 2) computes a softmax score for each hidden state, and 3) computes a weighted sum of the hidden states using their scores. It thus gives more attention to hidden states with high scores.

Pointer network: Although the attention mechanism solves the hidden state bottleneck problem, there is another fundamental problem in language translation, called the unknown word problem [19]. In the Seq2Seq model, the decoder generates a unique word in a predefined vocabulary using a softmax classifier, where each word corresponds to an output dimension. However, the softmax classifier cannot predict words outside the vocabulary.

In order to solve the unknown word problem, pointer networks have been widely used. A pointer network generates a sequence of pointers to the words of the source sentence [37]. Thus, it can be considered a variation of the Seq2Seq model with attention.

3. BENCHMARKS

Existing NL2SQL methods address limited-scope problems under several explicit or implicit assumptions about q_nl, q_sql, or D. These assumptions are directly related to the characteristics of each method's target benchmark, since each method is optimized for a particular benchmark without considering the general problem.

WikiSQL: WikiSQL [52] is the most widely used and largest benchmark, containing 26,531 tables and 80,654 (q_nl, q_sql) pairs over a given single table. Tables are extracted from HTML tables from Wikipedia. Then, each q_sql is automatically generated for a given table under the constraint that the query produces a non-empty result set. Each q_nl is generated using a simple template and paraphrased through Amazon Mechanical Turk.

All SQL queries in WikiSQL are based on a simple syntax pattern, that is, SELECT <aggregation function> <column name> FROM T [ WHERE <column name> <operator> <constant value> (AND <column name> <operator> <constant value>)* ], where T is a given single table. This allows only a single projected column and selection with conjunctions. Note that this grammar expresses none of grouping, ordering, join, or nested queries.
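To make this restriction concrete, the following minimal Python sketch (ours, purely illustrative and not part of any benchmark release) renders a query from the only kinds of slots the WikiSQL pattern allows; anything beyond one projected column and a conjunctive WHERE clause simply cannot be expressed.

```python
# Illustrative only: the single-table syntax pattern underlying WikiSQL.
AGGS = {"", "MAX", "MIN", "COUNT", "SUM", "AVG"}

def render_wikisql(sel_col, agg="", conditions=(), table="T"):
    """Render SELECT <agg>(<col>) FROM T [WHERE <col> <op> <val> (AND ...)*]."""
    assert agg in AGGS
    projection = f"{agg}({sel_col})" if agg else sel_col
    sql = f"SELECT {projection} FROM {table}"
    if conditions:
        predicates = [f"{col} {op} {val!r}" for col, op, val in conditions]
        sql += " WHERE " + " AND ".join(predicates)  # conjunctions only
    return sql

# render_wikisql("loser", conditions=[("year", "=", 1920)])
#   -> "SELECT loser FROM T WHERE year = 1920"
```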
ATIS/GeoQuery: ATIS [33, 12, 22, 51] and GeoQuery [49, 50, 32, 17, 22] are widely used for semantic parsing, which is the task of translating q_nl into a formal meaning representation. ATIS is about flight booking and contains a database of 25 tables and 5,410 (q_nl, q_sql) pairs. GeoQuery consists of seven tables in the US geography database and 880 (q_nl, q_sql) pairs [32]. Unlike WikiSQL, all queries in ATIS and GeoQuery are on a single domain. Thus, if we use them for training a deep-learning model, the model will work on a specific domain only. Both benchmarks have various queries including join and nested queries. ATIS does not have any grouping or ordering query, whereas GeoQuery does.

MAS: MAS [25] has been proposed to evaluate NaLIR [25]. Microsoft Academic Search provides a database of academic social networks and a set of queries. The authors of NaLIR select 196 queries of them. MAS is on a single domain like ATIS and GeoQuery. It contains a database of 17 tables and 196 (q_nl, q_sql) pairs. MAS has various SQL queries containing join, grouping, and nested queries, but not ordering queries. Each q_nl in MAS has the following constraints. First, q_nl starts with "return me". While a user may pose a query using an interrogative sentence or a list of keywords in real world situations, MAS does not include such cases. Second, each q_nl is grammatically correct.

Spider: Recently, [48] proposed a new NL2SQL benchmark named Spider. [48] claims that existing benchmarks have limited quality, that is, they have a few queries, simple queries only, or on a single database. Spider is a large-scale cross-domain benchmark with 200 databases of 138 different domains and 10,181 (q_nl, q_sql) pairs.

4. NL2SQL METHODS

4.1 A brief history

The construction of natural language interfaces for databases has been studied in both the database and natural language communities for decades. Figure 1 shows its brief history.

In the 1980s, methods using intermediate logical representation were proposed [39, 18]. They translate q_nl into logical queries independent of the underlying database schema, and then convert these logical queries to database queries. However, they still rely on hand-crafted mapping rules for translation.

From the early 2000s, more advanced rule-based methods [31, 25, 35, 43, 3] were proposed. [31] used an off-the-shelf natural language parser in order to integrate the advances in natural language processing without training a parser for a specific database. However, the coverage was limited to semantically tractable questions [31]. This limitation is mainly caused by the assumption that there is a one-to-one correspondence between words in the question and a subset of database elements. In order to broaden coverage, [25, 43, 35] proposed a ranking-based approach. During the mapping between words in the question and the database elements, they found multiple candidate mappings and calculated and ranked the mapping score for each candidate. NaLIR [25] further improved performance with user interaction. ATHENA [35] intermediately used a domain specific ontology to exploit its richer semantic information. SQLizer [43] parses q_nl into a logical form using Sempre [30], then iteratively refines the form. Templar [3] is an optimization technique for mapping and join path generation using a query log. Although these methods achieved significant performance improvement, they still rely on manually-defined rules.

Recently, deep-learning-based (DL-based) methods [22, 52, 42, 24, 4, 44, 21, 53, 46, 47, 6, 20] have been actively proposed in the NLP community by exploiting the state-of-the-art deep learning technologies. One of the main challenges in developing a DL-based NL2SQL method is the lack of training data. NSP [22] used an interactive learning algorithm and a data augmentation technique using templates and paraphrasing. DBPal [4] used a data augmentation technique similar to NSP, which uses more varied templates and more diverse paraphrasing techniques than NSP. On the other hand, a new benchmark named WikiSQL was published in [52]. Accordingly, many studies [42, 24, 44, 21, 53, 46] have been conducted to improve accuracy on WikiSQL. Seq2SQL [52], SQLNet [42], Coarse2Fine [24], and STAMP [53] proposed a new deep-learning model specific to WikiSQL. PT-MAML [21] adapted the latest learning algorithm called meta learning [14] for WikiSQL. TypeSQL [46] tagged each word in the question with a data type or a column name and used the tags as input to its deep learning model. More recently, as [48] proposed a new NL2SQL benchmark named Spider, SyntaxSQLNet [47], GNN [6], and IRNet [20] targeting Spider have been proposed.

4.2 Taxonomy

We provide a general taxonomy for NL2SQL, and then classify the recent methods according to the taxonomy. The taxonomy in Figure 2 considers three dimensions and four sub-dimensions in the technique dimension.

In our survey, we include existing methods published at major conferences of database and natural language processing areas from 2014 to September, 2019. We review a total of 16 methods [25, 35, 43, 3, 42, 24, 44, 21, 53, 46, 47, 6, 20].

4.2.1 Input

Regarding the input dimension, we have four sub-dimensions: pre-processing of q_nl; a table/database as input; schema only and schema + value; and additional inputs (Figure 3). Existing methods take two inputs, q_nl and a database D, consisting of a set of tables along with the database schema S_D. Here, each table has a set of records, and each record has a set of column values, V_D. Note that Templar is presented in the 'technique' dimension only, since it is an optimization technique for rule-based methods.

All existing methods take an English text as q_nl. They use different pre-processing techniques as follows: All DL-based methods use pre-trained embedding vectors of tokens (in q_nl, S_D, or V_D) as Word2Vec [28]. For rule-based methods, NaLIR parses q_nl into the corresponding dependency parse tree. SQLizer transforms q_nl into a logical form using Sempre [30]. ATHENA uses a sequence of words. Some techniques also require additional inputs: an open-domain knowledge base such as Freebase [7] for detecting named entities; a domain-specific dictionary/ontology; a pre-built mapping table between phrases in q_nl and SQL keywords; WordNet [29]; and a word embedding matrix for calculating word similarity.

Although all methods use general pre-processing techniques, and there is no explicit constraint on q_nl, each method has implicit assumptions on q_nl. For example, NaLIR uses a manually-built mapping table between NL phrases and SQL keywords. To generate proper SQL keywords, q_nl must include phrases in the table.

For D, we examine their domain adaptability, constraints, and utilization. Unlike other methods taking a (q_nl, D) pair as an input for inference, NSP and DBPal use D only for vocabulary building during training. During inference, they assume that q_nl is over D seen in the training. Some assume that D consists of a single table only. ATHENA requires a specific dictionary and an ontology for each D. For database utilization, some use S_D only, while others use both S_D and V_D.

4.2.2 Technique: translation

We explain the translation dimension first for a better understanding of the existing methods (Figure 4). In the translation
Figure 1: A brief history of natural language interface to relational databases.
column names, and constant values. PT-MAML decodes a sequence guided by a fixed syntax pattern as in [38], while STAMP dynamically selects the type of layers for each decoding step. SyntaxSQLNet, GNN, and IRNet generate a syntax tree. Specifically, SyntaxSQLNet has nine types of modules for the individual parts of SQL, such as aggregation, the WHERE condition, group by, order by, and intersect/union. It dynamically selects one of the types to be generated for each decoding step, guided by a specific subset of the SQL syntax. GNN and IRNet use a grammar-based sequence-to-tree decoder [41, 9, 45, 34] which generates a derivation tree rather than a sequence of words. The grammar-based decoder can immediately check for grammatical errors at every decoding step, allowing the generation of various SQL queries, including join and nested queries, without syntax errors. The other methods belonging to slot-filling use the same fixed template as the syntax pattern of WikiSQL, so they have three types of decoders, each for projection, aggregation, and selection.

During the translation, IRNet generates an intermediate representation named SemQL. The authors argue that, due to the significant difference between the SQL (context-free) grammar and the natural language grammar, it is difficult to translate q_nl into q_sql directly [20]. SemQL is more abstract than SQL and thus easily captures the intent expressed in q_nl. For example, the SemQL query does not have the GROUP BY, HAVING, or FROM clauses. Instead, only conditions in the WHERE and HAVING clauses need to be expressed. IRNet has a method for translating a SemQL query into the equivalent SQL query. When columns expressed in SemQL are from multiple tables, the method heuristically adds primary key-foreign key (pk-fk) join conditions during the translation to SQL. Clearly, many rich functionalities in SQL are lost in SemQL, which is an inherent disadvantage of SemQL.

Unlike rule-based methods, the advantage of DL-based methods is that they can learn the background knowledge of training examples, so that it is possible to generate the desired SQL even if q_nl is not concrete. For example, in GeoQuery, DL-based methods may understand that a term such as "major cities" means "city.population > 150,000" from the training data.

4.2.3 Technique: input enrichment

Before the translation phase, some NL2SQL methods try to obtain or remove information about inputs. Six of the 16 methods are omitted since there is no input enrichment in these methods.

Figure 5: Taxonomy over the 'input enrichment' dimension.

Tagging: TypeSQL performs string matching between phrases in q_nl and the entities in a database or a knowledge base. Then, for each entity mention, TypeSQL tags it with the name of the corresponding entity (i.e., its type). The tag of each word is embedded into a vector and is concatenated to an embedding vector of the word. A sequence of the concatenated vectors of words in q_nl is used as input to the natural language encoder.

PT-MAML performs tagging only for constant values using similarity matching. After finding constant values in q_nl, PT-MAML normalizes each of them to the corresponding entity in V_D. The type of each entity is added to q_nl just before the position of the entity.

Linking: GNN calculates a linking score between each word in q_nl and each database schema entity (i.e., table or column). This technique, named schema linking, was developed to provide the model with information about which entities of S_D are referenced within q_nl. The schema linking module has two sub-modules; one calculates similarity between the word embedding vectors of the question word and the schema entity, and the other calculates similarity using a neural network whose inputs are features obtained by directly comparing words. The linking scores are input to both the q_nl encoder and the S_D encoder. Unlike the string-matching-based tagging of TypeSQL, GNN's schema linking module is trainable and is learned together with the encoder-decoder.

IRNet also uses a schema-linking technique, like GNN, but the linking of IRNet is based on string matching like TypeSQL. Unlike TypeSQL, however, IRNet does not use V_D and allows partial matching. IRNet tags each entity mention with the type of the corresponding entity. IRNet also tags each referenced entity in a database with a unique ID indicating that it is referenced in the question. IRNet uses heuristic rules to choose one of multiple entities. The tags of each word and each database entity are used as input to the q_nl encoder and to the S_D encoder, respectively.

All rule-based methods construct a mapping between phrases in q_nl and entities of S_D or V_D based on similarity matching. NaLIR, SQLizer, and Templar try to find the best mapping, while ATHENA considers all mapping candidates in the translation phase. For ranking, NaLIR and SQLizer use a similarity-based score, while Templar uses both similarity information and a given SQL query log.

Anonymizing: NSP and DBPal aim to reduce their vocabulary by using constant value anonymization, which converts each constant value into a specific symbol in q_nl and q_sql. DBPal does not specify how it detects constant values in q_nl. NSP uses a simple heuristic algorithm to anonymize a constant value, that is, it finds a value in the given ground truth SQL query by searching for a quotation mark and then matches the value with some words in q_nl. In fact, this is not applicable in practice since the ground truth SQL query for q_nl is unknown. In an ideal situation, NSP would not store any word in its vocabulary except SQL keywords and database schemas. This means that NSP assumes that constant values can always be anonymized, which is impossible at the current level of technology. NSP will fail to translate if a query has any non-anonymous constant values.

4.2.4 Technique: post-translation

Some NL2SQL methods complete the translation by either filling in the missing information or by refining q_sql in the post-translation phase. Figure 6 shows four kinds of post-processing. NSP and DBPal recover anonymized constant values in q_sql. ATHENA, NaLIR, and IRNet, which use an intermediate representation, translate the representation to q_sql. IRNet and SyntaxSQLNet complete q_sql by adding join predicates using a heuristic algorithm. Templar infers join predicates by using the SQL query log. NaLIR and NSP can utilize user feedback about the correctness of the translation. DialSQL can encode a dialogue, so that it refines q_sql repeatedly throughout the dialogue with a user.

Figure 6: Taxonomy over the 'post-translation' dimension.

4.2.5 Technique: training

Every DL-based method trains its model by supervised learning. The following six DL-based methods propose modified learning algorithms. NSP and DBPal augment training data with predefined templates and paraphrasing techniques, such as [16]. In order to fine-tune the selection module, Seq2SQL and STAMP additionally use reinforcement learning, which compares the execution results of the generated SQL statement with those of the corresponding ground truth. PT-MAML exploits meta-learning techniques [14] by dividing queries into six tasks according to the aggregation function: sum, avg, min, max, count, and select. DialSQL constructs a simulator to generate dialogues used in training.
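As a rough illustration of the execution-guided signal used by Seq2SQL and STAMP (a simplified sketch of the idea, not their actual implementation), such a reward can be computed by running both queries and comparing their result sets:

```python
import sqlite3

def execution_match_reward(pred_sql, gold_sql, conn: sqlite3.Connection) -> float:
    """Toy reward: penalize queries that do not run, reward queries whose
    execution result matches that of the ground-truth query."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return -2.0  # generated query is not even executable
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets of stringified rows to sidestep ordering and typing issues.
    same = sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
    return 1.0 if same else -1.0
```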
Templar uses an SQL query log (i.e., a set of SQL queries) to reinforce the keyword mapping and join inference. Specifically, given D and an SQL log on D, it decomposes each query in the log into individual query fragments and then constructs a query fragment graph for D; the graph statistically captures the occurrence and co-occurrence frequencies of query fragments in the log. Then, it defines specific score functions for both keyword mapping and join inference based on the query fragment graph.

4.2.6 Output

Figure 7 shows a taxonomy over the 'output' dimension. The supported SQL syntax of every method is limited, but it is often unclear, especially for rule-based methods. Hence, we group existing methods into three categories, and we indicate which design choice leads to the restriction. There are four reasons: 1) pre-defined syntax patterns and/or types of slots, 2) heuristic translation rules, 3) intermediate representations with limited coverage of syntax, and 4) limited training examples.

Figure 7: Taxonomy over the 'output' dimension.

All methods generate ranked SQL queries and return the best one. The rule-based methods use specific ranking algorithms as explained in Section 4.2.2, while the DL-based methods use implicit ranking based on softmax scores.

5. VALIDATION METHODOLOGY

5.1 Semantic equivalence

One important issue in NL2SQL is how to measure translation accuracy. Translation accuracy in NL2SQL is measured as the number of correctly translated SQL queries over the total number of test queries. We can judge the correctness by comparing each SQL query translated by an NL2SQL method with the gold SQL query given in the test dataset. In this section, we propose a tool for validating translation results, which is based on semantic equivalence, to overcome the limitations of existing measures.

First, we need a formal definition of the semantic equivalence of two SQL queries. Given two SQL queries q_1 and q_2, they are semantically equivalent iff they always return the same results on any input database instance. Otherwise, they are semantically inequivalent. To correctly compare two SQL queries, we must use the semantic equivalence of the two queries.

Existing NL2SQL methods do not consider semantic equivalence, or require a lot of manual effort when measuring accuracy. Figure 8 shows the validation methodology of each method.

Figure 8: Taxonomy: validation methodology.

5.2 Validation tool

Despite a lot of research on semantic equivalence, state-of-the-art tools such as Cosette [11, 10] support a limited form of SQL queries. Thus, we have to rely on expensive manual efforts for judging semantic equivalence.

In order to reduce the manual effort significantly, we exploit result matching and parse tree matching. Both can be used as early termination conditions in our tool. That is, if two queries have different execution results, we are sure that the queries are semantically inequivalent. If two queries have the same syntactic structure, the queries are semantically equivalent.

We improve the quality of result matching by using database testing techniques [8]. In database testing, we generate datasets (i.e., database instances), run queries over these datasets, and find bugs in database engines. Thus, we generate different datasets so that every query has non-empty results for at least one of these datasets. As we increase the number of database instances, different queries would produce different execution results for some database instances. Thus, we can effectively judge that two queries are semantically inequivalent by executing them on the generated datasets.

We improve the quality of syntactic matching by exploiting query rewriting techniques. We use a query rewriter which transforms an SQL query into an equivalent, normalized query. This is done by using various rewrite rules. If two SQL queries are semantically equivalent to each other, the rewriter is likely to transform them into the same normalized query. By comparing two rewritten queries, we can determine whether they are semantically equivalent.

In this paper, we propose a multi-level framework for determining the semantic equivalence of two SQL queries (Figure 9). Note that the order of the individual steps does not affect the overall effectiveness (i.e., the number of resolved cases of whether two queries are semantically equivalent or inequivalent). First, we compare the execution results of the two SQL queries. If their execution results are not equivalent, we determine that they are semantically inequivalent. However, when the size of a given database is small, it is highly likely that two completely different SQL queries return the same empty results. We resolve this problem by comparing execution results on the generated datasets using the database testing technique as well as on the given database. Next, we use an existing prover, such as Cosette, that exploits automated constraint solving and interactive theorem proving and returns a counterexample or a proof of equivalence for a limited set of queries. For queries that are not supported by the prover, we use the query rewriter in a commercial DBMS and compare the parse trees of the two rewritten SQL queries. Specifically, two queries are equivalent if their parse trees are structurally identical; i.e., every node in one parse tree has a corresponding matching node in the other. After this series of comparisons, unresolved cases can be manually verified. Alternatively, the tool determines that any unresolved cases are semantically inequivalent, which can lead to slightly incorrect decisions. In our extensive experiments, our tool achieved 99.61% accuracy.
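The following Python sketch (ours, for illustration only) outlines this control flow; the callables run, prove, and normalize are hypothetical stand-ins for the result-matching, prover, and rewrite-based comparison components described above rather than a concrete API.

```python
def semantically_equivalent(q1, q2, given_db, generated_dbs, run, prove, normalize):
    """Simplified control flow of the multi-level check (cf. Figure 9).
    run(q, db) executes a query and returns its result; prove(q1, q2) is an
    automated prover (e.g., Cosette) returning True/False, or None for
    unsupported queries; normalize(q) returns the parse tree of the query
    after rewriting. Returns True, False, or None (unresolved)."""
    # Step 1: result matching on the given database.
    if run(q1, given_db) != run(q2, given_db):
        return False                      # provably inequivalent
    # Step 2: result matching on generated database instances, which avoids
    # accidental matches such as two different queries both returning empty
    # results on a small database.
    for db in generated_dbs:
        if run(q1, db) != run(q2, db):
            return False
    # Step 3: ask a prover for the restricted class of queries it supports.
    verdict = prove(q1, q2)
    if verdict is not None:
        return verdict
    # Step 4: rewrite both queries and compare their normalized parse trees.
    if normalize(q1) == normalize(q2):
        return True
    return None                           # left for manual inspection
```

In a fully automatic setting, a None outcome can simply be counted as inequivalent, matching the behavior described above.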
Figure 9: A flowchart of our multi-level framework.

We implement the multi-level validation tool by using 1) EvoSQL [8] as a database instance generator, 2) Cosette as a prover, and 3) the query rewriter in IBM DB2. The manual validation is done by graduate-level computer science students.

6. EXPERIMENTS

In this section, we first show the effectiveness of the proposed validation tool in Section 5, and then evaluate the performance of the eleven methods reviewed in Section 4. [35, 43, 4, 53, 44] are excluded from our evaluation since the authors did not disclose their source codes or binary executables. This performance evaluation consists mainly of two parts: 1) experiments using simple queries (following the syntax pattern in WikiSQL) (Section 6.2), and 2) experiments using complex queries (Section 6.3).

The main goals of this experimental study are as follows: 1) We show that all existing accuracy measures are misleading. 2) We evaluate the effectiveness of our validation tool. 3) We evaluate the performance of the eleven NL2SQL methods by using 13 benchmarks including a new benchmark (WTQ). We additionally use TPC-H (http://www.tpc.org/tpch/). 4) We analyze translation errors in depth and identify the advantages and disadvantages of each method.

Benchmarks. We use a total of 13 NL2SQL benchmarks including a newly released benchmark, WTQ. We additionally use TPC-H. The WTQ benchmark consists of 9,287 randomly sampled questions from the existing WikiTableQuestions [30]. WikiTableQuestions has a salient feature, compared to the existing benchmarks: complex questions in different domains. WikiTableQuestions consists of questions for web tables on various domains, and it has complex queries that include various operations such as ordering, grouping, and nested queries. However, the WikiTableQuestions benchmark contains q_nl's and their execution results without gold q_sql's. Thus, we collect q_sql's through crowd-sourcing.

Table 1 shows the statistics of the 13 benchmarks. If the data split (i.e., training, validation, and test examples) is published, we use the published one, otherwise we perform the random split with the ratio of 11:1:5.6, for training, validation, and test examples, respectively. For the Advising benchmark, we use two versions of splits published in [13], namely question-based split and query-based split. The question-based split is a traditional method used in the other benchmarks. This method regards each pair (q_nl, q_sql) as a single item so that each pair belongs to either a training set, a validation set, or a test set. Meanwhile, the query-based split ensures that each q_sql belongs to either a training set, a validation set, or a test set. We use validation examples for evaluating on Spider since test examples of Spider are not published. For MAS, IMDB, and YELP, we manually wrote a q_sql for each q_nl, since these benchmarks do not contain gold SQL queries. For ATIS, we manually removed 93 incorrect examples.

Table 1: The statistics for the NL2SQL benchmarks.
Benchmark                     | Total queries | Training queries | Validation queries | Test queries | Tables | Rows  | Size (MB)
WikiSQL [52]                  | 80654         | 56355            | 8421               | 15878        | 26531  | 459k  | 420
ATIS [33, 12, 22, 51]         | 5317          | 4379             | 491                | 447          | 25     | 162k  | 39.2
Advising [13] (querysplit)    | 4387          | 2040             | 515                | 1832         | 15     | 332k  | 43.8
Advising [13] (questionsplit) | 4387          | 3585             | 229                | 573          | 15     | 332k  | 43.8
GeoQuery [49, 50, 32, 17, 22] | 880           | 550              | 50                 | 280          | 7      | 937   | 0.14
Scholar [22]                  | 816           | 498              | 100                | 218          | 10     | 144M  | 8776
Patients [4]                  | 342           | 214              | 19                 | 109          | 1      | 100   | 0.016
Restaurant [36, 32]           | 251           | 157              | 14                 | 80           | 3      | 18.7k | 3.05
MAS [25]                      | 196           | 123              | 11                 | 62           | 17     | 54.3M | 4270
IMDB [43]                     | 131           | 82               | 7                  | 42           | 16     | 39.7M | 1812
YELP [43]                     | 128           | 80               | 7                  | 41           | 7      | 4.48M | 2232
Spider [48]                   | 9693          | 8659             | 1034               | -            | 873    | 1.57M | 184
WTQ [30] [Ours]               | 9287          | 5804             | 528                | 2955         | 2102   | 58.0k | 35.6

Since a simple query is defined for a given table, it can only be found in WikiSQL, Patients, and WTQ where a table is specified for each query. Table 2 shows the number of simple queries, indicated by the suffix "-s" to distinguish them from the original benchmarks. Note that WikiSQL-s is same as WikiSQL.

Table 2: The statistics for simple queries.
Benchmark  | Training queries | Validation queries | Test queries | Total queries
WikiSQL-s  | 56355            | 8421               | 15878        | 80654
Patients-s | 61               | 8                  | 33           | 102
WTQ-s      | 1090             | 63                 | 315          | 1468

Experimental setup. We use the source codes provided by the authors of each method. In order to show the effect on accuracy caused by the use of database entities, we evaluate the accuracy of TypeSQL with database values (TypeSQL-C) and without them (TypeSQL-NC). For Templar, we evaluate the augmented NaLIR as in [3] by extending its SQL parser to support the benchmarks used in our experiments. We use q_sql's in training and validation sets for the SQL log. For Spider, we use only the training set as the log. We do not evaluate the augmented Pipeline since it requires manual parsing of q_nl into a set of keywords for all examples. For NaLIR and Templar, which assume that the SQL execution result is a set, we measure their accuracy by ignoring the DISTINCT keyword in the SELECT clause. We fixed a bug in the source code of PT-MAML that caused an error if a test query contains words not in the training data. For DL-based methods, if the authors published a hyper-parameter setting for a particular dataset, we used the same setting. Otherwise, we performed extensive grid searches using the hyper-parameters in Table 3, and repeated all experiments five times. For NSP on ATIS, the accuracy variance was relatively high, so we repeated the experiment ten times. The hyper-parameter tuning took about 3,600 GPU hours.
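For concreteness, a grid search over the settings in Table 3 can be sketched as below (illustrative only; the train_and_validate callable is a hypothetical stand-in for each method's own training and validation routine, which is not shown here).

```python
from itertools import product

# Hyper-parameter grid from Table 3.
GRID = {
    "word_embedding_dim": [100, 300],
    "num_layers": [1, 2],
    "dropout": [0.3, 0.5],
    "context_vector_dim": [50, 300, 600, 800],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [64, 200],
    "lr_decay": [1, 0.98, 0.8],
}

def grid_search(train_and_validate, repeats=5):
    """Try every combination in the grid, averaging the validation loss over
    `repeats` runs, and return the best configuration found."""
    best_loss, best_config = float("inf"), None
    for values in product(*GRID.values()):
        config = dict(zip(GRID.keys(), values))
        loss = sum(train_and_validate(config) for _ in range(repeats)) / repeats
        if loss < best_loss:
            best_loss, best_config = loss, config
    return best_config
```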
Coarse2Fine raises an error if all SQL queries in a mini-batch do not contain a WHERE clause, due to the numerical instability in its algorithm. This error occurred in about 40% of experiments during tuning hyper-parameters on WTQ-s. We applied the following stopping criteria: Each model was trained for 300 epochs, and we set the checkpoint at the epoch that had the lowest validation loss. Our codes and benchmarks are publicly available at https://github.com/postech-db-lab-starlab/NL2SQL.

Table 3: Hyper-parameters.
The dimension of a word embedding vector | {100, 300}
The number of layers                     | {1, 2}
Dropout rate                             | {0.3, 0.5}
The dimension of context vector          | {50, 300, 600, 800}
Learning rate (LR)                       | {1e-3, 1e-4}
Batch size                               | {64, 200}
LR decay                                 | {1, 0.98, 0.8}

Reproduction of DL-based methods. Table 4 contains accuracy results of the original papers and our experiments with the original source codes. Note that existing codes for measuring accuracy have bugs, but we use them solely for the reproduction purposes. For example, when a translated query contains syntax errors, acc_ex in NSP judges that the query returns an empty result. Thus, when the corresponding gold SQL returns an empty result, acc_ex in NSP determines that they are equivalent, which is wrong. acc_syn of GNN/IRNet has bugs in comparison with join predicates and group by columns. Except for the results of GNN on Spider, we reproduced all experiments with an accuracy difference of at most 2.14%. The small discrepancies occurred whenever the authors did not specify the random seed they used. For GNN, the authors published a hyper-parameter set which is better than that which was used in their original paper. Hence, our result shows a higher accuracy.

Table 4: Reproduction of the accuracy in original papers (%).
Method       | Benchmark | Original | Ours
NSP          | GeoQuery  | 82.5     | 80.36
NSP          | ATIS      | 79.24    | 78.13
NSP          | Scholar   | 67       | 67.43
TypeSQL-NC   | WikiSQL   | 66.7     | 67.06
SQLNet       | WikiSQL   | 61.3     | 61.27
Seq2SQL      | WikiSQL   | 51.6     | 51.33
TypeSQL-C    | WikiSQL   | 75.4     | 74.97
PT-MAML      | WikiSQL   | 62.8     | 62.72
Coarse2Fine  | WikiSQL   | 71.7     | 70.78
SyntaxSQLNet | Spider    | 18.9     | 17.4
GNN          | Spider    | 40.7     | 47.2
IRNet        | Spider    | 53.2     | 53.0

6.1 Validating translation results

We show that all accuracy measures used in the previous studies are misleading. We define acc_sem as N_sem / N, where N_sem is the number of generated queries that are semantically equivalent to the gold SQL queries, and N is the total number of queries. acc_ex is an accuracy measure comparing the execution results of two SQL queries on a given database. acc_str is calculated by comparing two SQL queries through string matching. acc_syn is based on syntactic equivalence of two SQL queries. For the first step in Figure 9 and acc_ex, we set an execution timeout of 30 seconds for each q_sql.

Table 5 shows the comparison results of accuracy measures on various benchmarks. In this experiment, we measure accuracy for the translation results of NSP. This experiment aims to compare various accuracy measures, and it does not matter which NL2SQL method is used. Using simple queries, however, the difference among measures may not be revealed. Thus, we select NSP, which can translate complex queries, without loss of generality. The results show that the measurement error of existing measures is significant. acc_syn, acc_ex, and acc_str differ from acc_sem by up to 59.96% on ATIS, 27.58% on Advising (questionsplit), and 70.25% on ATIS, respectively. These would be even larger as SQL queries become more complex.

Table 5: Comparison of accuracy measures (%).
Benchmark                | acc_sem | acc_syn | acc_ex | acc_str
WikiSQL                  | 9.47    | 9.47    | 25.67  | 0.0
ATIS                     | 70.25   | 10.29   | 77.18  | 0.0
Advising (querysplit)    | 0.27    | 0.27    | 29.8   | 0.27
Advising (questionsplit) | 44.85   | 44.15   | 72.43  | 44.15
GeoQuery                 | 70.0    | 69.29   | 78.21  | 68.93
Scholar                  | 48.17   | 37.61   | 46.79  | 33.49
Patients                 | 69.72   | 68.81   | 69.72  | 68.81
Restaurant               | 63.75   | 63.75   | 76.25  | 56.25
MAS                      | 51.61   | 46.77   | 53.23  | 46.77
IMDB                     | 16.67   | 16.67   | 19.05  | 16.67
YELP                     | 0.0     | 0.0     | 19.51  | 0.0
Spider                   | 0.0     | 0.0     | 0.87   | 0.0
WTQ                      | 2.06    | 1.83    | 5.41   | 0.03

We now show the effectiveness of our multi-level validation tool. First, we demonstrate that each step of our tool is effective by calculating the number of resolved cases at each step. We perform this evaluation on the translation results of NSP. To ensure generality, we excluded WikiSQL and Spider from this experiment; WikiSQL has simple queries only, for which semantic equivalence is easy to determine. NSP shows acc_ex near to 0% on Spider so that it is easy to determine semantic inequivalence on Spider. We evaluate on the other eleven benchmarks.

Figure 10 shows the ratio of resolved pairs of the translated and gold queries at each step for each benchmark. Each legend corresponds to one of the six flows shown in Figure 9. As shown in Figure 10, the ratio of resolved cases at each step varies largely depending on the dataset. The average ratio of resolved cases on the eleven benchmarks at "(A)," "(B)," "(C)," "(D)," "(E)," "(F)," and "(G)" is 69.59%, 10.27%, 6.39%, 0.77%, 9.64%, 0.39%, and 2.95%, respectively. The sum of the ratios of "(F)" and "(G)" is the percentage of queries unresolved by our tool's automatic process; 3.34% of the total pairs were unresolved by the automated process. While we must rely on the manual inspection to achieve 100% accuracy, our tool achieves 99.61% accuracy on average in the eleven benchmarks by determining all of the unresolved cases to be inequivalent. Although the effect of the proposed tool may vary depending on the dataset, it has a very small error (0.39%) on average.

Figure 10: Resolved cases (%) for each step of our validation tool.

We further measure the number of unresolved cases when activating only one step at a time, i.e., each step corresponding to "(A)," "(B)," "(C)," and "(D)," or "(E)." The average ratio of unresolved cases on the eleven benchmarks is 30.41%, 48.76%, 83.87%, and 83.69%, respectively. These values are much larger than the value of 3.34% which uses all steps. The results show that our tool consists of complementary steps, and integrating these steps is an effective approach for removing unresolved cases.

6.2 Experiments using simple queries

Since SyntaxSQLNet, GNN, and IRNet generate SQL queries ignoring constant values, we compare their translation results with gold SQL queries without constant values. Given a generated SQL
query and a gold SQL query, we replace all constant values in both with indicators. Then, we compare the two queries using our validation tool. We denote the accuracy as acc_-val.

Table 6 shows acc_sem and acc_-val of all methods on the three benchmarks which have simple queries. In order to analyze errors in depth, we measure the accuracy for each part of the SQL query (Table 7). We report accuracy values for projected columns (acc_sel), aggregation functions in the SELECT clause (acc_agg), and columns, operators, and constant values in the WHERE clause (acc_wh,col, acc_wh,op, and acc_wh,val, respectively).

Table 6: Accuracy on simple queries (%).
             | WikiSQL            | Patients-s         | WTQ-s
Method       | acc_sem | acc_-val | acc_sem | acc_-val | acc_sem | acc_-val
NaLIR        | 0.49    | 0.50     | 0.00    | 0.00     | 1.59    | 1.59
Templar      | 0.49    | 0.50     | 0.00    | 0.00     | 1.59    | 1.59
NSP          | 9.49    | 10.84    | 81.82   | 81.82    | 0.32    | 0.32
SyntaxSQLNet | -       | 49.91    | -       | 21.21    | -       | 21.59
Seq2SQL      | 51.33   | 54.02    | 33.33   | 33.33    | 1.27    | 2.54
PT-MAML      | 60.65   | 63.54    | 39.39   | 39.93    | 18.41   | 20.00
SQLNet       | 61.27   | 67.92    | 33.33   | 39.39    | 1.27    | 2.22
TypeSQL-NC   | 67.14   | 72.50    | 45.45   | 63.64    | 5.71    | 9.84
Coarse2Fine  | 70.78   | 72.56    | 42.42   | 45.45    | 2.54    | 6.98
TypeSQL-C    | 74.97   | 76.52    | 48.48   | 51.52    | 10.16   | 14.60
IRNet        | -       | 73.60    | -       | 57.58    | -       | 44.13
GNN          | -       | 79.01    | -       | 60.61    | -       | 4.44

Table 7: Partial accuracy on simple queries (%).
Benchmark  | Method       | acc_sel | acc_agg | acc_wh,col | acc_wh,op | acc_wh,val
WikiSQL    | NaLIR        | 0.80    | 0.96    | 0.74       | 0.83      | 0.77
WikiSQL    | Templar      | 0.80    | 0.96    | 0.74       | 0.83      | 0.77
WikiSQL    | NSP          | 41.92   | 75.99   | 32.20      | 67.69     | 37.66
WikiSQL    | SyntaxSQLNet | 76.38   | 89.54   | 68.21      | 92.51     | -
WikiSQL    | Seq2SQL      | 88.52   | 89.75   | 65.43      | 92.54     | 84.05
WikiSQL    | PT-MAML      | 85.70   | 88.63   | 82.23      | 90.12     | 85.39
WikiSQL    | SQLNet       | 90.75   | 90.16   | 81.93      | 92.57     | 81.57
WikiSQL    | TypeSQL-NC   | 92.51   | 89.94   | 86.44      | 93.97     | 86.01
WikiSQL    | Coarse2Fine  | 91.46   | 90.41   | 86.01      | 96.13     | 92.87
WikiSQL    | TypeSQL-C    | 92.17   | 90.12   | 92.13      | 96.12     | 94.97
WikiSQL    | IRNet        | 93.37   | 89.71   | 85.43      | 93.69     | -
WikiSQL    | GNN          | 94.43   | 90.20   | 92.35      | 94.01     | -
Patients-s | NaLIR        | 6.06    | 6.06    | 15.15      | 15.15     | 15.15
Patients-s | Templar      | 3.03    | 6.06    | 6.06       | 6.06      | 6.06
Patients-s | NSP          | 93.94   | 90.91   | 93.94      | 84.85     | 93.94
Patients-s | SyntaxSQLNet | 72.73   | 51.52   | 66.67      | 72.73     | -
Patients-s | Seq2SQL      | 87.88   | 63.64   | 60.61      | 63.64     | 60.61
Patients-s | PT-MAML      | 93.94   | 57.58   | 81.82      | 75.76     | 93.94
Patients-s | SQLNet       | 81.82   | 63.64   | 90.91      | 84.85     | 84.85
Patients-s | TypeSQL-NC   | 96.97   | 69.70   | 90.91      | 90.91     | 72.73
Patients-s | Coarse2Fine  | 87.88   | 51.52   | 87.88      | 90.91     | 96.97
Patients-s | TypeSQL-C    | 90.91   | 72.73   | 81.82      | 84.85     | 96.97
Patients-s | IRNet        | 72.73   | 66.67   | 75.76      | 78.79     | -
Patients-s | GNN          | 87.88   | 78.79   | 93.94      | 81.82     | -
WTQ-s      | NaLIR        | 2.22    | 3.49    | 1.59       | 1.90      | 1.90
WTQ-s      | Templar      | 2.22    | 3.49    | 1.59       | 1.90      | 1.90
WTQ-s      | NSP          | 16.51   | 33.65   | 11.43      | 29.84     | 9.21
WTQ-s      | SyntaxSQLNet | 43.17   | 60.00   | 34.29      | 50.48     | -
WTQ-s      | Seq2SQL      | 26.35   | 85.08   | 28.89      | 65.71     | 27.30
WTQ-s      | PT-MAML      | 36.51   | 49.52   | 38.10      | 43.17     | 40.63
WTQ-s      | SQLNet       | 26.03   | 88.57   | 27.30      | 66.35     | 28.25
WTQ-s      | TypeSQL-NC   | 40.00   | 87.30   | 37.78      | 67.62     | 32.70
WTQ-s      | Coarse2Fine  | 27.30   | 86.67   | 27.94      | 76.19     | 18.41
WTQ-s      | TypeSQL-C    | 39.68   | 90.16   | 35.56      | 68.25     | 36.51
WTQ-s      | IRNet        | 62.86   | 72.38   | 51.43      | 65.40     | -
WTQ-s      | GNN          | 21.27   | 50.16   | 13.33      | 31.75     | -

Rule-based methods: The rule-based methods show low accuracy on all benchmarks. This is mainly due to the mapping table used in the translation process, which significantly limits the q_nl's that can be handled. For example, NaLIR mis-translates the question (A) in Table 8, since the mapping table does not have a mapping from the "longest" phrase to the "MAX" function. The accuracy of NaLIR might be slightly improved by extending the mapping table for each benchmark, but it must be done manually and requires considerable effort. Although Templar enhances the mapping by using an SQL log, such a linguistic mapping cannot be captured without using q_nl's together. Since this fundamental problem is left unsolved, it does not improve accuracy in this experiment. This is because all three benchmarks have little or no training query per table, so that using a query log is not helpful. The join path generation technique of Templar also has no impact for single-table queries.

Adaptability to unseen databases: If a method performs well for unseen databases, we say that the method has high adaptability. NSP performs poorly for WikiSQL compared to the other DL-based methods since NSP has limited adaptability to new databases. In NSP, entities in S_D are stored as a part of the output vocabulary. Therefore, NSP cannot support queries on tables not seen in training data. WikiSQL has 26,531 tables, and the vocabulary size is 55,294. The large size of vocabulary also results in significant performance degradation of NSP. On the other hand, the other methods need not maintain large vocabularies, since they use a schema as an input and the pointer mechanism. Note that in Patients, the output vocabulary size is 45. Another issue is that NSP cannot take a particular database as input. This is due to the inability of NSP to utilize the additional information about the table. We tested a modified NSP which selects the given table T for each q_nl in the FROM clause, and selects columns from T. However, acc_sem of NSP stays at 11.5%, which was still far lower than the other DL-based methods. Note that these two problems also occur on WTQ-s.

Robustness with small datasets: The data augmentation technique of NSP is helpful for Patients-s since the number of training queries in Patients-s is relatively small (61 queries). In order to show the effectiveness of the data augmentation, we conduct additional experiments by applying the technique to all DL-based methods. The numbers of training examples after augmentation for Patients and Patients-s are 1,077 and 500, respectively. The results show that most methods benefit from the data augmentation technique (0%-27.27%).

Even after applying the data augmentation technique to all methods, NSP still has the best performance (81.82%) for Patients-s. The first reason is that 3.3% of test queries in Patients-s cannot be generated by most DL-based methods except for NSP, GNN, and IRNet. Since they generate constant values by using the pointer-to-words mechanism over q_nl, they generate an incorrect q_sql if a constant value does not appear in q_nl. For example, the question (B) in Table 8 does not have the constant value "flu" which is a part of the gold q_sql. On the other hand, NSP can generate constant values correctly if the values exist in the output vocabulary. Furthermore, it was helpful for NSP to use all training queries for training. The accuracy of NSP trained with simple queries stays at 69.7%, which is the same level as the accuracy of TypeSQL-NC with the data augmentation. Even though SyntaxSQLNet, GNN, and IRNet do not suffer from the problems described above, their accuracy is lower than that of NSP. This shows that the simple model of NSP can be more powerful on benchmarks with small numbers of training examples. For example, given the question (A) in Table 8, NSP predicts the correct column in the SELECT clause. NSP can infer the mapping between 'hospitalization period' in q_nl and the column 'length of stay' from four training examples. However, SyntaxSQLNet, GNN, and IRNet select the column 'age,' which is the most frequent one in the training data. Since the other methods are clearly inferior to these methods, we omit their detailed analysis due to space limitation.

NL complexity: The complexity of natural language is affected by various factors such as linguistic diversity in questions, a variety of domains, target operations of SQL, and number of sentences. WTQ has high complexity since it has diverse forms of questions including multiple independent clauses, 2,102 tables from diverse
domains, and complex queries containing group by, order by, and nested queries. In this section, we first examine the first two, linguistic diversity in questions and the variety of domains, with simple queries. The other two points will be discussed in Section 6.3.

All methods have significantly poorer performance on WTQ-s than WikiSQL. Since WTQ-s has a wide variety of q_nl's, various (unseen) questions are in the test data. For example, the question (C) in Table 8 has multiple independent clauses and negation, where all methods fail to translate properly. Furthermore, complicated column names/constant values in a number of tables result in more complex questions. For example, in the question (D) in Table 8, the column name 'number of autos da fe' is not English. Most DL-based methods do not understand any column name which cannot be found in the word embedding matrix.

Aligning table/column references in q_nl to S_D: Exploiting the associations between q_nl and S_D can greatly affect the accuracy. On WikiSQL, GNN has the highest value of acc_-val (79.01%), and TypeSQL-C and IRNet hold the second and third ranks, respectively. According to Table 7, GNN accurately predicted the column names compared to the other methods. This is due to the schema linking technique used by GNN. TypeSQL-C and IRNet also show better accuracy of column prediction than the others except GNN. Both use the tag information obtained by leveraging a database schema. In conclusion, how well we exploit the association greatly affects the accuracy in WikiSQL.

IRNet has significantly higher accuracy than all the other methods on WTQ-s. acc_wh,op or acc_agg of IRNet is similar to or less than those of other methods. However, IRNet shows significantly better accuracy in generating column names (acc_sel and acc_wh,col). IRNet chooses columns more accurately than other methods, especially when the column names are long and there are several similar columns in the table. For example, IRNet is the only method that correctly translates the question (D) in Table 8. IRNet performs tagging in advance by comparing table/column names and phrases in q_nl during pre-processing, which can be a great help for WTQ-s. TypeSQL and GNN have similar modules, but they are not suitable for WTQ-s, which has long and varied column names containing unusual words; TypeSQL does not allow partial string matching and performs tagging rather than schema linking, and the schema linking of GNN is based on pre-trained word embeddings.

Effectiveness of learning algorithms: We perform an additional experiment with Seq2SQL and PT-MAML, which use reinforcement learning and meta learning respectively, after changing their learning algorithm to supervised learning with teacher forcing [40], which is used by the other DL-based NL2SQL methods. Without their learning algorithms, Seq2SQL and PT-MAML showed averages of 2.9% and 4.9% degradation in accuracy, respectively. Both methods gain their accuracy on average from their learning algorithms, but they are not superior to other DL-based methods.

6.3 Experiments using complex queries

We evaluate the existing methods on all test queries in twelve benchmarks, excluding WikiSQL which has simple queries only. Table 9 shows the accuracy of six methods, NaLIR, Templar, NSP, SyntaxSQLNet, GNN, and IRNet. We exclude the other methods from this experiment since they can support simple queries only. In summary, all six methods have serious errors when translating complex queries in the twelve benchmarks, as seen in Table 9.

Rule-based methods: Rule-based methods show low accuracy in general, mostly due to the same issue discussed in Section 6.2. While Templar has the same or better accuracy than NaLIR on most benchmarks, it shows lower accuracy than NaLIR on MAS. We observe that, for MAS, Templar often fails to find the correct mapping. Given two query fragments, qf_1 and qf_2, if qf_2 has larger co-occurrences in the query log than qf_1, the mapping of Templar can be distracted to choose qf_2, even if qf_1 is more semantically aligned with a given question. For question (E) in Table 8, 'domain.name' is qf_1 and 'publication.title' is qf_2 in our example. Templar incorrectly maps 'area' in q_nl to 'publication.title,' while NaLIR correctly maps it to 'domain.name.'

Adaptability: On Spider, SaI methods are more adaptable than the others. Spider is a full cross-domain benchmark, that is, the underlying databases for validation queries are not in the training data. Thus, NSP cannot support queries in Spider at all, whereas the SaI methods, SyntaxSQLNet, GNN, and IRNet, can, as explained in Section 4.2.2. However, SyntaxSQLNet showed lower performance than GNN and IRNet. The accuracy of table prediction of SyntaxSQLNet is 41.3%, which is much lower than 67.0% and 73.7% of GNN and IRNet. GNN and IRNet utilize the information of association between q_nl and S_D and design a schema encoder to handle various database schemas, whereas SyntaxSQLNet does not. That is, even among SaI methods, adaptability greatly varies depending on the way S_D is treated. The SQL log-based technique of Templar has no impact on Spider, since it cannot utilize the query fragment graph of the training SQL queries.

Robustness with small datasets: We observed that SyntaxSQLNet, IRNet and GNN are not trained properly on some benchmarks having small numbers of examples. In particular, in Scholar, IRNet was not properly trained, so we could not obtain any meaningful trained model. In IMDB, SyntaxSQLNet makes the same query, "SELECT T2.name FROM genre as T1 JOIN actor as T2 WHERE T1.genre = <value> OR T2.name = <value>", in 50% of test cases. We also observed that GNN was not properly trained in Scholar, MAS, IMDB, or YELP; regardless of the question, GNN produced an odd output such as "SELECT DISTINCT company.name FROM company WHERE company.id = company.id AND company.id AND company.id = <value>". This trend is evident when a benchmark has fewer training queries relative to the high variety of questions.

SQL construct coverage: SyntaxSQLNet, GNN, and IRNet support limited forms of q_sql for the following reasons. SyntaxSQLNet uses limited types of slots as explained in Section 4.2.2. The grammar-based decoders of GNN and IRNet can generate more general queries, but their grammars support a subset of the SQL syntax. SemQL of IRNet has a much smaller coverage than SQL. 43.9% of test queries in the twelve benchmarks cannot be supported by IRNet. Most seriously, 80.1% of test queries in ATIS, 47.1% of test queries in Advising (querysplit), and 54.5% of test queries are not supported by IRNet. The public implementation of GNN only supports the minimal syntax of SQL to support Spider; it does not support queries having OR conjunctions, parentheses, or join conditions in the WHERE clause. It also does not support LIMIT statements without ORDER BY or GROUP BY. Neither does it support the IS NULL operator. We have extended the grammar of GNN to support all of them. However, 37.4% of test queries in the twelve benchmarks are still unsupported due to the limitations of the internal data structure used by GNN. For example, it cannot handle self-joins, correlated nested queries, or arithmetic operations having more than two operands. IRNet has similar limitations, too.

As queries become more complex and diverse, there has been an increase in the cases where NSP generates invalid queries that cannot be executed. Table 10 shows the types of translation error made by NSP. Due to space limitation, we report the results on the four benchmarks with the most diverse error cases. 59.5% of these invalid queries are due to translating incorrect table names or column names. An SQL query that accesses a table that is not in the FROM clause cannot be executed. For example, the q_sql for question (G) in Table 8
Table 8: Translation examples. Mis-translated parts are colored in red, and absent parts are shown with strikethroughs.

(A) "Display the longest hospitalization period." (Patients)
    (Gold, NSP)       SELECT MAX(length of stay) FROM patients
    (NaLIR)           SELECT MAX length of stay FROM patients
    (GNN, IRNet)      SELECT MAX(age) FROM patients
    (SyntaxSQLNet)    SELECT MIN(age) FROM patients

(B) "What is the cumulation of durations of stay of inpatients where diagnosis is influenza?" (Patients)
    (Gold)            SELECT SUM(length of stay) FROM patients WHERE diagnosis = 'flu'
    (Coarse2Fine)     SELECT SUM(length of stay) FROM patients WHERE diagnosis = 'influenza'

(C) "Who ran in the year 1920, but did not win?" (WTQ)
    (Gold)            SELECT loser FROM T WHERE year = 1920
    (TypeSQL-C)       SELECT year FROM T WHERE winner = '1920'
    (PT-MAML)         SELECT 'number of votes winner' FROM T WHERE year = 1920
    (SyntaxSQLNet)    SELECT year FROM T WHERE year = <val> OR winner = <val>
    (IRNet)           SELECT year FROM T WHERE year = <val> INTERSECT SELECT year FROM T WHERE year ≠ <val>

(D) "Which Spanish tribunal was the only one to not have any autos da fe during this time period?" (WTQ)
    (Gold)            SELECT tribunal FROM T WHERE 'number of autos da fe' = 0
    (IRNet)           SELECT tribunal FROM T WHERE 'number of autos da fe' = <val>
    (GNN)             SELECT tribunal FROM T WHERE 'number of autos da fe' = 0
    (SyntaxSQLNet)    SELECT tribunal FROM T WHERE tribunal = <val>
    (TypeSQL-C)       SELECT 'executions in effigie' FROM T WHERE penanced = ''

(E) "Return me the area of PVLDB" (MAS)
    (Gold, NaLIR)     SELECT DISTINCT d.name FROM journal j, domain journal dj, domain d
                      WHERE j.jid = dj.jid AND dj.did = d.did AND j.name = 'PVLDB'
    (Templar)         SELECT DISTINCT p.title FROM journal j, publication p
                      WHERE j.jid = p.jid AND j.name = 'PVLDB'

(F) "How is the workload in EECS 751?" (Advising)
    (Gold)                  SELECT DISTINCT pc.workload FROM course c, program course pc
                            WHERE pc.course id = c.course id AND c.department = 'EECS' AND c.number = 751
    (GNN, querysplit)       SELECT DISTINCT name FROM program WHERE name = <val> AND name = <val>
    (GNN, questionsplit)    SELECT DISTINCT pc.workload FROM course c, program course pc
                            WHERE pc.course id = c.course id AND c.department = <val> AND name = <val>
    (NSP)                   SELECT DISTINCT pc.workload FROM course c, program course pc
                            WHERE pc.course id = c.course id AND c.department = 'EECS' AND c.number = 559

(G) "Number of papers in SIGIR conference" (Scholar)
    (Gold)            SELECT DISTINCT count(p.paperId) FROM paper p, venue v
                      WHERE p.venueId = v.venueId AND v.venueName = 'SIGIR'
    (NSP)             SELECT DISTINCT writes.authorId, count(p.paperId) FROM paper p, venue v
                      WHERE p.venueId = v.venueId AND v.venueName = 'SIGIR'

(H) "Who has been an opponent more often, Guam or Bangladesh?" (WTQ)
    (Gold)            SELECT opponents FROM T WHERE opponents = 'Guam' OR opponents = 'Bangladesh'
                      GROUP BY opponents ORDER BY COUNT(opponents) DESC LIMIT 1
    (IRNet)           SELECT venue FROM T WHERE opponents = <val> OR opponents = <val>
                      GROUP BY opponents ORDER BY competition DESC LIMIT 1
    (SyntaxSQLNet)    SELECT venue FROM T WHERE opponents LIKE <val> AND opponents LIKE <val>
                      GROUP BY opponents ORDER BY competition DESC LIMIT 1
Table 9: Accuracy on all test queries in various benchmarks (%). accsem is reported for NaLIR, Templar, and NSP; accval is reported for NSP, SyntaxSQLNet, GNN, and IRNet.

                          NaLIR    Templar  NSP               SyntaxSQLNet  GNN     IRNet
Benchmark                 accsem   accsem   accsem   accval   accval        accval  accval
ATIS                      0.0      0.0      70.25    76.06    0.67          5.13    8.71
Advising (querysplit)     0.0      0.0      0.27     5.4      0.38          1.75    0.16
Advising (questionsplit)  0.0      0.0      44.85    79.76    0.0           23.73   13.79
GeoQuery                  0.36     0.0      70.0     72.5     49.64         38.21   60.36
Scholar                   0.0      0.0      48.17    55.96    1.38          1.38    0.0
Patients                  0.0      0.0      69.72    69.72    10.09         53.21   46.79
Restaurant                0.0      0.0      63.75    63.75    0.0           0.0     66.25
MAS                       32.26    12.90    51.61    54.84    0.0           0.0     14.52
IMDB                      0.0      11.90    16.67    21.43    0.0           2.38    14.29
YELP                      4.88     9.76     0.0      9.76     0.0           0.0     7.32
Spider                    0.39     0.39     0.0      0.0      14.70         45.16   50.87
WTQ                       0.47     0.47     2.06     3.32     10.93         2.06    15.47

Generalizability to unseen examples: If a method performs well on unseen SQL query patterns, we say that the method has high generalizability. On Advising (querysplit), the accuracy of all DL-based methods is close to zero. The query-based split is difficult to handle, since all the SQL queries in the test data are ones that have never been seen during training [13]. For question (F) in Table 8, for example, GNN generates qsql completely differently from the correct translation on Advising (querysplit), while it translates correctly on Advising (questionsplit).
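The difference between the two splits can be sketched as follows (an illustration of the protocol described in [13], not code taken from that work; the template() normalization is a simplification):

    import random
    import re
    from collections import defaultdict

    def template(sql: str) -> str:
        # Normalize literals so that queries sharing a pattern map to the same template.
        return re.sub(r"'[^']*'|\b\d+\b", "<val>", sql.lower())

    def question_split(pairs, test_ratio=0.25, seed=0):
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        cut = int(len(pairs) * (1 - test_ratio))
        return pairs[:cut], pairs[cut:]          # test SQL patterns may also occur in training

    def query_split(pairs, test_ratio=0.25, seed=0):
        groups = defaultdict(list)
        for qnl, qsql in pairs:
            groups[template(qsql)].append((qnl, qsql))
        keys = sorted(groups)
        random.Random(seed).shuffle(keys)
        cut = int(len(keys) * (1 - test_ratio))
        train = [p for k in keys[:cut] for p in groups[k]]
        test = [p for k in keys[cut:] for p in groups[k]]
        return train, test                       # no test SQL pattern is ever seen in training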
NL complexity: Diversity of qsql typically increases the complexity of qnl. The number of possible expressions in qnl increases exponentially with the number of SQL operations in the corresponding qsql. For example, to properly translate question (H), one has to understand both 'more often' and 'Guam or Bangladesh.' No method, not even IRNet, which performed best on WTQ-s, understands such varied patterns, and all of them suffer from low accuracy. Their accuracy therefore decreases on WTQ compared to WTQ-s.
Constant value anonymization: On Advising (questionsplit), the difference between accsem and accval of NSP is 34.91% (79.76% - 44.85% in Table 9), which is significant. This is the percentage of test queries that NSP translates correctly except for their constant values. For example, NSP mis-translates question (F) in Table 8 by a single constant value, because the correct value '751' is not anonymized. This observation suggests two points: 1) the constant value anonymization technique of NSP is limited, and 2) there is a research opportunity to translate constant values precisely.
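The idea behind constant value anonymization is to replace literals that can be grounded in the question with placeholder slots that the model copies back at inference time. The sketch below illustrates only this general idea with naive string matching; NSP's actual procedure differs, and, as question (F) shows, a value the matcher misses stays un-anonymized and must be generated from scratch:

    import re

    def anonymize(question: str, sql: str):
        """Replace SQL literals that also appear verbatim in the question with slots."""
        mapping = {}
        literals = [m.group(1) or m.group(2)
                    for m in re.finditer(r"'([^']*)'|\b(\d+)\b", sql)]
        for i, value in enumerate(dict.fromkeys(literals)):   # de-duplicate, keep order
            if value and value in question:
                slot = f"<VAL{i}>"
                mapping[slot] = value
                question = question.replace(value, slot)
                sql = sql.replace(value, slot)
        return question, sql, mapping

    q = "How is the workload in EECS 751?"
    s = ("SELECT DISTINCT pc.workload FROM course c, program_course pc "
         "WHERE pc.course_id = c.course_id AND c.department = 'EECS' AND c.number = 751")
    print(anonymize(q, s))
    # Any literal the matcher fails to link to the question stays un-anonymized,
    # and the model must then generate it token by token, which is where errors creep in.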
Table 10: Error cases (NSP). For each benchmark we report, separately for single-table and multiple-table queries, the number of correct translations and the number of incorrect ones broken down by error type.

                          Single-table queries           Multiple-table queries
Benchmark                 correct  syn.err  others       correct  syn.err  wrong from/join  wrong nesting  others
ATIS                      46       0        15           268      43       3                53             19
Advising (questionsplit)  28       4        23           229      23       35               27             204
GeoQuery                  134      1        21           62       15       6                22             19
Scholar                   0        6        1            105      39       13               4              50
Handling multiple sentence questions: In order to further analyze how well these methods translate more complex and realistic queries, we conducted an additional experiment using the well-known benchmark TPC-H. TPC-H has a total of 22 queries that portray the activity of a product-supplying enterprise. Since TPC-H does not have enough queries to use for training, we train the DL-based methods using Spider and then test them on the TPC-H queries. In order for the DL-based methods to better understand the database schema of the TPC-H dataset, we slightly changed the format of table and column names by removing meaningless symbols and separating each word; for example, we modify the column name "L_EXTENDEDPRICE" to "extended price". We use the given explanation of each query, named Business Question in the official documentation, as qnl, and we rephrase them in a narrative format. Note that all qnl in TPC-H are multiple-sentence questions.
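A minimal sketch of this renaming (the keys are genuine TPC-H lineitem columns; the fallback rule is only illustrative, not the exact procedure we used):

    # Illustrative renaming of TPC-H lineitem columns before handing the schema
    # to the DL-based methods: prefixes and underscores are dropped and glued
    # words are separated.
    RENAME = {
        "L_EXTENDEDPRICE": "extended price",
        "L_SHIPDATE": "ship date",
        "L_DISCOUNT": "discount",
        "L_QUANTITY": "quantity",
    }

    def readable(column: str) -> str:
        return RENAME.get(column, column.split("_", 1)[-1].replace("_", " ").lower())

    print(readable("L_EXTENDEDPRICE"))   # 'extended price'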
Surprisingly, all methods show 0% accuracy on TPC-H. NaLIR relies on its manually-built mapping table and processes a single sentence only, so it cannot handle the long, complex qnl's in TPC-H. Seq2SQL, SQLNet, PT-MAML, TypeSQL, and Coarse2Fine do not support any of the questions, since there is no simple query in TPC-H. NSP does not work either, since the database (i.e., the test dataset, TPC-H) is different from the training dataset (Spider). Templar also does not benefit from using the given query log. SyntaxSQLNet, GNN, and IRNet claim to support various queries in cross-domain environments; however, they also fail to translate any query in TPC-H correctly. Table 11 shows the error cases of these three methods. IRNet cannot support any of the TPC-H queries due to the limited coverage of SemQL. SyntaxSQLNet and GNN can support two (Q2 and Q18) and three (Q2, Q6, and Q18) of the 22 queries, respectively, but they mistranslate all of them. For example, Q6 is "SELECT SUM(lineitem.extended price * lineitem.discount) FROM lineitem WHERE lineitem.ship date ≥ '1994-01-01' AND lineitem.ship date < '1994-02-01' AND lineitem.discount BETWEEN 0.05 AND 0.07 AND lineitem.quantity < 24", for which GNN answers "SELECT SUM(lineitem.quantity) FROM lineitem WHERE lineitem.discount BETWEEN <value> AND <value>", which is completely wrong. This experiment clearly illustrates the limitations of the latest deep-learning-based methods.
Table 11: Error cases on TPC-H (SyntaxSQLNet/GNN/IRNet).

Error cases                                                  # queries
Not supported queries
  Arithmetic operation (GNN)                                 11
  Arithmetic operation (IRNet, SyntaxSQLNet)                 12
  CASE-WHEN-THEN statement                                   3
  More than 2 values in IN statement                         3
  CREATE VIEW or nesting in FROM clauses                     2
  Function 'extract'                                         2
  SELECT count(*) as cnt ... ORDER BY cnt                    2
  Correlated nested query                                    1
  EXISTS statement                                           1
  Self join                                                  1
  More than four columns in SELECT clause (IRNet)            2
Wrong translation
  (GNN)                                                      3
  (SyntaxSQLNet)                                             2
Total                                                        22/22
7. INSIGHTS AND QUESTIONS FOR FUTURE RESEARCH

Table 12 shows a summarized comparison of the eleven NL2SQL methods. There is no consistent winner over all benchmarks, since every method focuses on problems of limited scope. Furthermore, the overall accuracy of each method degrades severely when qnl and/or qsql become more complex and diverse (0% on TPC-H). As shown in Table 12, many challenging issues remain in NL2SQL: handling unseen databases and qsql's, being robust on small training datasets, extending SQL construct coverage, and supporting various qnl including multiple-sentence questions.

Table 12: A comparison of the eleven methods (NaLIR, Templar, NSP, Seq2SQL, SQLNet, TypeSQL, PT-MAML, Coarse2Fine, SyntaxSQLNet, GNN, and IRNet), summarized by whether each method is poor (H), fair (-), or good (N) for each dimension.

Adaptability to new databases:       -  H  H  -  -  -  -  -  -  N  N
Robustness with small datasets:      -  N  -  -  -  -  -  -  -  -
SQL construct coverage:              -  -  -  H  H  H  H  H  -  -  -
Generalizability to unseen qsql:     H  H  H  H  H  H  H  H  H  H
Supporting linguistic diversity:     H  H  H  H  H  H  H  H  H  H  H
Handling long, multiple sentences:   H  H  H  H  H  H  H  H  H  H  H
For cross-domain adaptability, schema entities must be treated as input rather than as output vocabulary. On the other hand, in single-domain benchmarks, NSP outperforms the SaI methods in general. It would be an interesting research topic to develop a schema encoding technique that improves adaptability while still enabling high accuracy in a single domain.
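The contrast between the two designs can be sketched in a few lines (illustration only; the overlap score below merely stands in for a learned encoder, and the toy vocabulary and column names are drawn from the Patients and Advising examples in Table 8):

    # Schema-as-output: the decoder chooses among column ids fixed at training time,
    # so a column of an unseen database simply has no id and can never be produced.
    train_vocab = {"patients.age": 0, "patients.length_of_stay": 1}

    # Schema-as-input: the columns of the test-time database are part of the input,
    # and the decoder selects one of them (here scored by toy lexical overlap).
    question = "how is the workload in eecs 751".split()
    schema = ["course.department", "course.number", "program_course.workload"]

    def overlap(column: str, words) -> int:
        tokens = column.split(".")[-1].replace("_", " ").split()
        return sum(tok in words for tok in tokens)

    best = max(schema, key=lambda col: overlap(col, question))
    print(best)   # 'program_course.workload', even though it never occurred in training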
Aligning table/column references in qnl to SD can help improve accuracy. However, the existing techniques are still at a basic level and need to be significantly enhanced. We have shown that the alignment techniques of TypeSQL and GNN are effective only for some benchmarks. String matching, rather than matching based on word embedding vectors, is often helpful, especially for handling long and varied database entity names: real-world databases may contain many rare words that cannot be found in the pre-trained word embeddings. In this case, IRNet, which is based on string matching, shows a distinct benefit, as seen in our experimental results on WTQ-s. However, its accuracy values for column prediction (i.e., accsel and accwh,col) on WTQ-s are only about 5-60%, which is still low, and they become even worse on complex queries (i.e., WTQ). If there is no phrase in qnl that partially matches the corresponding schema entity name, as in question (C), IRNet fails to find the correct alignment. Developing better alignment techniques remains an open problem, and overcoming these limitations could be a promising direction for future research.
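A minimal sketch of string-match-based schema linking, and of the failure mode just mentioned (our illustration, not IRNet's actual algorithm; the question and columns follow example (C) in Table 8):

    def link(question: str, columns):
        """Link each column to the question by simple (partial) string matching."""
        q = question.lower().replace("?", "").replace(",", "")
        links = {}
        for col in columns:
            name = col.lower().replace("_", " ")
            hits = [w for w in name.split() if w in q.split()]
            if name in q:
                links[col] = name          # the full column name appears in the question
            elif hits:
                links[col] = " ".join(hits)  # otherwise keep any overlapping words
        return links

    cols = ["year", "winner", "loser", "number of votes winner"]
    print(link("Who ran in the year 1920, but did not win?", cols))
    # {'year': 'year'} -- the gold column 'loser' shares no word with the question,
    # so a purely lexical linker cannot recover it (nor 'winner', since only the
    # different surface form 'win' appears).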
Generating constant values in SQL queries is another challenging issue. Specifically, in the experiments on complex queries, all tested methods fail to generate constant values correctly in most cases, or even have no functionality for generating constant values at all. The constant value anonymization of NSP does not work well when there are many constant values in VD. This could also be an important direction for future research.

8. CONCLUSION

In this paper, we present a comprehensive survey of existing NL2SQL methods and perform thorough performance evaluations. We introduce a taxonomy of NL2SQL methods and classify the latest methods, and we fairly and empirically compare the state-of-the-art methods using many benchmarks. Specifically, we noticed a critical problem in previous experimental studies: they use misleading measures. We therefore measured the quality of the NL2SQL methods accurately by considering the semantic equivalence of SQL queries. We analyzed the experimental results in depth using various benchmarks, including our new one, and from those results and analysis we reported several important findings. We believe that this is the first work that thoroughly evaluates and analyzes the state-of-the-art NL2SQL methods on various benchmarks.

9. ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2017R1A2B3007116) and by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-01398, Development of a Conversational, Self-tuning DBMS).
10. REFERENCES

[1] J. Abreu, L. Fred, D. Macêdo, and C. Zanchettin. Hierarchical attentional hybrid neural networks for document classification. In ICANN, pages 396–402, 2019.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[3] C. Baik, H. V. Jagadish, and Y. Li. Bridging the semantic gap with SQL query logs in natural language interfaces to databases. In ICDE, pages 374–385, 2019.
[4] F. Basik, B. Hättasch, A. Ilkhechi, A. Usta, S. Ramaswamy, P. Utama, N. Weir, C. Binnig, and U. Çetintemel. DBPal: A learned NL-interface for databases. In SIGMOD, pages 1765–1768, 2018.
[5] Y. Bengio, P. Y. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5(2):157–166, 1994.
[6] B. Bogin, J. Berant, and M. Gardner. Representing schema structure with graph neural networks for text-to-SQL parsing. In ACL, pages 4560–4565, 2019.
[7] K. D. Bollacker, R. P. Cook, and P. Tufts. Freebase: A shared database of structured general human knowledge. In AAAI, pages 1962–1963, 2007.
[8] J. Castelein, M. F. Aniche, M. Soltani, A. Panichella, and A. van Deursen. Search-based test data generation for SQL queries. In ICSE, pages 1120–1230, 2018.
[9] J. Cheng, S. Reddy, V. A. Saraswat, and M. Lapata. Learning structured natural language representations for semantic parsing. In ACL, pages 44–55, 2017.
[10] S. Chu, B. Murphy, J. Roesch, A. Cheung, and D. Suciu. Axiomatic foundations and algorithms for deciding semantic equivalences of SQL queries. PVLDB, 11(11):1482–1495, 2018.
[11] S. Chu, C. Wang, K. Weitz, and A. Cheung. Cosette: An automated prover for SQL. In CIDR, 2017.
[12] D. A. Dahl, M. Bates, M. Brown, W. M. Fisher, K. Hunicke-Smith, D. S. Pallett, C. Pao, A. I. Rudnicky, and E. Shriberg. Expanding the scope of the ATIS task: The ATIS-3 corpus. In ARPA Human Language Technology Workshop, 1994.
[13] C. Finegan-Dollak, J. K. Kummerfeld, L. Zhang, K. Ramanathan, S. Sadasivam, R. Zhang, and D. R. Radev. Improving text-to-SQL evaluation methodology. In ACL, pages 351–360, 2018.
[14] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
[15] R. Frank. Phrase Structure Composition and Syntactic Dependencies, volume 38. MIT Press, 2004.
[16] J. Ganitkevitch, B. V. Durme, and C. Callison-Burch. PPDB: The paraphrase database. In HLT-NAACL, pages 758–764, 2013.
[17] A. Giordani and A. Moschitti. Translating questions to SQL queries with generative parsers discriminatively reranked. In COLING, pages 401–410, 2012.
[18] B. J. Grosz. TEAM: A transportable natural-language interface system. In ANLP, pages 39–45, 1983.
[19] Ç. Gülçehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio. Pointing the unknown words. In ACL, pages 140–149, 2016.
[20] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J. Lou, T. Liu, and D. Zhang. Towards complex text-to-SQL in cross-domain database with intermediate representation. In ACL, pages 4524–4535, 2019.
[21] P. Huang, C. Wang, R. Singh, W. Yih, and X. He. Natural language to structured query generation via meta-learning. In NAACL-HLT, pages 732–738, 2018.
[22] S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. Learning a neural semantic parser from user feedback. In ACL, pages 963–973, 2017.
[23] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition. Prentice Hall, 2009.
[24] M. Lapata and L. Dong. Coarse-to-fine decoding for neural semantic parsing. In ACL, pages 731–742, 2018.
[25] F. Li and H. V. Jagadish. Constructing an interactive natural language interface for relational databases. PVLDB, 8(1):73–84, 2014.
[26] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. Gated graph sequence neural networks. In ICLR, 2016.
[27] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. CoRR, abs/1703.03130, 2017.
[28] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[29] G. A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, 1995.
[30] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL, pages 1470–1480, 2015.
[31] A. Popescu, A. Armanasu, O. Etzioni, D. Ko, and A. Yates. Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In COLING, 2004.
[32] A. Popescu, O. Etzioni, and H. A. Kautz. Towards a theory of natural language interfaces to databases. In IUI, pages 149–157, 2003.
[33] P. J. Price. Evaluation of spoken language systems: The ATIS domain. In DARPA Speech and Natural Language Workshop, pages 91–95, 1990.
[34] M. Rabinovich, M. Stern, and D. Klein. Abstract syntax networks for code generation and semantic parsing. In ACL, pages 1139–1149, 2017.
[35] D. Saha, A. Floratou, K. Sankaranarayanan, U. F. Minhas, A. R. Mittal, and F. Özcan. ATHENA: An ontology-driven system for natural language querying over relational data stores. PVLDB, 9(12):1209–1220, 2016.
[36] L. R. Tang and R. J. Mooney. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In EMNLP, pages 133–141, 2000.
[37] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In NIPS, pages 2692–2700, 2015.
[38] C. Wang, M. Brockschmidt, and R. Singh. Pointing out SQL queries from text. Technical Report MSR-TR-2017-45, 2018.
[39] D. H. D. Warren and F. C. N. Pereira. An efficient easily adaptable system for interpreting natural language queries. Am. J. Comput. Linguistics, 8(3-4):110–122, 1982.
[40] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
[41] C. Xiao, M. Dymetman, and C. Gardent. Sequence-based structured prediction for semantic parsing. In ACL, 2016.
[42] X. Xu, C. Liu, and D. Song. SQLNet: Generating structured queries from natural language without reinforcement learning. CoRR, abs/1711.04436, 2017.
[43] N. Yaghmazadeh, Y. Wang, I. Dillig, and T. Dillig. SQLizer: Query synthesis from natural language. PACMPL, 1(OOPSLA):63:1–63:26, 2017.
[44] S. Yavuz, I. Gur, Y. Su, and X. Yan. DialSQL: Dialogue based structured query generation. In ACL, pages 1339–1349, 2018.
[45] P. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In ACL, pages 440–450, 2017.
[46] T. Yu, Z. Li, Z. Zhang, R. Zhang, and D. R. Radev. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In NAACL-HLT, pages 588–594, 2018.
[47] T. Yu, M. Yasunaga, K. Yang, R. Zhang, D. Wang, Z. Li, and D. R. Radev. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In EMNLP, pages 1653–1663, 2018.
[48] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP, pages 3911–3921, 2018.
[49] J. M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. In AAAI, pages 1050–1055, 1996.
[50] L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, pages 658–666, 2005.
[51] L. S. Zettlemoyer and M. Collins. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678–687, 2007.
[52] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2017.
[53] M. Zhou, G. Cao, T. Liu, N. Duan, D. Tang, B. Qin, X. Feng, J. Ji, and Y. Sun. Semantic parsing with syntax- and table-aware SQL generation. In ACL, pages 361–372, 2018.