A Comprehensive Evaluation of Neural SPARQL Query Generation From Natural Language Questions
ABSTRACT In recent years, the field of neural machine translation (NMT) for SPARQL query generation
has witnessed significant growth. Incorporating the copy mechanism with traditional encoder-decoder
architectures and using pre-trained encoder-decoder and large language models have set new performance
benchmarks. This paper presents various experiments that replicate and expand upon recent NMT-based
SPARQL generation studies, comparing pre-trained language models (PLMs), non-pre-trained language
models (NPLMs), and large language models (LLMs), highlighting the impact of question annotation and the
copy mechanism and testing various fine-tuning methods using LLMs. In particular, we provide a systematic
error analysis of the models and test their generalization ability. Our study demonstrates that the copy
mechanism yields significant performance enhancements for most PLMs and NPLMs. Annotating the data is
pivotal to generating correct URIs, with the ‘‘tag-within’’ strategy emerging as the most effective approach.
Additionally, our findings reveal that the primary source of errors stems from incorrect URIs in SPARQL
queries that are sometimes replaced with hallucinated URIs when using base models. This does not happen
using the copy mechanism, but it sometimes leads to selecting wrong URIs among candidates. Finally, the
performance of the tested LLMs fell short of achieving the desired outcomes.
INDEX TERMS SPARQL query generation, knowledge base, copy mechanism, non-pre-trained and pre-trained encoder-decoders.
the copy mechanism was proposed [1], which allows tokens from the input to be directly copied into the output based on a knowledge base vocabulary that includes KB URIs. However, copy-based models require the annotation of KB URIs in the natural language questions.

The recent development of pretrained language models and their application to SPARQL query generation has opened new potential avenues [2], [3], [4], [5], [6]. In fact, Lehmann et al. [7] suggested the use of controlled natural language as a target for Knowledge Graph Question Answering (KGQA) semantic parsing. They hypothesized that pretraining LLMs on textual data can facilitate parsing into controlled natural language for KGQA with limited training data requirements, reducing the cost and effort of collecting high-quality training data. SPARQL can be considered as such a controlled natural language.

Initial experiments on pretrained language models for SPARQL query generation indicate that while they may outperform their non-pretrained counterparts [8], they still exhibit some limitations in handling unknown URIs to some degree, but most importantly, they exhibit poor generalization abilities as their performance drops when new question templates are used at test time [3]. With this in mind, while most datasets rely on template-based questions, the ability of neural models trained on template questions to handle natural (formulated by humans, paraphrases) questions remains unexplored.

In this paper, our goal is to expand upon these experiments and provide a systematic comparison of several models. In particular, we aim to identify the failures of state-of-the-art neural query generators, in terms of SPARQL structures, incorrect URIs, and hallucinated URIs. We also test models' generalization capabilities with natural - non template-based - questions. Additionally, we include Large Language Models (LLMs) to assess whether they exhibit the same shortcomings. In this context, we address the following research questions:
1) Does the annotation of KB elements in the natural language questions improve the SPARQL query generation for all models?
2) Does the integration of a copy mechanism in NPLMs and PLMs improve the accuracy of the KB elements in the SPARQL queries?
3) Are Large Language Models (LLMs) effective in this task, and which fine-tuning/prompting technique performs better?
4) What are the most common generation errors, and what types of tokens are often generated instead of the expected types?
5) How do models trained on template-based questions perform on naturally reformulated questions?

Our contributions include the following aspects:
1) Using annotation, we compare the results of two NPLMs (ConvSeq2Seq and Transformers), two PLMs (BART [9] and T5 [10]) and two LLMs (Llama2 [11] and Code Llamav2 7B [12]);
2) We evaluate the impact of the copy mechanism and question annotations and experiment with ‘‘raw-question’’ (non-annotated) questions, ‘‘tag-within’’ questions, where we replace natural language elements with their KB URI counterparts, and ‘‘tag-end’’ questions, where we list KB URIs with their labels at the end of the questions;
3) We experiment with standard fine-tuning and instruction fine-tuning on two Large Language Models (LLMs), namely Llama [11] and Code Llama [12], and measure the impact of training data size on the results;
4) We perform a fine-grained analysis of the generation errors with their type distribution for all the models;
5) We test the generalization capabilities of the best-performing models using questions reformulated in different settings.

To our knowledge, such a detailed study has not been done yet, especially one including the performance of large language models, the generalization abilities of all models, and the fine-grained identification of errors made at generation time.

II. STATE OF THE ART
A. NON-PRE-TRAINED MODELS FOR SPARQL GENERATION
Using non-pre-trained encoder-decoders for SPARQL generation has gained a lot of attention in recent years. Reference [13] presented a method for translating NL statements to SPARQL expressions using an LSTM [14] encoder-decoder model. Reference [15] proposed an architecture called Neural SPARQL Machines (NSpM) for translating NL expressions to encoded forms of SPARQL queries. The framework involved generating train entries and feeding them to a sequence-to-sequence model. Reference [16] proposed two new encoder-decoder architectures, ConvSeq2Seq [17] and Transformer [18], on the Monument [15], LC-QuAD 1.0 [19] and DBNQA [20] datasets. They evaluated their models with the BLEU score [21], query accuracy, as well as the F1 score based on the tokens of the candidate and target translation. Reference [1] incorporated a neural copy mechanism on top of ConvSeq2Seq and Transformer architectures and used tagged versions of the questions that explicitly identify entity and relation URIs in the questions. The approach was evaluated on the same datasets used by [16], resulting in significant improvements in BLEU scores. They also computed the accuracy of answers obtained after running the generated queries against the knowledge base. The approach demonstrated robust performance.

B. PRE-TRAINED MODELS FOR SPARQL GENERATION
Building on the success of encoder-decoder models for SPARQL query generation and the recent development of pre-trained models for sequence-to-sequence problems, recent approaches have proposed the use of pre-trained
models for SPARQL query generation. Reference [5] proposed an algorithm for identifying entities and relations within a question and subsequently generating query structures with placeholders, which are then populated with URIs in a post-processing step. For generating query structures, they used BERT [22], GPT2 [23], and BART-large [9]. Reference [24] proposed a method for embedding questions using various sources of information, such as POS tagging, and feeding the representation to a GPT2-based model [23]. They annotated KB elements in the questions with special tokens depending on the models. These special tokens are replaced by the correct KB elements in a post-processing phase. Reference [8] used BART [9], T5-small [10], T5-base [10], and a non-pre-trained model based on the Pointer-Generator Network [25]. They added URIs and their labels at the end of the questions (aka ‘‘tag-end’’ annotation). The approach generated several queries using a beam search and ran each query in the order of the beam probabilities, keeping the first one that produced a non-empty answer. The models were evaluated on the LC-QuAD 1.0 [19] and LC-QuAD 2.0 [26] datasets, using the F1 score on the answers returned by the generated queries.

C. COPY MECHANISM
Previous approaches have explored the idea of transferring information from the question to the SPARQL query in different ways. Some approaches used placeholders [5] or cross-attention weights to map words from the question to the query [13]. Recently, explicit copy mechanisms inspired by CopyNet [27] and Pointer-Generator Networks (PGN) [25] have been incorporated into encoder-decoder architectures. These mechanisms allow the decoder to generate tokens based on probabilities derived from the decoder's logits or by copying tokens from the input. CopyNet [27] uses a learned weight matrix to calculate the probability of copying each word from the input. PGN [25] computes a copy probability and copy scores based on cross-attention weights, combining them with generation scores using the copy probability. Reference [8] directly used the PGN [25] architecture, whereas [1] proposed a modified version where the URIs are masked in the input sequence.

In this work, we plan to unify the evaluation of both NPLMs and PLMs under the same question annotation and the same copy / no copy settings. To our knowledge, none of the available approaches have incorporated a copy mechanism in architectures such as BART [9] or T5 [10].

D. LARGE PRETRAINED LANGUAGE MODELS
Finally, recent works have also explored Large Language Models (LLMs) for SPARQL query generation [28], [29], [30]. For instance, Muennighoff [28] proposes SGPT, an extension of SBERT that enhances GPT models for semantic search. Luo et al. [31] introduce ChatKBQA, a novel KBQA framework built on fine-tuned open-source LLMs, which combines generation and retrieval for improved performance on standard KBQA datasets. In particular, the approach generates logical forms before retrieving entities and relations, followed by a conversion to SPARQL queries. Kovriguina et al. [30] introduce SPARQLGEN, a generative approach for generating SPARQL queries using GPT-3 [32], leveraging various types of contexts in the prompt to influence query generation performance. Their approach is similar to ours for LPLMs, except that we use only a single prompt structure and use some of the latest LPLMs, Llama2 [11] and Code Llamav2 7B [12]. While all these approaches leverage large pre-trained language models, they do not focus on their performance in out-of-distribution settings and do not analyze their limitations in detail.

III. METHODOLOGY
A. TASK AND DATA FORMAT
Each dataset is a standard, publicly available dataset, composed of a set of question-query pairs called entries. The question is formulated in English and the target is a SPARQL query. All the datasets are generated automatically using templates. These templates match question-query structures with placeholders that are later filled with specific URIs. We call this matching pair a global template. Since some question structures can be associated with several query structures and vice versa, we also refer to question templates and query templates as the isolated question or query structures. For example, the global template Question: ‘‘what is the <1> of <2> ?’’ / Query: select distinct ?uri where {<2> <1> ?uri} generated the following Entry Question: ‘‘what is the office of richard coke ?’’ and its associated Entry Query: select distinct ?uri where { dbr:Richard_Coke dbp:office ?uri }.

1) ANNOTATIONS
To identify the impact of annotations on models' performance, we experiment with three schemes.

a: ‘‘RAW’’ QUESTIONS
Questions without annotation are designated as ‘‘raw-question’’ in our results.

b: ‘‘TAG-WITHIN’’ QUESTIONS
‘‘Tag-within’’ questions are generated by substituting the natural language words within the placeholders of the question template with the corresponding URIs and literals extracted from the placeholders of the query template. Having question and query templates for each dataset entry allows us to easily match the position of a token in the question to the position of the identifier of the knowledge base URI when performing this type of annotation. The following example illustrates the question and query templates and the resulting tagged version of the original raw question. Figure 1 describes the process of tagging a raw question.
For instance, the ‘‘tag-within’’ question associated to the example ‘‘which person has opponent ike clanton ?’’ is ‘‘who
2) PRE-TRAINED MODELS
To compare our results with [8], we used BART-base [9] and
T5-small [10] as pre-trained encoder-decoder architectures.
We also compare these architectures with the non-pre-trained
ones in the same experimental settings to show the impact
of pre-training. For SPARQL schema elements and URIs,
we followed the encoding described by [8]. Each SPARQL
schema element and each URI prefix is considered as a
special token. With T5, we used the sentinel tokens and with
BART, we added new tokens to represent these elements.
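To make the token handling above concrete, the following is a minimal sketch of the BART case using the Hugging Face transformers API; the token lists are illustrative assumptions and not the exact vocabulary used in the paper.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Illustrative subset of SPARQL schema elements and URI prefixes (assumption);
# the actual list depends on the dataset and knowledge base.
SPARQL_TOKENS = ["select", "distinct", "where", "filter", "order", "limit", "ask"]
URI_PREFIXES = ["dbr:", "dbo:", "dbp:", "wd:", "wdt:"]

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Register each schema element and prefix as a single, unsplittable token.
num_added = tokenizer.add_tokens(SPARQL_TOKENS + URI_PREFIXES, special_tokens=True)

# New embedding rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

For T5, the analogous step would instead map these elements to existing sentinel tokens rather than extending the vocabulary.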
structure generation, followed by a copy of the URIs at some specific positions. We extend the use of the copy mechanism [1], which is based on only copying elements from a KB vocabulary and masking these elements to the encoder and decoder blocks. Our extension includes literals in the KB vocabulary and the addition of the copy to the pre-trained models' architectures. This is further explained in the following sections.

1) VOCABULARIES
We define three distinct vocabularies:
• The English natural language vocabulary, denoted as W, composed of the questions' tokens.
• The SPARQL vocabulary, denoted as S, which includes SPARQL keywords.
• The KB vocabulary, denoted as K, which includes URIs (classes, properties, and resources) as well as literals.
Each entry of our datasets is composed of a question with tokens from W and a query with tokens from K ∪ S. Our annotated versions of the dataset tag the question elements with URIs from K that appear in the query. In the ‘‘tag-end’’ setting, the size of W might be slightly larger since the labels used next to the URIs are not necessarily the same as those used in the original question. That is because we use the URI's English label defined in the KB. By contrast, in the ‘‘tag-within’’ version, the size of W is massively reduced since the words used to describe specific URIs in the question are replaced by tokens from K.

2) DESCRIPTION
The copy mechanism computes, at each generation step, a probability of copy for the next token. This probability weights the generation score and the copy score of the next token. For each word of the target vocabulary, the generation score is based on the logits of the decoder. The copy score is based on the cross-attention weight between the next predicted token and each token of the input. We arbitrarily chose to use the last attention head for every model. Unlike PGN [25], this implementation of the copy mechanism is designed to only copy tokens from a specific vocabulary, here the KB vocabulary K. When the input sentence is provided to the encoder, the words from this vocabulary are masked. Then, the copy scores are only computed for these masked tokens.
More formally, at generation step i, the generation probability of a token in S ∪ K is computed as follows:

p_{t_i} = p_{copy} \times p_C + (1 - p_{copy}) \times p_G \quad (1)

where p_{t_i} = P(t_i \mid w_{0:m}; t_{0:i-1}), with t_i being the i-th token generated, t_{0:i-1} being the tokens previously generated, and w_{0:m} being the original input tokens before masking. p_{copy} is the probability of using copy instead of generation, p_C is the probability of copying a token, and p_G is the probability of generating a token according to the decoder. We can express p_G, p_C and p_{copy} as follows:

p_G = \begin{cases} \sigma_S(\mathrm{DEC}_i(t_i \mid \tilde{w}_{0:m}; t_{0:i-1})) & \forall t_i \in S \\ 0 & \forall t_i \notin S \end{cases} \quad (2)

p_C = \begin{cases} \sigma_{K \cap w_{0:m}}(A_{k,i}) & \forall t_i \in K \text{ s.t. } \exists k : w_k = t_i \\ 0 & \text{otherwise} \end{cases} \quad (3)

p_{copy} = \sigma(\mathrm{DEC}_i(t_i \mid \tilde{w}_{0:m}; t_{0:i-1}) \times B) \quad (4)

with \tilde{w}_{0:m} being the tokens given as input to the encoder where the words from K have been masked, \mathrm{DEC}_i being the logits of the decoder at timestep i, A_{k,i} the cross-attention weight computed by the encoder-decoder between the k-th term of the input and the i-th term of the predicted output, B \in \mathbb{R}^{|S| \times 1} a matrix with weights learned during training, \sigma(\cdot) the sigmoid function, and \sigma_X(\cdot) the softmax function computed over set X.
In the end, with this copy mechanism, the encoder vocabulary is W, the decoder vocabulary is S, and only tokens from K can be copied from the question to the query.

IV. EXPERIMENTS
In our experiments, we combined various parameters, including six different base model architectures, the addition or removal of the copy mechanism for pre-trained and non-pre-trained models, two types of question annotations, and data from three different datasets. We conducted eight experiments with Large Language Models (LLMs) using two different models and two fine-tuning strategies (standard fine-tuning and instruction fine-tuning) with varying sample proportions. For Pre-trained Language Models (PLMs) and Non-Pre-trained Language Models (NPLMs), we carried out 56 experiments, comprising ten adaptations of existing studies and 46 original results, which we will discuss in more detail in the following sections.

A. DATASETS
We experiment with three datasets: LC-QuAD 1.0, LC-QuAD 2.0 and DBNQA. Our primary motivation for these choices is to ensure comparative reproducibility and facilitate the extension of our experiments. Additionally, the substantial size of these datasets and the active community contributions to the underlying knowledge bases, particularly Wikidata (linked to the Wikipedia project), played a crucial role in our decision. These datasets are specifically designed for SPARQL query generation, ensuring high-quality question-query pairs across a wide range of domains and topics. Dataset statistics can be seen in Table 1 and Table 2. The vocabulary sizes for the three datasets are presented in Table 2. We also report the size of the out-of-vocabulary (OOV) tokens as it highlights the difficulties mentioned in the introduction. We can notice that the set S does not include OOV tokens, whereas the sets W and K often feature a significant amount of OOV tokens.
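Returning to the copy mechanism formalized in Equations (1)-(4) above, the following is a minimal, illustrative PyTorch sketch of how the generation and copy distributions can be combined. Tensor shapes, the use of a single attention head, and all names are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyHead(nn.Module):
    """Combines generation and copy distributions in the spirit of Eq. (1)-(4)."""

    def __init__(self, sparql_vocab_size: int):
        super().__init__()
        # Plays the role of B in Eq. (4): maps decoder logits over S to a copy gate.
        self.copy_gate = nn.Linear(sparql_vocab_size, 1)

    def forward(self, dec_logits, cross_attention, kb_mask):
        # dec_logits:      (batch, |S|) decoder logits over the SPARQL vocabulary
        # cross_attention: (batch, m)   cross-attention of the current step over input tokens
        # kb_mask:         (batch, m)   1 where the input token is a masked KB element
        # (assumes every input contains at least one KB token)
        p_gen = F.softmax(dec_logits, dim=-1)                          # Eq. (2)
        attn = cross_attention.masked_fill(kb_mask == 0, float("-inf"))
        p_copy_scores = F.softmax(attn, dim=-1)                        # Eq. (3), KB positions only
        p_copy = torch.sigmoid(self.copy_gate(dec_logits))             # Eq. (4)
        # Final distribution over [SPARQL vocabulary ; copiable input positions], Eq. (1).
        return torch.cat([(1 - p_copy) * p_gen, p_copy * p_copy_scores], dim=-1)
```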
DBNQA. For non-pre-trained experiments, we trained our models for 500, 150 and 50 epochs (respectively for LC-QuAD 1.0, LC-QuAD 2.0 and DBNQA). For pre-trained models, we used 200, 50 and 20 epochs. Each reported result is the mean of three different runs with different random seeds. We train the model for the specified number of epochs for each run and keep the model with the best validation loss for testing. The training is done with teacher forcing, i.e. the decoder is supposed to predict the next token of the gold standard query. For generation at test time, we use greedy decoding.

C. METRICS
To evaluate the performance of our models, we use three metrics. First, we report the BLEU score [21], which is a popular NMT metric that compares the predicted query to the gold standard query at the token level. We also report two question-answering metrics. The first one is the accuracy of the produced answer. Each predicted query is run against the KB and we compare the returned answer to the answer of the gold standard. This is the most relevant metric to compare to studies such as [8], even though we did not use the strategy of keeping the first non-empty answer amongst the top-n generated queries. Indeed, our models only generate one query. Finally, we compute the F1-score of the answers. We then average the F1-scores of each entry for each of the three runs to get the F1-score of the model on a given test set.

V. RESULTS
We report results rounded to integers to facilitate comparison between results within the same tables. The results for non-pre-trained models can be found in Table 3.

A. NON-PRE-TRAINED MODELS
1) REPRODUCTION RESULTS
Parts of our results reproduce existing studies. For models without and with copy, results have already been reported on LC-QuAD 1.0 and DBNQA without annotation [1], [16] and with ‘‘tag-within’’ questions [1]. We can observe that our results slightly improve those reported by [1], [16] for LC-QuAD 1.0 with Transformer and ConvSeq2Seq models with and without the copy mechanism. We also report similar results for DBNQA with ConvSeq2Seq with and without the copy mechanism.

2) ANNOTATION IMPACT
For almost all of our results, we can see that the question annotation improves the performance. Performances diminish a little only for Transformer on DBNQA and for ConvSeq2Seq on LC-QuAD 2.0 with ‘‘tag-within’’ questions compared to no annotation (‘‘raw-question’’). Except for these two counter-examples, we see a consistent improvement due to annotation.
In the copy-based architectures, we can note that Transformers reach good performances with ‘‘tag-within’’ questions and completely fail with ‘‘tag-end’’ questions. ConvSeq2Seq models reach good performances with both settings. When using the copy mechanism, we don't run experiments on ‘‘raw-question’’ questions because they don't include URIs to copy.

3) IMPACT OF THE COPY MECHANISM
The copy mechanism almost always has a huge positive impact on performance. The impact of copy is always more significant with ‘‘tag-within’’ questions than with ‘‘tag-end’’ questions. Finally, we can notice that the DBNQA dataset is the one that benefits the least from the copy mechanism. This is probably because the huge amount of data allows the non-pre-trained models to learn well enough without the copy mechanism. There is, however, still a significant jump in performance compared to non-copy models.

B. PRE-TRAINED MODELS
1) BASE MODELS
We compare BART and T5 and reproduce the results of [8] on top of additional experiments. We can note that T5 consistently outperforms BART, and that ‘‘tag-within’’ questions always imply similar or better performance than ‘‘tag-end’’ questions, as shown in Table 4. We can also see that for the ‘‘tag-end’’ questions without the copy mechanism, our results are better than or close to the results reported by [8] on LC-QuAD 1.0 and LC-QuAD 2.0. Our results for BART are much better than what they report, whereas we obtain a drop of 3 points for T5 on LC-QuAD 2.0. This might be due to the fact that we include the literals at the end of the questions, contrary to [8]. Finally, our generation strategy is different. We evaluate the greedy generation of our models, whereas [8] kept the top-10 beam-generated queries, ran each of them on the endpoint, and only evaluated the first one to return a non-empty answer.

2) COPY-BASED MODELS
We can note that BART benefits from the copy mechanism much more than T5. Except for LC-QuAD 2.0 with ‘‘tag-within’’ questions, the copy mechanism always allows a rise in performance for BART. The case of LC-QuAD 2.0 with ‘‘tag-within’’ questions coupled with copy-based models shows a specific difficulty that is discussed in Section VII. On the contrary, the copy mechanism only allows a slight rise in performance for T5 with the LC-QuAD 1.0 dataset. However, it considerably lowers the results with the ‘‘tag-end’’ questions on both the LC-QuAD 2.0 and DBNQA datasets.

3) ANNOTATION IMPACT
Overall, without the copy mechanism, we can notice that even though annotations helped improve performances for non-pre-trained models, the impact is much more noticeable for pre-trained models, which often reach much better
performance with ‘‘tag-within’’ questions compared to the ‘‘raw-question’’ setting. Notably, we can report that non-pre-trained models gain around 2.02% of F1 points on average going from no annotation (‘‘raw-question’’) to ‘‘tag-within’’ questions, whereas pre-trained models gain around 48.70% of F1 points on average.

C. LARGE LANGUAGE MODELS
We also conduct an empirical evaluation of four LLMs, namely Llama [11], Code Llama [12], Mistral 7B v0.3 [33] and Mistral 7B Instruct. Due to computational and time limitations, we only exploit the LC-QuAD 2.0 dataset in these experiments due to its higher difficulty level. Our evaluation uses different portions of the training data, specifically 25% and 50% of the training set, corresponding to 5,440 and 10,880 entries from the train set, respectively. When fine-tuning with instruction, we also try using 100% of the train set to get more insight into the models' behavior concerning the train set's size. This is because this fine-tuning method shows more sensitivity to the train set size than the standard fine-tuning. For Mistral's models, we just use 100% of the train set during fine-tuning.
We explore two distinct fine-tuning approaches. First, the standard fine-tuning method entails providing the model with the question as input and the corresponding SPARQL query as output. Second, we adopt the instruction fine-tuning method, which augments the input-output pair with an additional instruction to guide the model's task comprehension. To simulate the ‘‘tag-end’’ instruction prompt, our instruction explicitly specifies the URIs to be used for generating the query, as illustrated in Figure 2. In contrast, the standard fine-tuning approach does not incorporate such explicit instructions; instead, it utilizes inputs with tags (questions wherein URIs corresponding to knowledge base elements are appended at the end of the input sequence) or without tags to emulate ‘‘raw-question’’.
As shown in Tables 5 and 6, the performance of our Large Language Models (LLMs) falls short in comparison to our T5 and ConvSeq2Seq models. All the LLMs we experimented with exhibit suboptimal performance levels, with the best F1 scores being 13% and 23%, respectively, for Llama and Code Llama. For Mistral's models, the best F1 scores are 10% and 13%, respectively, for Mistral 7B and Mistral-Instruct 7B. These results collectively indicate that, despite the tagging of questions and the extent of data utilized for fine-tuning, these LLMs do not substantially enhance their proficiency in generating effective SPARQL queries. Our first hypothesis explaining the weak performance of LLMs is the limited amount of SPARQL query-related data in pre-training datasets. The second is that these models rely on their parametric memories to generate URIs, which precisely does not work and is actually our reason for using the copy mechanism in non-pretrained and pretrained language models to avoid errors in the generation of URIs and literals. Finally, it is worth noting that the instruction boosts the performance of these models as it helps them better understand the task and mitigates undesirable behaviors such as generating extra text.
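The exact prompt wording is not reproduced in this paper; the following is a minimal sketch of the kind of ‘‘tag-end’’-style instruction record used for instruction fine-tuning. Field names, the instruction wording, and the example URIs (taken from the DBpedia example in Section III-A) are illustrative assumptions.

```python
def build_instruction_example(question: str, uris: dict, target_query: str) -> dict:
    """Builds one 'tag-end'-style instruction fine-tuning record (illustrative format)."""
    uri_hints = ", ".join(f"{label} -> {uri}" for label, uri in uris.items())
    return {
        "instruction": "Translate the question into a SPARQL query over the knowledge base. "
                       f"Use the following URIs: {uri_hints}.",
        "input": question,
        "output": target_query,
    }

example = build_instruction_example(
    question="what is the office of richard coke ?",
    uris={"richard coke": "dbr:Richard_Coke", "office": "dbp:office"},
    target_query="select distinct ?uri where { dbr:Richard_Coke dbp:office ?uri }",
)
```

A standard fine-tuning record would keep only the input/output pair, with the URI hints either appended to the question (tag-end) or omitted (raw-question).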
TABLE 7. F1 Performance in % for different SOTA models.

‘‘<=’’, ‘‘>=’’, ‘‘+’’, ‘‘−’’, ‘‘*’’, ‘‘/’’, ‘‘str’’, ‘‘ucase’’, ‘‘lcase’’ and ‘‘concat’’.
• Literals (abbreviated as Lit): include faulty generations of a literal value. A literal is recognized as a string in double quotes or can be a numeric literal (ex 42), a boolean literal (true, false), or a date.
• Variables (abbreviated as Var): errors made on variables.
• Unknown (abbreviated as Unk): designates an error made on the generation of the token ‘‘<unk>’’ that represents all the OOV elements.

B. ERROR TYPE DISTRIBUTION
We align the reference query and the generated query and detect mismatches at the token level. Based on token types, our objective is to determine which type of token is generated instead of the expected type. This makes it possible to study exactly how models are wrong: for example, how often does a model generate a URI instead of a SPARQL keyword, and how often does it misplace a URI in triples (when there is more than one URI in the same query). We also measure if URIs in particular can be hallucinated, that is, how often a query includes fake URIs. Examples of errors are shown in Table 9.

2) ERROR DISTRIBUTION FOR PRETRAINED MODELS
We have roughly the same observations with the pre-trained models except that there are unknown-token errors in LC-QuAD 2.0, which has a much higher number of OOV tokens in the test set, as shown in Table 2. Indeed, on LC-QuAD 1.0, BART and T5, with or without the copy mechanism, make errors in the generation of URIs and SVocab. Copy-based models make wrong choices of URIs among the copy URI candidates. As previously with NPLMs, on LC-QuAD 2.0, some errors are made on variables due to the greater complexity of this dataset. From the point of view of the impact of the annotation, we see that without the copy, we have approximately the same rate of URI errors (41.7%) whatever the tagging method, and this rate increases by 12% in the ‘‘raw-question’’ data.
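A minimal sketch of the token-level alignment used for this error typing is given below. The categorization rules are simplified assumptions relative to the paper's full taxonomy (which also covers operators, dates, and the Fake-URI check against the KB).

```python
import re

def token_type(tok: str) -> str:
    """Coarse token categorization (simplified relative to the paper's taxonomy)."""
    if re.match(r"^(dbr|dbo|dbp|wd|wdt):", tok) or tok.startswith("http"):
        return "URI"
    if tok.startswith("?"):
        return "Var"
    if tok == "<unk>":
        return "Unk"
    if re.match(r'^(".*"|\d+|true|false)$', tok):
        return "Lit"
    return "SVocab"

def error_distribution(reference: str, generated: str) -> dict:
    """Counts, per expected token type, how many aligned positions were generated wrongly."""
    errors = {}
    for ref_tok, gen_tok in zip(reference.split(), generated.split()):
        if ref_tok != gen_tok:
            errors[token_type(ref_tok)] = errors.get(token_type(ref_tok), 0) + 1
    return errors
```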
3) ERROR DISTRIBUTION FOR LARGE PRETRAINED MODELS
As for the LLMs, the generation errors are essentially made very largely at the level of the URIs, then in a smaller percentage on Variables, and finally in a small percentage on the SVocab, regardless of the fine-tuning method used. At almost all positions where URIs are expected, LLMs generate incorrect URIs and occasionally predict URIs that do not exist. This hallucination is due to the use of these models' parametric memories during the generation of the query. In addition, the Variables and SVocab are mixed together in the query, i.e., at the position where a variable is expected, an SVocab is generated, and vice versa.

4) SUMMARY
The error distribution in query generation varies among different types of models but is mostly made on the URIs and SVocab levels. NPLMs, when generating queries for the LC-QuAD 1.0 and LC-QuAD 2.0 datasets, struggle with errors across various token types, including URIs, SVocab, and Unk. Copying helps maintain query structure but doesn't always select the correct URIs. PLMs have a similar pattern of errors, with fewer Unk tokens due to their larger vocabulary. These models also exhibit URI and SVocab generation errors, especially on LC-QuAD 1.0. With LLMs, most errors occur with URIs, followed by Variables and SVocab, and this hallucination is exacerbated by the absence of a copy mechanism. Variables and SVocab tokens are often interchanged during generation. In all cases, maintaining tagging methods is beneficial, and the ‘‘tag-within’’ strategy works best when using the copy mechanism.

C. GENERALIZATION CAPABILITIES OF THE BEST MODELS
For this part, we considered the best models and optimal configurations in two datasets and evaluated their generalization capacity. For both LC-QuAD 1.0 and LC-QuAD 2.0, the best models are T5 Base and ConvSeq2Seq Copy, with the ‘‘tag-within’’ setting. To test the generalization abilities of these models, we carried out the following three experiments:
1) We first trained these models with the original (template-based) questions and then tested them on the test set's reformulated questions. Our objective here is to test the ability of the models to handle natural questions when trained on template-based questions.
2) We trained these models with the original questions and then tested them on the train set's reformulated questions. Our objective here is to test the ability of the models to handle natural questions that are paraphrases of template-based questions encountered during the training phase.
3) We trained the models on the train set's reformulated questions and tested them on the test set's reformulated questions. Our objective here is to measure the performance of the models on natural, non-template-based questions.
The results of these experiments are shown in Table 10, Table 11 and Table 12. Since copying is performed on tagged questions, there aren't any results for the ConvSeq2Seq Copy models in the ‘‘raw-question’’ configuration. The ‘‘tag-within’’ method is not shown in these tables, as reformulations are exclusively applied to ‘‘tag-end’’ questions. This is because when we use reformulated questions, we can no longer leverage any template for tagging the question.
It is clear from Table 10 that query generation is more challenging, as the train questions' structures differ from those of the test set. Conversely, as shown in Table 11, the models exhibit approximately the same performance in query generation when reformulated questions are paraphrases of train questions. Nevertheless, a substantial decrease in model performance is observed when models trained on original questions are tested with reformulated questions. Furthermore, a significant decline is observed across all configurations for models trained and tested on reformulated questions, as shown in Table 12. This reaffirms the notion that templates constitute a significant component for achieving a robust alignment between a question and its corresponding SPARQL query. But this also highlights that current models are not yet ready to be used - as is - for natural questions. It is worth noting that T5 Base is less affected by the ‘‘noise’’ introduced with question reformulation compared to ConvSeq2Seq, owing to its pre-training, which endows it with language knowledge and enhanced adaptability to changes while preserving semantics (e.g., synonyms).
TABLE 10. Performances after training on original questions and testing on the test set reformulated questions.
TABLE 11. Performances after training on original questions and testing on the train set reformulated questions.
VII. DISCUSSION
Overall, the PLMs outperform the NPLMs and the LLMs in both LC-QuAD 1.0 and LC-QuAD 2.0, even though we obtained pretty good results with the ConvSeq2Seq model with copy. Counter-intuitively, we obtained low performance with the LLMs after trying various kinds of training (standard fine-tuning, instruction tuning) with different subsets of the train set. This low performance is partly justified by the small amount of data related to SPARQL in the datasets used to pretrain these models. For example, in Code Llama [12], most data relate to Python, C++, Java, PHP, TS, C#, and Bash. Given the high costs and suboptimal performance of LLMs, their use for this task is questionable in their current form. Alternative models should be developed that do not depend on parametric memories for URI generation.

A. THE IMPACT OF THE TAGGING AND COPY MECHANISMS
Question annotations always positively impact the performance of the generation for non-pretrained and pretrained language models. From Tables 14 - 17, we observe that in less complex datasets like LC-QuAD 1.0 and DBNQA, the results are similar across models, even though T5 seems to be consistently the top performer. With LC-QuAD 2.0, the PLMs outperformed the other models with a large margin. However, the pretraining in T5 does not solve the problem of identifying the correct URIs, which is why question annotation is important. However, there is a drop in performance when reformulated or raw questions are used at test time with models trained on tagged questions, due to lexical variations. Nonetheless, the generalization capabilities of these models are better than those trained on raw questions or without the copy mechanism.
LLMs obtained a lower performance for SPARQL query generation, even when pre-trained on code generation. The importance of question annotation for improved performance shows the necessity of having better question semantic representations adaptable to the knowledge base schema and resources. In future work, we plan to investigate how to learn objective functions that target annotation and generation at the same time.

TABLE 12. Performances after training on reformulated questions and testing on the test set reformulated questions.

1) DIFFICULTY OF THE ‘‘TAG-END’’ SETTING FOR THE COPY MECHANISM
We noticed that many models struggle with the ‘‘tag-end’’ questions when using the copy mechanism; this mostly occurs with some Transformer-based models, namely our non-pre-trained Transformer and T5. This is probably due to the fact
that placing KB elements in the query is conditioned by the adequate location of the corresponding URI in the question. In the ‘‘tag-within’’ setting, the cross-attention heads easily learn to map the position in the question to the right position in the query. However, with ‘‘tag-end’’ questions, the model has to associate the KB element to its label and then map this label to the corresponding natural language mention in the original question. These additional steps might be the source of the challenge observed with some Transformer models. We can suppose that the convolution operation is more suited to addressing these steps, maybe because of the proximity between the URI and its label at the end of the annotated questions. Additionally, BART also seems to overcome these difficulties. We can suggest that its pre-training tasks based on denoising might have led to better short-context associations.

2) LIMITATIONS OF THE COPY MECHANISM
The main limitation of this mechanism is that it mainly relies on the structure of the question-query pairs and on the positional mapping of the URIs between questions and queries. It is possible that these models only focus on the task of question template - query template mapping and placeholder filling. This might pose problems when we expose the model to a question-query pair that is generated by a template not associated to enough or any training entries, or, as we saw, to reformulated questions that do not follow a known template.
Another limitation of our copy-based architectures is that they are oblivious to the URIs' semantics. The encoder-decoder does not see these tokens since they are masked, and the copy mechanism only uses positions to copy the tokens. This might cause problems when the expected query is independent of the question structure. When we mask the URIs, the copy layer calculates the probability distributions of the elements to copy from a list of candidate URIs that occur in the query. This has the effect of causing errors at the level of the URIs because, as we have seen in the analysis of the errors, we have a substantial proportion of mistakes due to the wrong choice of URIs to copy among the candidates. For example, from Table 13, we can see that 27% of the incorrect queries generated by the BART model with copy are caused by a wrong choice among the candidate URIs. By considering the masking that comes with the copy, our hypothesis is that we increase the difficulty of electing a candidate URI for the copy. This masking also accentuates the tendency of the models to choose the most frequent query template for questions having the same question template. These limitations could be addressed by unmasking URIs in the case of pre-trained models. Yet we would be facing a ‘‘spelling’’ problem, that is, the problem that occurs when the tokenizer splits the URIs into fragments and does not properly merge them back at generation. Other types of copy mechanisms could be explored that would not require masking the URIs and would keep them as a single token without adding them to the model's vocabulary. This will be investigated in future work.

B. ERROR TYPES
1) COMPARISON OF MODELS IN TERMS OF PERFORMANCE AND DISTRIBUTION OF THE ERROR TYPES
Table 13 lists the overall performance of our different models for all configurations and the distribution of error types in percentage. To fully understand this table, the reader first needs to look at the metrics that give the overall performance (BLEU score, Accuracy, or F1 score) before looking at the distribution of errors, because a configuration can have a high error rate for a given error type while conserving a good performance. All values in the ‘‘Error Distribution’’ columns represent errors in query generation on the corresponding type. For example, in the first line of Table 13, we see that the BART model has a BLEU, Accuracy, and F1 of 84%, 72%, and 75%, respectively, and that 67% of incorrect queries are due to incorrect URIs, with 52% being URIs that exist in the knowledge base but are incorrectly placed in the SPARQL query and 15% being ‘‘Fake URIs.’’
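The answer-based metrics and the distinction between executable and non-executable queries both require running each generated query against the knowledge base endpoint. The following is a minimal sketch using the SPARQLWrapper library; the endpoint URL, error handling, and result flattening are illustrative assumptions (e.g., ASK queries and endpoint timeouts are not handled here).

```python
from SPARQLWrapper import SPARQLWrapper, JSON
from SPARQLWrapper.SPARQLExceptions import QueryBadFormed

def run_query(query: str, endpoint: str = "https://dbpedia.org/sparql"):
    """Returns (executable, answers); answers is a set of binding values, empty on failure."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    try:
        results = client.query().convert()
    except QueryBadFormed:
        # Syntactic error: the query cannot be parsed, hence is non-executable.
        return False, set()
    bindings = results.get("results", {}).get("bindings", [])
    answers = {v["value"] for row in bindings for v in row.values()}
    return True, answers
```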
Considering both LC-QuAD 1.0 and LC-QuAD 2.0 with the best annotation method, which is ‘‘tag-within,’’ we obtain an average of 28% of incorrect URIs and 4% of ‘‘Fake URIs’’ for the T5 model without the copy mechanism. Conversely, for the ConvSeq2Seq model, we have 27.5% of incorrect URIs and 0% of ‘‘Fake URIs.’’ Moreover, there is 1% more ‘‘Fake URIs’’ for the ‘‘tag-end’’ setting compared to the ‘‘tag-within’’ setting.
To go deeper into the analysis of the errors for the best models (T5 and ConvSeq2Seq Copy), we also include Tables 14 - 17, which give an analysis of the errors by showing which token types are generated in place of the expected ones. These analyses reveal that often, instead of the expected URIs, incorrect URIs are predicted by all models. While copy models go wrong by making wrong choices between candidates, base models occasionally generate URIs that do not exist, as we can see in Table 13. There are more ‘‘Fake URIs’’ with LC-QuAD 1.0 than LC-QuAD 2.0 because the morphology of URIs in DBpedia is more sensitive to errors. On the other hand, with Wikidata, an error can lead to another URI, which, even if not the expected one, exists in the knowledge base. Finally, we find a relatively high frequency of SPARQL vocabulary (SVocab) or Variables that are put in place of URIs.

2) SYNTACTIC AND NON-SYNTACTIC QUERIES
We also divide the error types based on how they impact query functionality, that is, into syntactic and non-syntactic errors. Syntactic errors involve issues like unbalanced parentheses, braces, or typos in keywords and lead to non-executable queries. Non-syntactic errors occur in queries that are syntactically correct but refer to non-existent identifiers in the knowledge base. For this latter type, the query is executable but leads to a wrong answer. We conducted a statistical analysis to quantify these errors, calculating the percentages of executable and non-executable queries. The results showed that for LC-QuAD 1.0 and LC-QuAD 2.0, 30.2% and 21.25% of the queries, respectively, were non-executable due to syntactic errors, while 69.8% and 78.75% had non-syntactic errors. This confirms that most errors stem from issues with knowledge base identifiers. Future research should focus on addressing these errors.

C. THE GENERALIZATION CAPABILITIES OF THE MODELS
As shown in Section VI-C, after conducting various experiments to test generalization capabilities, including training on original questions and testing on reformulated questions, results revealed that models face challenges when the question structure differs during testing, resulting in a critical decrease in performance. Models trained and tested on reformulated questions also exhibited lower performance, highlighting the importance of templates for aligning questions with SPARQL queries. Considering the best models for the experiments on original questions with the ‘‘tag-end’’ setting, the average F1 scores are 93% and 72% for T5 and ConvSeq2Seq Copy, respectively. On the other hand, for reformulated questions, the average F1 scores are 33% and 17.5%, respectively, for T5 and ConvSeq2Seq Copy. Thus, we observe an overall drop in performance of 60% for T5 and a drop of 54.5% for NPLMs compared to the results with original questions. Pre-trained Language Models are more resilient to question reformulation than Non-Pre-trained Language Models. These generalization results confirm the real-world representativeness of the chosen datasets. LC-QuAD 1.0 and LC-QuAD 2.0 effectively represent real-world scenarios for SPARQL query generation, covering a wide range of topics and complexities commonly encountered in knowledge graph querying, despite the limitation of having only one paraphrase per question. Future work should focus on generating multiple reformulations for each question to enhance robustness and performance under lexical variations.

D. OTHER CONSIDERATIONS
1) LC-QUAD 2.0
Most results on LC-QuAD 2.0 are low (compared to results on other datasets), in particular with non-pre-trained models. This is most probably because of the large number of unknown words in the test set (5,751 in LC-QuAD 2.0 versus 368 in LC-QuAD 1.0, as shown in Table 2) and the URI formulation in Wikidata, which is a set of numbers that cannot be mapped to known semantic structures, compared to DBpedia URIs, which use words. While LC-QuAD 1.0 and DBNQA have easily ‘‘understandable’’ URIs, such as ‘‘http://dbpedia.org/resource/Barack_Obama’’ for Barack Obama in LC-QuAD 1.0, the URIs in LC-QuAD 2.0 are challenging for the models due to their format. For instance, the entity Barack Obama has the URI ‘‘http://www.wikidata.org/entity/Q76’’ in Wikidata, which is a simple concatenation of letters and numbers, making it more difficult for the models to interpret. We also have, to a lesser extent, the presence of literals (for example, specific titles of movies). Additionally, LC-QuAD 2.0 includes much more difficult query structures, as suggested by the large SPARQL vocabulary size reported in Table 2. Another challenge is that some templates of LC-QuAD 2.0 share a question template. This means that for the same question template, we can have different query templates. Since our copy mechanism is based on masking the KB vocabulary, two sentences that share the same question template will be considered exactly the same sentence by the encoder-decoder block, and only the copy block will be able to ‘‘see’’ the words that differentiate them. Therefore, questions that share a question template are generated by the models following the most frequent query template associated with this question template. For instance, the question structure ‘‘what is the <mask> for <mask> of <mask>’’ can be generated by two global templates. The first template generates entries such as: Question: ‘‘what is the country for head of state of mahmoud abbas’’ / Query: select distinct ?sbj where { ?sbj wdt:P35 wd:Q127998. ?sbj wdt:P31 wd:Q6256 } and appears 1431 times in the train set. The
TABLE 13. Models Comparison in terms of performance (based on BLEU, Accuracy and F1-score) and error type distribution.
TABLE 14. Error distribution in % for T5 LCQ 1 ‘‘tag-within’’.
TABLE 19. Fine-Tuning Llama v2 7B with 50% of ‘‘tag-end’’ questions of the train set.
TABLE 23. Fine-Tuning Code Llamav2 7B with 50% of ‘‘tag-end’’ questions of the train set.
TABLE 26. Hyperparameters for our models.
We can also note that all our copy models with ‘‘tag-
within’’ questions have very similar performances, around
73% of accuracy. This cap in performance does not occur for
‘‘tag-end’’ questions. In the ‘‘tag-end’’ setting, the complete
NL question is passed to the models. Even though the KB
elements are masked, the semantics of the question can be
identified by its natural language formulation. We can see,
for instance, that BART manages to outperform the cap of
70% of accuracy with the copy mechanism on ‘‘tag-end’’
questions.
2) OTHER MODELS
In this study, we used NMT techniques based on encoders-decoders that are fully available to train and test.

copy block. We compared non-pre-trained and pre-trained models. In the case of pre-trained models (BART, T5), we are the first to evaluate adding a copy layer for this task. Given the lack of homogeneous evaluation metrics in the state of the art, we also compare three datasets using the BLEU score, the accuracy, and the F1-score computed on non-empty answers returned by the generated queries. We also show the impact of question annotation on non-pre-trained and pre-trained models.
Our results demonstrate that the copy mechanism improves the performances of non-pre-trained models by a significant margin, including in the ‘‘tag-within’’ setting. We also show that the copy mechanism can improve pre-trained models' performance in some cases (BART) and lower it in others (T5). We also make a detailed analysis of the errors made by all models. In copy models, errors in generating URIs are due to a wrong choice among hidden tokens (URIs) in the input. Finally, we note that even the best PLM and NPLM are not flexible to question reformulation and, thus, do not have adequate generalization capabilities.
Our future plans include leveraging paraphrases and appropriate methods to learn question representation, utilizing models with both parametric and non-parametric memories to prevent incorrect URI generation, and decomposing tasks into smaller steps with LLMs.

A. ERROR ANALYSIS FOR LLMS
The following tables show the detailed error analysis, type by type, for the Llama v2 7B and Code Llama v2 7B models and settings. All experiments are conducted over LC-QuAD 2.0.

B. TECHNICAL DETAILS OF EXPERIMENTS
1) PRE-PROCESSING AND SEGMENTATION
Non-pretrained models need a fixed vocabulary created from the training corpus. Our pre-processing transforms sentences into sequences of lowercase tokens, separated by spaces, and standardizes SPARQL terms. We convert questions and queries into indices in this vocabulary before model input. Annotated questions treat tagged terms (tokens between ≪ and ≫) as single tokens. Pretrained models use their own vocabulary and tokenizer, which divides words into subwords. To compare with non-pretrained models, we apply the same pre-processing and add spaces before punctuation. Post-processing adjusts segmenter outputs to ensure correct spacing. Following [40], we use special tokens for the SPARQL vocabulary: we use sentinel tokens for T5 and new tokens for BART. To reduce URI segmentation errors, we add URI prefixes to these tokens and rely on a post-processing step to recover correctly formatted queries, e.g. replace ‘‘wd:’’ by ‘‘<http://www.wikidata.org/entity/>’’ and ‘‘dbo:’’ by ‘‘<http://dbpedia.org/ontology/>’’. Inputs are batched for model processing, with the tokenizer converting sentences to token sequences, replaced by indices in the vocabulary. Padding tokens <pad> ensure uniform sequence lengths in each batch. Even if the generated and expected queries differ in word count, padding ensures equal token numbers.

2) ADAPTATION OF THE COPY MECHANISM TO MODELS
Non-pretrained models struggle with out-of-vocabulary words, particularly URIs from knowledge bases, leading to incorrect queries. Even when URIs appear during training, they are often rare, making them hard to generate correctly. The copy mechanism helps non-pretrained models address this issue. Pretrained models, though designed to transfer input to output, still face challenges with generating correct URIs, especially in Wikidata. Adapting the copy mechanism to pretrained models involves using the decoder's outputs and the input sentence before masking. The challenge is masking URIs in the question and query. This requires adding the necessary tokens to the pretrained model's tokenizer vocabulary, initializing new embeddings if needed, and incorporating the KB vocabulary without altering the model. All tokens with an index higher than the last index of the constructed vocabulary are masked, as they represent URIs or literals that should be copied directly from the input to the output.

3) SELECTION OF HYPERPARAMETERS
In our experiments, we did not focus on hyperparameter optimization but rather on comparing approaches. Therefore, to maintain the validity of comparisons, hyperparameters are based on the suggestions of state-of-the-art approaches [1], [8], [16] and the limits of the hardware resources available to us. For Transformer and ConvSeq2Seq architectures, the hyperparameters are inspired by [16]. For BART and T5 models, the hyperparameters follow the recommendations of [8]. Table 26 summarizes the main hyperparameters of each architecture. Only T5-small was used, as [8] reports better performance with this version on LC-QuAD 1.0 and similar performance between T5-small and T5-large on LC-QuAD 2.0. BART-base is used to compare results with those of [8].
Table 27 shows the number of epochs and batch sizes used. Adjustments were made to fit machine RAM, notably reducing both for DBNQA. Lowering the batch size for DBNQA with the Transformer architecture significantly impacted performance, so the batch size was maximized within hardware limits.

4) TRAINING AND VALIDATION
For each run, the model is trained for the indicated number of epochs, and the one with the best validation loss is kept for testing. The loss function used is cross-entropy. The training-validation-evaluation process is repeated three times with different random seeds. The reported results are the averages of the three outcomes. Training and validation are performed using the teacher forcing method since the expected query is known. The entire expected sequence is given to the decoder during the generation phase with a causal attention mask, ensuring each word is influenced only by previous words. Thus, the output will be the next predicted token for each step i given the correct sequence of tokens up to step i. This method significantly speeds up training as it requires only one pass through the decoder for each data point. Therefore, no query generation occurs during training or validation. It is worth noting that batch sampling was done differently depending on the experiments. For PLMs without copying, batches are randomly sampled independently, whereas for other experiments, training data is randomly partitioned to obtain batches (each data point is seen once per epoch). The first version is inspired by [8] and the second by [1]. The choice to use two different methods was made due to the superior results of the first method in the case of base PLMs.

5) GENERATION ON THE TEST SET
The goal of generation is to produce the sequence of tokens with the highest product of probabilities generated by the decoder. This is a combinatorial problem that cannot be solved in a reasonable time. Given maximum sequence lengths of l tokens and output vocabularies of about 10^4 tokens, we would need to evaluate the probabilities of approximately 10^(4l) sequences. We used a greedy search heuristic, meaning only one sequence is generated at a time, consisting of the most probable token at each generation step.

C. EXPERIMENTAL DIAGRAM
The following diagram summarizes our experiments. We generated two tagged versions of the raw questions from the datasets. Using these three formats (raw, tag-within, tag-end), we trained non-pretrained models, pre-trained models, and large language models (LLMs) for SPARQL query generation. For non-pretrained and pretrained models, we experiment with both base models and models with copy mechanisms. For LLMs, we used base models with standard and instruction fine-tuning. Finally, we tested the generalization capabilities of the models with reformulated questions.

FIGURE 3. Experimental diagram of our work.
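As a complement to the greedy search heuristic described in Section 5) above, the following is a minimal, illustrative decoding loop; it assumes a Hugging Face-style encoder-decoder whose forward pass exposes logits, and is not the authors' implementation.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, bos_id, eos_id, max_len=128):
    """Generates one sequence by taking the most probable token at each step."""
    decoder_ids = torch.tensor([[bos_id]], device=input_ids.device)
    for _ in range(max_len):
        logits = model(input_ids=input_ids, decoder_input_ids=decoder_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most probable token
        decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
        if next_token.item() == eos_id:
            break
    return decoder_ids
```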
D. EXAMPLES OF ERRORS MADE BY THE MODELS [15] T. Soru, E. Marx, D. Moussallem, G. Publio, A. Valdestilhas, D. Esteves,
DURING SPARQL QUERIES GENERATION and C. B. Neto, ‘‘SPARQL as a foreign language,’’ in Proc. Posters Demos
Track 13th Int. Conf. Semantic Syst. (EMANTiCS), vol. 2044, Amsterdam,
We randomly selected models and chose five error-containing The Netherlands, J. D. Fernández and S. Hellmann, Eds., Sep. 2017,
SPARQL queries. We considered different configurations: pp. 1–7. [Online]. Available: http://ceur-ws.org/Vol-2044/paper14/
base models, models with copy mechanism, and various [16] X. Yin, D. Gromann, and S. Rudolph, ‘‘Neural machine translating from
natural language to SPARQL,’’ Future Gener. Comput. Syst., vol. 117,
question versions (raw_question, tag_within, tag_end). The pp. 510–519, Apr. 2021, doi: 10.1016/j.future.2020.12.013.
examples are shown in Table 28 and 29. [17] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin,
ACKNOWLEDGMENT
The authors would like to express their gratitude to Compute Canada (Calcul Quebec) for providing computational resources.

REFERENCES
[1] R. Hirigoyen, A. Zouaq, and S. Reyd, ‘‘A copy mechanism for handling knowledge base elements in SPARQL neural machine translation,’’ in Proc. Findings Assoc. Comput. Linguistics (AACL-IJCNLP), Nov. 2022, pp. 226–236. [Online]. Available: https://aclanthology.org/2022.findings-aacl.22
[2] J.-H. Lin and E. J.-L. Lu, ‘‘SPARQL generation with an NMT-based approach,’’ J. Web Eng., vol. 21, pp. 1471–1490, Jul. 2022.
[3] P. A. K. K. Diallo, S. Reyd, and A. Zouaq, ‘‘A comprehensive evaluation of neural SPARQL query generation from natural language questions,’’ 2023, arXiv:2304.07772.
[4] H. Tran, L. Phan, J. Anibal, B. T. Nguyen, and T.-S. Nguyen, ‘‘SPBERT: An efficient pre-training BERT on SPARQL queries for question answering over knowledge graphs,’’ in Proc. 28th Int. Conf. Neural Inf. Process. (ICONIP), Bali, Indonesia. Cham, Switzerland: Springer, Dec. 2021, pp. 512–523.
[5] X. Huang, J.-J. Kim, and B. Zou, ‘‘Unseen entity handling in complex question answering over knowledge base via language generation,’’ in Proc. Findings Assoc. Comput. Linguistics (EMNLP), Nov. 2021, pp. 547–557. [Online]. Available: https://aclanthology.org/2021.findings-emnlp.50
[6] B. B. Naik, T. J. V. R. Reddy, K. R. V. Karthik, and P. Kuila, ‘‘An SQL query generator for cross-domain human language based questions based on NLP model,’’ Multimedia Tools Appl., vol. 83, no. 4, pp. 11861–11884, Jan. 2024.
[7] J. Lehmann, P. Gattogi, D. Bhandiwad, S. Ferré, and S. Vahdati, ‘‘Language models as controlled natural language semantic parsers for knowledge graph question answering,’’ in Proc. Eur. Conf. Artif. Intell. (ECAI), vol. 372, 2023, pp. 1348–1356.
[8] D. Banerjee, P. A. Nair, J. N. Kaur, R. Usbeck, and C. Biemann, ‘‘Modern baselines for SPARQL semantic parsing,’’ in Proc. 45th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Madrid, Spain, E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, and G. Kazai, Eds., Jul. 2022, pp. 2260–2265, doi: 10.1145/3477495.3531841.
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, ‘‘BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,’’ in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds., Jul. 2020, pp. 7871–7880, doi: 10.18653/v1/2020.acl-main.703.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, ‘‘Exploring the limits of transfer learning with a unified text-to-text transformer,’’ J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-074.html
[11] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, ‘‘LLaMA: Open and efficient foundation language models,’’ 2023, arXiv:2302.13971.
[12] B. Rozière et al., ‘‘Code Llama: Open foundation models for code,’’ 2023, arXiv:2308.12950.
[13] F. F. Luz and M. Finger, ‘‘Semantic parsing natural language into SPARQL: Improving target language representation with neural attention,’’ 2018, arXiv:1803.04329.
[14] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[15] T. Soru, E. Marx, D. Moussallem, G. Publio, A. Valdestilhas, D. Esteves, and C. B. Neto, ‘‘SPARQL as a foreign language,’’ in Proc. Posters Demos Track 13th Int. Conf. Semantic Syst. (SEMANTiCS), vol. 2044, Amsterdam, The Netherlands, J. D. Fernández and S. Hellmann, Eds., Sep. 2017, pp. 1–7. [Online]. Available: http://ceur-ws.org/Vol-2044/paper14/
[16] X. Yin, D. Gromann, and S. Rudolph, ‘‘Neural machine translating from natural language to SPARQL,’’ Future Gener. Comput. Syst., vol. 117, pp. 510–519, Apr. 2021, doi: 10.1016/j.future.2020.12.013.
[17] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, ‘‘Convolutional sequence to sequence learning,’’ in Proc. 34th Int. Conf. Mach. Learn., in Proceedings of Machine Learning Research, vol. 70, D. Precup and Y. W. Teh, Eds., Aug. 2017, pp. 1243–1252. [Online]. Available: http://proceedings.mlr.press/v70/gehring17a.html
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., Annu. Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, Dec. 2017, pp. 5998–6008. [Online]. Available: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[19] P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann, ‘‘LC-QuAD: A corpus for complex question answering over knowledge graphs,’’ in Proc. Int. Semantic Web Conf., in Lecture Notes in Computer Science, Vienna, Austria, C. d’Amato, M. Fernández, V. A. M. Tamma, F. Lécué, P. Cudré-Mauroux, J. F. Sequeda, C. Lange, and J. Heflin, Eds., Cham, Switzerland: Springer, 2017, pp. 210–218, doi: 10.1007/978-3-319-68204-4_22.
[20] A.-K. Hartmann, T. Soru, and E. Marx. (Apr. 2018). Generating a Large Dataset for Neural Question Answering Over the DBpedia Knowledge Base. [Online]. Available: https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘‘BLEU: A method for automatic evaluation of machine translation,’’ in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040/
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Minneapolis, MN, USA, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, ‘‘Language models are unsupervised multitask learners,’’ OpenAI Blog, vol. 1, no. 8, pp. 2–6, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
[24] M. R. A. H. Rony, U. Kumar, R. Teucher, L. Kovriguina, and J. Lehmann, ‘‘SGPT: A generative approach for SPARQL query generation from natural language questions,’’ IEEE Access, vol. 10, pp. 70712–70723, 2022, doi: 10.1109/ACCESS.2022.3188714.
[25] A. See, P. J. Liu, and C. D. Manning, ‘‘Get to the point: Summarization with pointer-generator networks,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, Jul. 2017, pp. 1073–1083, doi: 10.18653/v1/p17-1099.
[26] M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann, ‘‘LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia,’’ in Proc. 18th Int. Semantic Web Conf. (ISWC), in Lecture Notes in Computer Science, vol. 11779, Auckland, New Zealand, C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. F. Cruz, A. Hogan, J. Song, M. Lefrançois, and F. Gandon, Eds., Cham, Switzerland: Springer, Oct. 2019, pp. 69–78, doi: 10.1007/978-3-030-30796-7_5.
[27] J. Gu, Z. Lu, H. Li, and V. O. K. Li, ‘‘Incorporating copying mechanism in sequence-to-sequence learning,’’ in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, Berlin, Germany, Aug. 2016, pp. 1631–1640, doi: 10.18653/v1/p16-1154.
[28] N. Muennighoff, ‘‘SGPT: GPT sentence embeddings for semantic search,’’ 2022, arXiv:2202.08904.
[29] S. Yang, M. Teng, X. Dong, and F. Bo, ‘‘LLM-based SPARQL generation with selected schema from large scale knowledge base,’’ in Proc. China Conf. Knowl. Graph Semantic Comput. Cham, Switzerland: Springer, 2023, pp. 304–316.
[30] L. Kovriguina, R. Teucher, D. Radyush, and D. Mouromtsev, ‘‘SPARQLGEN: One-shot prompt-based approach for SPARQL query generation,’’ in Proc. Int. Conf. Semantic Syst., 2023, pp. 3–4. [Online]. Available: https://api.semanticscholar.org/CorpusID:265309659
[31] H. Luo, H. E, Z. Tang, S. Peng, Y. Guo, W. Zhang, C. Ma, G. Dong, M. Song, W. Lin, Y. Zhu, and L. A. Tuan, ‘‘ChatKBQA: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models,’’ 2023, arXiv:2310.08975.
[32] T. B. Brown et al., ‘‘Language models are few-shot learners,’’ in Proc. Adv. Neural Inf. Process. Syst., Annu. Conf. Neural Inf. Process. Syst. (NeurIPS), H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds. Red Hook, NY, USA: Curran Associates, Dec. 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[33] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, ‘‘Mistral 7B,’’ 2023, arXiv:2310.06825.
[34] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, ‘‘Judging LLM-as-a-judge with MT-bench and chatbot arena,’’ 2023, arXiv:2306.05685.
[35] Y. Zhou, X. Geng, T. Shen, W. Zhang, and D. Jiang, ‘‘Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics Human Lang. Technol., 2021, pp. 5822–5834.
[36] S. Purkayastha, S. Dana, D. Garg, D. Khandelwal, and G. P. S. Bhargav, ‘‘A deep neural approach to KGQA via SPARQL silhouette generation,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2022, pp. 1–8.
[37] J. Ding, W. Hu, Q. Xu, and Y. Qu, ‘‘Leveraging frequent query substructures to generate formal queries for complex question answering,’’ 2019, arXiv:1908.11053.
[38] M. Chen et al., ‘‘Evaluating large language models trained on code,’’ 2021, arXiv:2107.03374.
[39] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, ‘‘Training language models to follow instructions with human feedback,’’ 2022, arXiv:2203.02155.
[40] B. M. Lake, ‘‘Compositional generalization through meta sequence-to-sequence learning,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–11.

PAPA ABDOU KARIM KAROU DIALLO received the master's degree from the School of Control and Computer Engineering, North China Electric Power University, Beijing, in 2021. He is currently pursuing the Ph.D. degree with Polytechnique Montreal. His current research interests include neural machine translation, question answering, language models, knowledge bases, information retrieval, and the semantic web.

SAMUEL REYD received the master's degree from Polytechnique Montreal, where he studied SPARQL query generation with neural machine translation methods. He is currently pursuing the Ph.D. degree with Telecom Paris. He focused on the comparison of state-of-the-art approaches, including neural architectures and annotation methods, the development of a copy mechanism for pre-trained models, and the study of the generalization capabilities of these techniques. His research interests include cognitive explainability methods for intricate cyber-physical systems.

AMAL ZOUAQ is a Full Professor at Polytechnique Montreal and an Associate Academic Member at MILA (Quebec Artificial Intelligence Institute). She holds an FRQS (Dual) Chair in AI and Digital Health. She is also an IVADO Professor, a member of the CLIQ-AI consortium (Computational Linguistics in Québec), and an Adjunct Professor at the University of Ottawa. Her research interests include artificial intelligence, natural language processing, and the semantic web. She is the Director of the LAMA-WeST Research Laboratory,5 which focuses on challenges related to representation learning, natural language interfaces and question answering, automated reasoning, knowledge base learning and alignment, ontology learning and modeling, and information extraction and generation.

5 http://www.labowest.ca/