
Received 25 July 2024, accepted 20 August 2024, date of publication 2 September 2024, date of current version 16 September 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3453215

A Comprehensive Evaluation of Neural SPARQL Query Generation From Natural Language Questions
PAPA ABDOU KARIM KAROU DIALLO 1, SAMUEL REYD 2, AND AMAL ZOUAQ 1
1 LAMA-WeST Laboratory, Department of Computer Engineering and Software Engineering, Polytechnique Montreal, Montreal, QC H3T 1J4, Canada
2 Telecom Paris, 91120 Palaiseau, France

Corresponding author: Papa Abdou Karim Karou Diallo ([email protected])


This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant Program.

ABSTRACT In recent years, the field of neural machine translation (NMT) for SPARQL query generation
has witnessed significant growth. Incorporating the copy mechanism with traditional encoder-decoder
architectures and using pre-trained encoder-decoder and large language models have set new performance
benchmarks. This paper presents various experiments that replicate and expand upon recent NMT-based
SPARQL generation studies, comparing pre-trained language models (PLMs), non-pre-trained language
models (NPLMs), and large language models (LLMs), highlighting the impact of question annotation and the
copy mechanism and testing various fine-tuning methods using LLMs. In particular, we provide a systematic
error analysis of the models and test their generalization ability. Our study demonstrates that the copy
mechanism yields significant performance enhancements for most PLMs and NPLMs. Annotating the data is
pivotal to generating correct URIs, with the ‘‘tag-within’’ strategy emerging as the most effective approach.
Additionally, our findings reveal that the primary source of errors stems from incorrect URIs in SPARQL
queries that are sometimes replaced with hallucinated URIs when using base models. This does not happen
using the copy mechanism, but it sometimes leads to selecting wrong URIs among candidates. Finally, the
performance of the tested LLMs fell short of achieving the desired outcomes.

INDEX TERMS SPARQL query generation, knowledge base, copy mechanism, non-pre-trained and pre-trained encoder-decoders.

The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Crippa.
2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
The Semantic Web (SW) provides a framework for storing and organizing structured data, following standards defined by the World Wide Web Consortium (W3C). To access data within a Knowledge Base (KB) on the SW, one must use the SPARQL query language, which can be challenging for non-experts and necessitates a familiarity with the KB structure and ontology. For greater usability of the SW, it is crucial to develop models that allow users to query KBs easily using natural language.
One approach to generating queries based on natural language questions is to use neural machine translation (NMT). This is typically performed using an encoder-decoder architecture, where input sentences pass through the encoder to generate a vector that holds their semantics, and the decoder produces tokens at each time step based on the encoder's output and previous tokens. However, classic (without pre-training) NMT architectures have common limitations. First, they require a fixed vocabulary, which cannot be updated without restarting the entire learning process. Therefore, if a new word appears during the testing phase, it is processed as an unknown (<unk>) token. Second, it is not trivial for an NMT model to learn to transform natural language into KB elements (URIs), particularly since most elements are seen very few times during training. These limitations are particularly important in the case of NMT-based SPARQL query generation, as unknown words are often elements from the KB, which can lead to irrelevant URIs in the generated queries. To address these limitations,
the copy mechanism was proposed [1], which allows tokens from the input to be directly copied into the output based on a knowledge base vocabulary that includes KB URIs. However, copy-based models require the annotation of KB URIs in the natural language questions.
The recent development of pretrained language models and their application to SPARQL query generation has opened new potential avenues [2], [3], [4], [5], [6]. In fact, Lehmann et al. [7] suggested the use of controlled natural language as a target for Knowledge Graph Question Answering (KGQA) semantic parsing. They hypothesized that pretraining LLMs on textual data can facilitate parsing into controlled natural language for KGQA with limited training data requirements, reducing the cost and effort of collecting high-quality training data. SPARQL can be considered as such a controlled natural language.
Initial experiments on pretrained language models for SPARQL query generation indicate that while they may outperform their non-pretrained counterparts [8], they still exhibit some limitations in handling unknown URIs to some degree, but most importantly, they exhibit poor generalization abilities as their performance drops when new question templates are used at test time [3]. With this in mind, while most datasets rely on template-based questions, the ability of neural models trained on template questions to handle natural (formulated by humans, paraphrases) questions remains unexplored.
In this paper, our goal is to expand upon these experiments and provide a systematic comparison of several models. In particular, we aim to identify the failures of state-of-the-art neural query generators, in terms of SPARQL structures, incorrect URIs, and hallucinated URIs. We also test models' generalization capabilities with natural (non-template-based) questions. Additionally, we include Large Language Models (LLMs) to assess whether they exhibit the same shortcomings. In this context, we address the following research questions:
1) Does the annotation of KB elements in the natural language questions improve the SPARQL query generation for all models?
2) Does the integration of a copy mechanism in NPLMs and PLMs improve the accuracy of the KB elements in the SPARQL queries?
3) Are Large Language Models (LLMs) effective in this task, and which fine-tuning/prompting technique performs better?
4) What are the most common generation errors, and what types of tokens are often generated instead of the expected types?
5) How do models trained on template-based questions perform on naturally reformulated questions?
Our contributions include the following aspects:
1) Using annotation, we compare the results of two NPLMs (ConvSeq2Seq and Transformers), two PLMs (BART [9] and T5 [10]) and two LLMs (Llama2 [11] and Code Llamav2 7B [12]);
2) We evaluate the impact of the copy mechanism and question annotations and experiment with ''raw-question'' (non-annotated) questions, ''tag-within'' questions, where we replace natural language elements with their KB URI counterparts, and ''tag-end'' questions, where we list KB URIs with their labels at the end of the questions;
3) We experiment with standard fine-tuning and instruction fine-tuning on two Large Language Models (LLMs), namely Llama [11] and Code Llama [12], and measure the impact of training data size on the results;
4) We perform a fine-grained analysis of the generation errors with their type distribution for all the models;
5) We test the generalization capabilities of the best-performing models using questions reformulated in different settings.
To our knowledge, such a detailed study has not been done yet, especially one including the performance of large language models, the generalization abilities of all models, and the fine-grained identification of errors made at generation time.

II. STATE OF THE ART
A. NON-PRE-TRAINED MODELS FOR SPARQL GENERATION
Using non-pre-trained encoder-decoders for SPARQL generation has gained a lot of attention in recent years. Reference [13] presented a method for translating NL statements to SPARQL expressions using an LSTM [14] encoder-decoder model. Reference [15] proposed an architecture called Neural SPARQL Machines (NSpM), for translating NL expressions to encoded forms of SPARQL queries. The framework involved generating train entries and feeding them to a sequence-to-sequence model. Reference [16] proposed two new encoder-decoder architectures, ConvSeq2Seq [17] and Transformer [18], on the Monument [15], LC-QuAD 1.0 [19] and DBNQA [20] datasets. They evaluated their models with BLEU score [21], query accuracy as well as the F1 score based on the tokens of the candidate and target translation. Reference [1] incorporated a neural copy mechanism on top of ConvSeq2Seq and Transformer architectures and used tagged versions of the questions that identify explicitly entities and relations URIs in the questions. The approach was evaluated on the same datasets used by [16], resulting in significant improvements in BLEU scores. They also computed the accuracy of answers obtained after running the generated queries against the knowledge base. The approach demonstrated robust performance.

B. PRE-TRAINED MODELS FOR SPARQL GENERATION
Building on the success of encoder-decoder models for SPARQL query generation and the recent development of pre-trained models for sequence-to-sequence problems, recent approaches have proposed the use of pre-trained
models for SPARQL query generation. Reference [5] proposed an algorithm for identifying entities and relations within a question and subsequently generating query structures with placeholders, which are then populated with URIs in a post-processing step. For generating query structures, they used BERT [22], GPT2 [23], and BART-large [9]. Reference [24] proposed a method for embedding questions using various sources of information, such as POS tagging, and feeding the representation to a GPT2-based model [23]. They annotated KB elements in the questions with special tokens depending on the models. These special tokens are replaced by the correct KB elements in a post-processing phase. Reference [8] used BART [9], T5-small [10], T5-base [10], and a non-pre-trained model based on Pointer Generation Network [25]. They added URIs and their labels at the end of the questions (aka ''tag-end'' annotation). The approach generated several queries using a beam search and ran each query in the order of the beam probabilities, keeping the first one that produced a non-empty answer. The models were evaluated on the LC-QuAD 1.0 [19] and LC-QuAD 2.0 [26] datasets, using the F1 score on the answers returned by the generated queries.

C. COPY MECHANISM
Previous approaches have explored the idea of transferring information from the question to the SPARQL query in different ways. Some approaches used placeholders [5] or cross-attention weights to map words from the question to the query [13]. Recently, explicit copy mechanisms inspired by CopyNet [27] and Pointer-Generator Networks (PGN) [25] have been incorporated into encoder-decoder architectures. These mechanisms allow the decoder to generate tokens based on probabilities derived from the decoder's logits or by copying tokens from the input. CopyNet [27] uses a learned weight matrix to calculate the probability of copying each word from the input. PGN [25] computes a copy probability and copy scores based on cross-attention weights, combining them with generation scores using the copy probability. Reference [8] directly used the PGN [25] architecture, whereas [1] proposed a modified version where the URIs are masked in the input sequence.
In this work, we plan to unify the evaluation of both NPLMs and PLMs under the same question annotation and the same copy / no copy settings. To our knowledge, none of the available approaches have incorporated a copy mechanism in architectures such as BART [9] or T5 [10].

D. LARGE PRETRAINED LANGUAGE MODELS
Finally, recent works have also explored Large Language Models (LLMs) for SPARQL query generation [28], [29], [30]. For instance, Muennighoff [28] proposes SGPT, an extension of SBERT that enhances GPT models for semantic search. Luo et al. [31] introduce ChatKBQA, a novel KBQA framework built on fine-tuned open-source LLMs, which combines generation and retrieval for improved performance on standard KBQA datasets. In particular, the approach generates logical forms before retrieving entities and relations, followed by a conversion to SPARQL queries. Kovriguina et al. [30] introduce SPARQLGEN, a generative approach for generating SPARQL queries using GPT-3 [32], leveraging various types of contexts in the prompt to influence query generation performance. Their approach is similar to ours for LPLMs except that we use only a single prompt structure and use some of the latest LPLMs, Llama2 [11] and Code Llamav2 7B [12]. While all these approaches leverage large pre-trained language models, they do not focus on their performance in out-of-distribution settings and do not analyze their limitations in detail.

III. METHODOLOGY
A. TASK AND DATA FORMAT
Each dataset is a standard, publicly available dataset, composed of a set of question-query pairs called entries. The question is formulated in English and the target is a SPARQL query. All the datasets are generated automatically using templates. These templates match question-query structures with placeholders that are later filled with specific URIs. We call this matching pair a global template. Since some question structures can be associated with several query structures and vice versa, we also refer to question templates and query templates as the isolated question or query structures. For example, the global template Question: ''what is the <1> of <2> ?'' / Query: select distinct ?uri where {<2> <1> ?uri} generated the following Entry Question: ''what is the office of richard coke ?'' and its associated Entry Query: select distinct ?uri where { dbr:Richard_Coke dbp:office ?uri }.

1) ANNOTATIONS
To identify the impact of annotations on models' performance, we experiment with three schemes.

a: ''RAW'' QUESTIONS
Questions without annotation are designated as ''raw-question'' in our results.

b: ''TAG-WITHIN'' QUESTIONS
''Tag-within'' questions are generated by substituting the natural language words within the placeholders of the question template with the corresponding URIs and literals extracted from the placeholders of the query template. Having question and query templates for each dataset entry allows us to easily match the position of a token in the question to the position of the identifier of the knowledge base URI when performing this type of annotation. The following example illustrates the question and query templates and the resulting tagged version of the original raw question. Figure 1 describes the process of tagging a raw question.
For instance, the ''tag-within'' question associated to the example ''which person has opponent ike clanton ?'' is ''who
is the ≪ dbo:Person≫ whose ≪ dbo:opponent≫ is ≪ dbr:Ike_Clanton≫ ?''. In contrast to the approach proposed by [8] and [1], this study includes not only URIs but also literals. For example, the question ''Give me organization that contains the word zollkriminalamt in their name'' expects the query to filter the organization names based on the appearance of the string ''zollkriminalamt''. This modification is motivated by the fact that literals are also KB elements that should be transferred directly from the question to the query, similarly to URIs.

FIGURE 1. Procedure for question tagging.

c: ''TAG-END'' QUESTIONS
In tag-end questions, the list of KB URIs available in the SPARQL query is appended to the question. Instead of replacing the corresponding natural language tokens, URIs are randomly placed at the end of the question, together with their corresponding English label. The inclusion of labels next to the URIs improves the understanding of the semantics of the URIs. Each pair of URI/label is separated by a <sep> token. The ''tag-end'' question associated to the previous example is ''who is the person whose opponent is ike clanton ? <sep> ≪ dbr:Ike_Clanton≫ ike clanton <sep> ≪ dbo:opponent≫ opponent <sep> ≪ dbo:Person≫ person''.

B. BASE MODEL ARCHITECTURES
1) NON-PRE-TRAINED MODELS
We first tested non-pre-trained encoder-decoder models. Following [16] and [1], we used a Transformer model [18] and a ConvSeq2Seq model [17]. We re-ran the experiments presented in these two papers (''raw-question'' and ''tag-within'' questions on LC-QuAD 1.0 [19] and DBNQA [20]) and extended the experiments with an additional dataset (LC-QuAD 2.0 [26]) and an additional data annotation method (''tag-end'' questions). These results shall demonstrate the limitations explained in the introduction and set a baseline for comparison with the following models.

2) PRE-TRAINED MODELS
To compare our results with [8], we used BART-base [9] and T5-small [10] as pre-trained encoder-decoder architectures. We also compare these architectures with the non-pre-trained ones in the same experimental settings to show the impact of pre-training. For SPARQL schema elements and URIs, we followed the encoding described by [8]. Each SPARQL schema element and each URI prefix is considered as a special token. With T5, we used the sentinel tokens and with BART, we added new tokens to represent these elements.

3) LARGE PRE-TRAINED MODELS
In order to evaluate Large Language Models, we made use of Ludwig,¹ an open-source framework for training and deploying machine learning models. Ludwig streamlines the intricate process of constructing, training, and deploying these models. We experimented with two LLMs: Llama [11], as it is among the top-performing models in several NLP tasks, and Code Llama [12], which is specialized for code generation. Using Ludwig, we fine-tuned these models in two ways: first, we performed fine-tuning with just the input (NL question) and output (SPARQL query). Then, we used a prompt with an instruction explaining the task, followed in the input sequence by the NL question, and the SPARQL query as output. We also provided the KB elements needed to generate the SPARQL query in the latter method. Providing the URIs through the prompt's instruction allows us to simulate the ''tag-end'' setting as an annotation method used with the PLMs and the NPLMs.
We have also experimented with two versions of Mistral, Mistral 7B v0.3 [33] and Mistral 7B Instruct v0.3,² only in the instruction fine-tuning fashion, which is the recommended way for this task.³ Mistral 7B v0.3 [33] is a state-of-the-art language model designed for a wide range of natural language processing tasks. With 7 billion parameters, it strikes a balance between model complexity and computational efficiency. To demonstrate the generalization capabilities of Mistral 7B [33], this model was fine-tuned using publicly available instruction datasets from HuggingFace, without employing any special techniques or proprietary data. The resulting model, Mistral 7B Instruct, surpasses all other 7B models on MT-Bench [34] and performs on par with 13B chat models. For these two models we use the provided libraries⁴ for memory-efficient and performant fine-tuning.

¹ https://ludwig.ai/latest/
² https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
³ https://github.com/mistralai/mistral-finetune
⁴ https://docs.mistral.ai/guides/finetuning/

C. COPY MECHANISM
As stated previously, the main limitation of neural query generators is the accurate generation of URIs in the SPARQL queries, and this is even more important when these URIs are unknown. Thus our hypothesis is that the SPARQL query generation task can be decomposed into a query structure generation, followed by a copy of the URIs at some specific positions.
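To make this decomposition concrete, the toy sketch below (our illustration, not the authors' code) fills the URI slots of a query skeleton by copying the annotated spans of a ''tag-within'' question in order of appearance; the numbered placeholder syntax, the << >> rendering of the guillemets, and the helper name fill_skeleton are assumptions made for the example.

```python
import re

def fill_skeleton(skeleton: str, tagged_question: str) -> str:
    """Toy illustration: fill the URI slots <1>, <2>, ... of a generated query
    skeleton by copying the annotated spans of a ''tag-within'' question."""
    # URIs are assumed to be marked between << and >> by the annotation step.
    uris = re.findall(r"<<\s*(\S+)\s*>>", tagged_question)
    for i, uri in enumerate(uris, start=1):
        skeleton = skeleton.replace(f"<{i}>", uri)
    return skeleton

tagged = "who is the << dbo:Person >> whose << dbo:opponent >> is << dbr:Ike_Clanton >> ?"
skeleton = "select distinct ?uri where { ?uri rdf:type <1> . ?uri <2> <3> }"
print(fill_skeleton(skeleton, tagged))
# select distinct ?uri where { ?uri rdf:type dbo:Person . ?uri dbo:opponent dbr:Ike_Clanton }
```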

We extend the use of the copy mechanism [1], which is based on only copying elements from a KB vocabulary and masking these elements to the encoder and decoder blocks. Our extension includes literals in the KB vocabulary and the addition of the copy to the pre-trained models' architectures. This is further explained in the following sections.

1) VOCABULARIES
We define three distinct vocabularies:
• The English natural language vocabulary, denoted as W, composed of the questions' tokens.
• The SPARQL vocabulary, denoted as S, which includes SPARQL keywords.
• The KB vocabulary, denoted as K, which includes URIs (classes, properties, and resources) as well as literals.
Each entry of our datasets is composed of a question with tokens from W and a query with tokens from K ∪ S. Our annotated versions of the dataset tag the question elements with URIs from K that appear in the query. In the ''tag-end'' setting, the size of W might be slightly larger since the labels used next to the URIs are not necessarily the same as those used in the original question. That is because we use the URI's English label defined in the KB. On the opposite, in the ''tag-within'' version, the size of W is massively reduced since the words used to describe specific URIs in the question are replaced by tokens from K.

2) DESCRIPTION
The copy mechanism computes, at each generation step, a probability of copy for the next token. This probability weights the generation score and copy score of the next token. For each word of the target vocabulary, the generation score is based on the logits of the decoder. The copy score is based on the cross-attention weight between the next predicted token and each token of the input. We arbitrarily chose to use the last attention head for every model. Unlike PGN [25], this implementation of the copy mechanism is designed to only copy tokens from a specific vocabulary, here the KB vocabulary K. When the input sentence is provided to the encoder, the words from this vocabulary are masked. Then, the copy scores are only computed for these masked tokens.
More formally, at generation step i, the generation probability of a token in S ∪ K is computed as follows:

$$ p_{t_i} = p_{\text{copy}} \times p_C + (1 - p_{\text{copy}}) \times p_G \tag{1} $$

where p_{t_i} = P(t_i | w_{0:m}; t_{0:i-1}), with t_i being the i-th token generated, t_{0:i-1} being the tokens previously generated, and w_{0:m} being the original input tokens before masking. p_copy is the probability of using copy instead of generation, p_C the probability of copying a token, and p_G the probability of generating a token according to the decoder. We can express p_G, p_C and p_copy as such:

$$ p_G = \begin{cases} \sigma_S\big(\mathrm{DEC}_i(t_i \mid \tilde{w}_{0:m}; t_{0:i-1})\big) & \forall t_i \in S \\ 0 & \forall t_i \notin S \end{cases} \tag{2} $$

$$ p_C = \begin{cases} \sigma_{K \cap w_{0:m}}(A_{k,i}) & \forall t_i \in K \ \text{s.t.} \ \exists k : w_k = t_i \\ 0 & \text{otherwise} \end{cases} \tag{3} $$

$$ p_{\text{copy}} = \sigma\big(\mathrm{DEC}_i(t_i \mid \tilde{w}_{0:m}; t_{0:i-1}) \times B\big) \tag{4} $$

with w̃_{0:m} being the tokens given as input to the encoder where the words from K have been masked, DEC_i being the logits of the decoder at timestep i, A_{k,i} the cross-attention weight computed by the encoder-decoder between the k-th term of the input and the i-th term of the predicted output, B ∈ R^{|S|×1} a matrix with weights learned during training, σ(·) the sigmoid function, and σ_X(·) the softmax function computed over set X.
In the end, with this copy mechanism, the encoder vocabulary is W, the decoder vocabulary is S, and only tokens from K can be copied from the question to the query.
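The combination in Eqs. (1)-(4) can be sketched in PyTorch roughly as follows. This is a minimal illustration under our reading of the description above; the tensor names, the single-head attention input, and the handling of the masks are assumptions, not the authors' implementation.

```python
import torch

def copy_generation_distribution(dec_logits, cross_attn, copy_proj,
                                 sparql_mask, kb_copy_mask):
    """Combine generation and copy scores as in Eqs. (1)-(4).

    dec_logits:   (batch, |S|)  decoder logits over the SPARQL vocabulary S
    cross_attn:   (batch, m)    last-head cross-attention weights over the m input tokens
    copy_proj:    learned projection playing the role of B in Eq. (4)
    sparql_mask:  (|S|,) bool   True for tokens belonging to S (Eq. 2)
    kb_copy_mask: (batch, m) bool  True for masked input positions holding KB tokens (Eq. 3)
    Returns a (batch, |S| + m) distribution: generation scores over S followed
    by copy scores over the input positions.
    """
    # Eq. (4): scalar probability of copying, predicted from the decoder logits.
    p_copy = torch.sigmoid(copy_proj(dec_logits))              # (batch, 1)
    # Eq. (2): softmax restricted to the SPARQL vocabulary, zero elsewhere.
    p_g = torch.softmax(dec_logits.masked_fill(~sparql_mask, float("-inf")), dim=-1)
    # Eq. (3): softmax of cross-attention restricted to copiable KB positions.
    p_c = torch.softmax(cross_attn.masked_fill(~kb_copy_mask, float("-inf")), dim=-1)
    # Eq. (1): weight the two distributions by the copy probability.
    return torch.cat([(1 - p_copy) * p_g, p_copy * p_c], dim=-1)

# Tiny smoke test with random tensors: every row of the output sums to 1.
torch.manual_seed(0)
batch, vocab_s, m = 2, 6, 5
proj = torch.nn.Linear(vocab_s, 1, bias=False)   # plays the role of the matrix B in Eq. (4)
dist = copy_generation_distribution(
    torch.randn(batch, vocab_s), torch.rand(batch, m), proj,
    sparql_mask=torch.ones(vocab_s, dtype=torch.bool),
    kb_copy_mask=torch.ones(batch, m, dtype=torch.bool))
print(dist.sum(dim=-1))
```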

IV. EXPERIMENTS
In our experiments, we combined various parameters, including six different base model architectures, the addition or removal of the copy mechanism for pre-trained and non-pre-trained models, two types of question annotations, and data from three different datasets. We conducted eight experiments with Large Language Models (LLMs) using two different models and two fine-tuning strategies (standard fine-tuning and instruction fine-tuning) with varying sample proportions. For Pre-trained Language Models (PLMs) and Non-Pre-trained Language Models (NPLMs), we conducted 56 experiments. We carried out 56 tests, comprising ten adaptations of existing studies and 46 original results, which we will discuss in more detail in the following sections.

A. DATASETS
We experiment with three datasets: LC-QuAD 1.0, LC-QuAD 2.0 and DBNQA. Our primary motivation for these choices is to ensure comparative reproducibility and facilitate the extension of our experiments. Additionally, the substantial size of these datasets and the active community contributions to the underlying knowledge bases, particularly Wikidata (linked to the Wikipedia project), played a crucial role in our decision. These datasets are specifically designed for SPARQL query generation, ensuring high-quality question-query pairs across a wide range of domains and topics. Dataset statistics can be seen in Table 1 and Table 2. The vocabulary sizes for the three datasets are presented in Table 2. We also report the size of the out-of-vocabulary (OOV) tokens as it highlights the difficulties mentioned in the introduction. We can notice that the set S does not include OOV tokens whereas the sets W and K often feature a significant amount of OOV tokens.

1) LC-QUAD 1.0
The LC-QuAD 1.0 (referred to as LCQ1 for short in tables) dataset [19] includes two versions of the question. The first one is the automatically generated question based on templates (aka generation templates). The second is a human reformulation. For example, the question ''how many movies are there whose director is Stanley Kubrick?'' is reformulated as ''how many movies did Stanley Kubrick direct ?''. The dataset also includes, for each question, a template id corresponding to the query template that generated the SPARQL query. To apply our tagging methodology, we identified the question templates for each question and extracted the global templates associated to it. We found 35 global templates, 23 question templates and 34 query templates (see section III-A for a definition).

2) LC-QUAD 2.0
The LC-QuAD 2.0 (referred to as LCQ2 for short in tables) dataset [26] is richer than LC-QuAD 1.0 since it also includes literals and filtering operations (i.e. the FILTER option in the SPARQL query). It also includes reformulated questions. We can see in Table 2 that the size of the SPARQL vocabulary S is much bigger than in LC-QuAD 1.0 [19] and even bigger than in DBNQA [20] despite the overall size difference. Another difficulty is that LC-QuAD 2.0 [26], by adding filtering operations, is the only one of our datasets to feature literals. Hence, it is the most challenging of our datasets. We found 34 global templates, 27 question templates and 33 query templates.

3) DBNQA
We extracted a subset of the DBNQA dataset [20], which is far more massive than our two other datasets. The reason for using only a subset is that the original DBNQA dataset does not include any information about any of the generation templates associated with each entry. To be able to tag the questions, we used a subset of this dataset for which we could identify the global templates. These global templates were then used to generate the ''tag-within'' questions. We found 512 global templates, 507 question templates and 157 query templates.

TABLE 1. Size of datasets.
TABLE 2. Size of vocabularies.

4) DATASET ANSWERS
The datasets used in this study do not provide the answers to the question-query pairs they contain. To address this problem, we enriched these datasets with the expected answers for each query in the three datasets. To extract the answers, we used a DBpedia SPARQL endpoint for LC-QuAD 1.0 and DBNQA. The DBpedia endpoint was based on a 2016 dump, which was the version available at the time of the LC-QuAD 1.0 publication. We used the same version for DBNQA due to the unavailability of dumps for the intended 2018 version. We also implemented a local Wikidata 2021 endpoint for LC-QuAD 2.0 to reduce the number of queries returning empty answers on other endpoint versions and for speed considerations at test time. All reported results are computed only on the subsets of non-empty answers of the test sets. This is especially important since this can lead to inflated performances when models provide incorrect queries that return empty answers, yet still receive positive scores with answer-based metrics if the gold standard also includes empty answers. After discarding empty answers, we keep 100% of the LC-QuAD 1.0 test set, 96.20% of the LC-QuAD 2.0 test set and 80.41% of the DBNQA test set.

B. MODELS HYPERPARAMETERS
For our experiments, we chose not to focus on the optimization of the hyperparameters but on the comparison between approaches. Therefore, we based our experiments on the experimental settings of state-of-the-art approaches and on computational resources limitations.
For the Transformer and the ConvSeq2Seq architectures, we followed [16] hyperparameters. For the BART and T5 models, we followed [8] recommendations. We also followed the recommendations from [1] and [8] to train the models. We chose to only experiment with T5-small since [8] reported better performance with this version on LC-QuAD 1.0 and similar performance between T5-small and T5-base on LC-QuAD 2.0. We also chose BART-base to compare our results with those of [8]. For more details about the hyperparameters you can refer to B3.
We used batch sizes of 32, 16 and 5 for the non-pre-trained models (respectively for LC-QuAD 1.0, LC-QuAD 2.0 and DBNQA). We only adjusted the batch size of the Transformer for DBNQA to 32 because it helped raise performance. For pre-trained models, we used batch sizes of 16, 8 and 5 respectively for LC-QuAD 1.0, LC-QuAD 2.0 and
DBNQA. For non-pre-trained experiments, we trained our models for 500, 150 and 50 epochs (respectively for LC-QuAD 1.0, LC-QuAD 2.0 and DBNQA). For pre-trained models, we used 200, 50 and 20 epochs. Each reported result is the mean of three different runs with different random seeds. We train the model for the number of specified epochs for each run and keep the model with the best validation loss for testing. The training is done with teacher forcing, i.e. the decoder is supposed to predict the next token of the gold standard query. For generation at test time, we use greedy decoding.

C. METRICS
To evaluate the performance of our models, we use three metrics. First, we report the BLEU score [21], which is a popular NMT metric that compares the predicted query to the gold standard query at the token level. We also report two question-answering metrics. The first one is the accuracy of the produced answer. Each predicted query is run against the KB and we compare the returned answer to the answer of the gold standard. This is the most relevant metric to compare to studies such as [8], even though we did not use the strategy of keeping the first non-empty answer amongst the top-n generated queries. Indeed, our models only generate one query. Finally, we compute the F1-score of the answers. We then average the F1-scores of each entry for each of the three runs to get the F1-score of the model on a given test set.

V. RESULTS
We report results rounded to integers to facilitate comparison between results within the same tables. The results for non-pre-trained models can be found in Table 3.

A. NON-PRE-TRAINED MODELS
1) REPRODUCTION RESULTS
Parts of our results reproduce existing studies. For models without and with copy, results have already been reported on LC-QuAD 1.0 and DBNQA without annotation [1], [16] and with ''tag-within'' questions [1]. We can observe that our results slightly improve those reported by [1], [16] for LC-QuAD 1.0 with Transformer and ConvSeq2Seq models with and without the copy mechanism. We also report similar results for DBNQA with ConvSeq2Seq with and without the copy mechanism.

2) ANNOTATION IMPACT
For almost all of our results, we can see that the question annotation improves the performance. Performances diminish a little only for Transformer on DBNQA and for ConvSeq2Seq on LC-QuAD 2.0 with ''tag-within'' questions compared to no annotation (''raw-question''). Except for these two counter-examples, we see a consistent improvement due to annotation.
In the copy-based architectures, we can note that Transformers reach good performances with ''tag-within'' questions and completely fail with ''tag-end'' questions. ConvSeq2Seq models reach good performances with both settings. When using the copy mechanism, we don't run experiments on ''raw-question'' questions because they don't include URIs to copy.

3) IMPACT OF THE COPY MECHANISM
The copy mechanism almost always has a huge positive impact on performance. The impact of copy is always more significant with ''tag-within'' questions than with ''tag-end'' questions. Finally, we can notice that the DBNQA dataset is the one that benefits the least from the copy mechanism. This is probably because the huge amount of data allows the non-pre-trained models to learn well enough without the copy mechanism. There is, however, still a significant jump in performance compared to non-copy models.

B. PRE-TRAINED MODELS
1) BASE MODELS
We compare BART and T5 and reproduce the results of [8] on top of additional experiments. We can note that T5 consistently outperforms BART, and that ''tag-within'' questions always imply similar or better performance than ''tag-end'' questions, as shown in Table 4. We can also see that for the ''tag-end'' questions without the copy mechanism, our results are better than or close to the results reported by [8] on LC-QuAD 1.0 and LC-QuAD 2.0. Our results for BART are much better than what they report, whereas we obtain a drop of 3 points for T5 on LC-QuAD 2.0. This might be due to the fact that we include the literals at the end of the questions, contrary to [8]. Finally, our generation strategy is different. We evaluate the greedy generation of our models whereas [8] kept the top-10 beam-generated queries, ran each of them on the endpoint, and only evaluated the first one to return a non-empty answer.

2) COPY-BASED MODELS
We can note that BART benefits from the copy mechanism much more than T5. Except for LC-QuAD 2.0 with ''tag-within'' questions, the copy mechanism always allows a rise in performance for BART. The case of LC-QuAD 2.0 with ''tag-within'' questions coupled with copy-based models shows a specific difficulty that is discussed in Section VII. On the contrary, the copy mechanism only allows a slight rise in performance for T5 with the LC-QuAD 1.0 dataset. However, it considerably lowers the results with the ''tag-end'' questions on both the LC-QuAD 2.0 and DBNQA datasets.

3) ANNOTATION IMPACT
Overall, without the copy mechanism, we can notice that even though annotations helped improve performances for non-pre-trained models, the impact is much more noticeable for pre-trained models, which often reach much better
performance with ''tag-within'' questions compared to the ''raw-question'' setting. Notably, we can report that non-pre-trained models gain around 2.02% of F1 points on average going from no annotation (''raw-question'') to ''tag-within'' questions, whereas pre-trained models gain around 48.70% of F1 points on average.

TABLE 3. Results for non-pre-trained models.
TABLE 4. Results for pre-trained models.

C. LARGE LANGUAGE MODELS
We also conduct an empirical evaluation of four LLMs, namely Llama [11], Code Llama [12], Mistral 7B v0.3 [33] and Mistral 7B Instruct. Due to computational and time limitations, we only exploit the LC-QuAD 2.0 dataset in these experiments due to its higher difficulty level. Our evaluation uses different portions of the training data, specifically 25% and 50% of the training set, corresponding to 5,440 and 10,880 entries from the train set, respectively. When fine-tuning with instruction, we also try using 100% of the train set to get more insight into the models' behavior concerning the train set's size. This is because this fine-tuning method shows more sensitivity to the train set size than the standard fine-tuning. For Mistral's models we just use 100% of the train set during fine-tuning.
We explore two distinct fine-tuning approaches. First, the standard fine-tuning method entails providing the model with the question as input and the corresponding SPARQL query as output. Second, we adopt the instruction fine-tuning method, which augments the input-output pair with an additional instruction to guide the model's task comprehension. To simulate the ''tag-end'' setting, our instruction prompt explicitly specifies the URIs to be used for generating the query, as illustrated in Figure 2. In contrast, the standard fine-tuning approach does not incorporate such explicit instructions; instead, it utilizes inputs with tags (questions wherein URIs corresponding to knowledge base elements are appended at the end of the input sequence) or without tags to emulate ''raw-question''.
As shown in Tables 5 and 6, the performance of our Large Language Models (LLMs) falls short in comparison to our T5 and ConvSeq2Seq models. All the LLMs we experimented with exhibit suboptimal performance levels, with the best F1 scores being 13% and 23%, respectively, for Llama and Code Llama. For Mistral's models, the best F1 scores are respectively 10% and 13% for Mistral 7B and Mistral-Instruct 7B. These results collectively indicate that, despite the tagging of questions and the extent of data utilized for fine-tuning, these LLMs do not substantially enhance their proficiency in generating effective SPARQL queries. Our first hypothesis explaining the weak performance of LLMs is the limited amount of SPARQL query-related data in pre-training datasets. The second is that these models rely on their parametric memories to generate URIs, which precisely does not work and is actually our reason for using the copy mechanism in non-pretrained and pretrained language models to avoid errors in the generation of URIs and literals. Finally, it is worth noting that the instruction boosts the performance for these models as it helps them better understand the task and mitigates undesirable behaviors such as generating extra text.
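As a rough illustration of the instruction-style input described above (the paper's actual prompt appears in Figure 2), a prompt that lists the gold KB URIs to emulate the ''tag-end'' setting could be assembled as follows; the instruction wording, section markers, and helper name are our own assumptions, not the exact prompt used in the experiments.

```python
def build_instruction_prompt(question: str, uris: list[str]) -> str:
    """Assemble an instruction-style fine-tuning input for SPARQL generation.
    The wording and layout are illustrative, not the paper's exact prompt."""
    uri_list = "\n".join(f"- {u}" for u in uris)
    return (
        "### Instruction:\n"
        "Translate the question into a SPARQL query. "
        "Use only the knowledge-base elements listed below.\n"
        f"{uri_list}\n\n"
        "### Input:\n"
        f"{question}\n\n"
        "### Response:\n"
    )

prompt = build_instruction_prompt(
    "who is the person whose opponent is ike clanton ?",
    ["dbr:Ike_Clanton", "dbo:opponent", "dbo:Person"],
)
# During fine-tuning, the gold query is appended after "### Response:" as the
# target completion, e.g.:
#   select distinct ?uri where { ?uri dbo:opponent dbr:Ike_Clanton . ?uri rdf:type dbo:Person }
print(prompt)
```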

FIGURE 2. Example of prompts with the instruction, the input and the output.

D. BEST RESULTS
We compare our accuracy scores with the former best approaches reported in [1] and [8] and our F1-scores with previous works as shown in Table 7. For each dataset, we report new state-of-the-art results in terms of F1 score and answer accuracy. The following elements summarize our best results for each dataset:
• LC-QuAD 1.0: we report 96% of F1 and 96% of Accuracy with ConvSeq2Seq and 96% of F1 and 97% of Accuracy for T5, all with the copy mechanism and ''tag-within'' questions. This outperforms [8] by 5 points (of F1) and [1] by 2 points (of Accuracy).
• LC-QuAD 2.0: we report 94% of F1 and 94% of Accuracy for T5 with ''tag-within'' questions. This outperforms [8] by 2 points of F1.
• DBNQA: we report 97% of F1 and 97% of Accuracy with T5 for both ''tag-within'' and ''tag-end'' questions. This outperforms [1] by 10 points.
However, we can note that while both studies [1], [8] use an answer-based score, there are some methodological differences. On the one hand, [1] doesn't discard empty answers, which boosts their performance. On the other hand, [8] uses the first non-empty answer among several beam search generated queries, which also might boost performances compared to our approach, which generates only one query.
To compare our results to state-of-the-art models, we report the models with the best performance from the LC-QuAD 1.0 and LC-QuAD 2.0 leaderboard, as shown in Table 7. The first work, [35], focuses on multilingual question answering over knowledge graphs (KGQA) in a zero-shot transfer setting, utilizing unsupervised bilingual lexicon induction (BLI) to create augmented training data for multiple languages. The second work, [36], introduces SGPT, a novel approach combining end-to-end and modular systems with an advanced language model (GPT2 [23]). It employs an embedding technique for complex question patterns, incorporates graph-specific information into language model parameters, and introduces new metrics called SP-BLEU and SP-F1.
Given our RAM limitations, we employed the gradient accumulation method to maintain a valid comparison. By accumulating gradients four times for a batch size of 16, we effectively simulated a batch size of 64, in line with [1]. This approach yielded results consistent with those depicted in Table 8. Here we report only the BLEU and Accuracy since those are the metrics used by [1]. We observe slightly lower performance in LC-QuAD 1.0.

VI. ERROR ANALYSIS
To further study the differences between all our models, we perform an error analysis to identify the kind of errors made by the models.

A. TOKEN/ERROR TYPES
The tokens comprising the SPARQL queries generated by our models can be systematically categorized into six distinct classes: URIs, SPARQL keywords, functions or operators, literals, variables, and unknown tokens. Consequently, our analysis identifies six different error types, each attributed to the specific token category where it was observed. Therefore, we delineate the following error categories:
• Knowledge graph URIs (abbreviated as URIs): indicates errors due to a difference in URIs between the reference and the prediction. We use regular expressions that detect prefixes (complete or truncated) and the identifier of each entity or resource (e.g., wdt:P31, http://www.wikidata.org/entity/Q146). ''Fake URIs'' are tokens with a URI pattern that do not exist in the knowledge base.
• SPARQL keywords (abbreviated as SVocab): These include errors due to an incorrect generation of a SPARQL keyword at a specific position in the query. Some of the SPARQL keywords considered are, for instance, ''select'', ''ask'', ''describe'', ''distinct'', ''reduced'', ''where'', ''limit''.
• Functions or operators (abbreviated as Fct): These include errors due to reference and prediction mismatch on functions and operators. Some examples of SPARQL functions considered are: ''='', ''!='', ''<'', ''>'',
''<='', ''>='', ''+'', ''−'', ''*'', ''/'', ''str'', ''ucase'', ''lcase'' and ''concat''.
• Literals (abbreviated as Lit): include faulty generations of a literal value. A literal is recognized as a string in double quotes or can be a numeric literal (e.g. 42), a boolean literal (true, false), or a date.
• Variables (abbreviated as Var): errors made on variables.
• Unknown (abbreviated as Unk): designates an error made on the generation of the token ''<unk>'' that represents all the OOV elements.

TABLE 5. Standard fine-tuning of large language models.
TABLE 6. Instruction fine-tuning on Llama, Code Llama and Mistral.
TABLE 7. F1 performance in % for different SOTA models.
TABLE 8. BLEU performance (%) comparison with Hirigoyen et al. [1].

B. ERROR TYPE DISTRIBUTION
We align the reference query and the generated query and detect mismatches at the token level. Based on token types, our objective is to determine which type of token is generated instead of the expected type. This makes it possible to study exactly how models are wrong: for example, how often does a model generate a URI instead of a SPARQL keyword, and how often does it misplace a URI in triples (when there is more than one URI in the same query). We also measure if URIs in particular can be hallucinated, that is, how often does a query include fake URIs. Examples of errors are shown in Table 9.

1) ERROR DISTRIBUTION FOR NON-PRETRAINED MODELS
For the LC-QuAD 1.0 dataset, the base non-pre-trained models ConvSeq2Seq and Transformer have a lot of difficulties generating the correct queries because they make errors practically on all types such as URIs, SVocab, and the Unknowns. This confusion of models is further accentuated on the LC-QuAD 2.0 dataset, in which we also observe errors at the Variable level. However, in copy-based models, the errors are only at the level of URIs and SPARQL Vocabulary for all datasets and are due to the incorrect ordering of tokens for these two types during generation. Indeed, copying helps to respect the template of the reference query but does not help to choose the right URIs among the candidates for copying. With the ''tag-end'' setting, we note a lot of errors on the URIs, and the rate of Fake-URIs is slightly higher compared to ''tag-within'' and ''raw-question'', which are settings that lead to lower error percentages.

2) ERROR DISTRIBUTION FOR PRETRAINED MODELS
We have roughly the same observations with the pre-trained models, except that there are unknown token errors in LC-QuAD 2.0, which has a much higher number of OOV tokens in the test set, as shown in Table 2. Indeed, on LC-QuAD 1.0, BART and T5, with or without the copy mechanism, make errors in the generation of URIs and SVocab. Copy-based models make wrong choices of URIs among the copy URI candidates. As previously with NPLMs, on LC-QuAD 2.0, some errors are made on variables due to the greater complexity of this dataset. From the point of view of the impact of the annotation, we see that without the copy, we have approximately the same rate of URI errors (41.7%) whatever the tagging method, and this rate increases by 12% in the ''raw-question'' data.
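A simplified version of the token-level alignment and typing used in this error analysis could look as follows; the regular expressions and keyword lists are illustrative approximations of the categories described above, not the authors' exact implementation, and punctuation such as braces is not handled.

```python
import re
from itertools import zip_longest

SPARQL_KEYWORDS = {"select", "ask", "describe", "distinct", "reduced", "where", "limit"}
FUNCTIONS = {"=", "!=", "<", ">", "<=", ">=", "+", "-", "*", "/", "str", "ucase", "lcase", "concat"}
URI_PATTERN = re.compile(r"^(?:\w+:\S+|<?https?://\S+>?)$")          # prefixed or full URIs
LITERAL_PATTERN = re.compile(r'^(?:".*"|\d+(?:\.\d+)?|true|false)$')  # strings, numbers, booleans

def token_type(token: str) -> str:
    """Assign one of the six token classes used in the error analysis."""
    t = token.lower()
    if t == "<unk>":
        return "Unk"
    if t in SPARQL_KEYWORDS:
        return "SVocab"
    if t in FUNCTIONS:
        return "Fct"
    if token.startswith("?"):
        return "Var"
    if URI_PATTERN.match(token):
        return "URIs"
    if LITERAL_PATTERN.match(token):
        return "Lit"
    return "Unk"

def error_types(reference: str, prediction: str) -> list[tuple[str, str, str]]:
    """Align the two queries position by position and report each mismatching
    token together with the expected token's type."""
    errors = []
    for ref_tok, pred_tok in zip_longest(reference.split(), prediction.split(), fillvalue=""):
        if ref_tok != pred_tok:
            errors.append((token_type(ref_tok or pred_tok), ref_tok, pred_tok))
    return errors

ref = "select distinct ?uri where { dbr:Richard_Coke dbp:office ?uri }"
pred = "select distinct ?uri where { dbr:Richard_Coke dbp:party ?uri }"
print(error_types(ref, pred))   # [('URIs', 'dbp:office', 'dbp:party')]
```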

TABLE 9. Some examples of errors on SPARQL queries.

3) ERROR DISTRIBUTION FOR LARGE PRETRAINED MODELS
As for the LLMs, the generation errors are very largely made at the level of the URIs, then in a smaller percentage on Variables, and finally in a small percentage on the SVocab, regardless of the fine-tuning method used. At almost all positions where URIs are expected, LLMs generate incorrect URIs and occasionally predict URIs that do not exist. This hallucination is due to the use of these models' parametric memories during the generation of the query. In addition, the Variables and SVocab are mixed together in the query, i.e., at the position where a variable is expected, an SVocab is generated, and vice versa.

4) SUMMARY
The error distribution in query generation varies among different types of models but is mostly located at the URIs and SVocab levels. NPLMs, when generating queries for the LC-QuAD 1.0 and LC-QuAD 2.0 datasets, struggle with errors across various token types, including URIs, SVocab, and Unk. Copying helps maintain query structure but doesn't always select the correct URIs. PLMs have a similar pattern of errors, with fewer Unk tokens due to their larger vocabulary. These models also exhibit URIs and SVocab generation errors, especially on LC-QuAD 1.0. With LLMs, most errors occur with URIs, followed by Variables and SVocab, and this hallucination is exacerbated by the absence of a copy mechanism. Variables and SVocab tokens are often interchanged during generation. In all cases, maintaining tagging methods is beneficial, and the ''tag-within'' strategy works best when using the copy mechanism.

C. GENERALIZATION CAPABILITIES OF THE BEST MODELS
For this part, we considered the best models and optimal configurations in two datasets and evaluated their generalization capacity. For both LC-QuAD 1.0 and LC-QuAD 2.0, the best models are T5 Base and ConvSeq2Seq Copy, with the ''tag-within'' setting. To test the generalization abilities of these models, we carried out the following three experiments:
1) We first trained these models with the original (template-based) questions and then tested them on the test set's reformulated questions. Our objective here is to test the ability of the models to handle natural questions when trained on template-based questions.
2) We trained these models with the original questions and then tested them on the train set's reformulated questions. Our objective here is to test the ability of the models to handle natural questions that are paraphrases of template-based questions encountered during the training phase.
3) We trained the models on the train set's reformulated questions and tested them on the test set's reformulated questions. Our objective here is to measure the performance of the models on natural, non-template-based questions.
The results of these experiments are shown in Table 10, Table 11 and Table 12. Since copying is performed on tagged questions, there aren't any results for the ConvSeq2Seq Copy models in the ''raw-question'' configuration. The ''tag-within'' method is not shown in these tables, as reformulations are exclusively applied to ''tag-end'' questions. This is because when we use reformulated questions, we can no longer leverage any template for tagging the question.
It is clear from Table 10 that query generation is more challenging, as the train questions' structures differ from those of the test set. Conversely, as shown in Table 11, the models exhibit approximately the same performance in query generation when reformulated questions are paraphrases of train questions. Nevertheless, a substantial decrease in model performance is observed when models trained on original questions are tested with reformulated questions. Furthermore, a significant decline is observed across all configurations for models trained and tested on reformulated questions, as shown in Table 12. This reaffirms the notion that templates constitute a significant component for achieving a robust alignment between a question and its corresponding SPARQL query. But this also highlights that current models are not yet ready to be used, as is, for natural questions. It is worth noting that T5 Base is less affected by the ''noise'' introduced with question reformulation compared to ConvSeq2Seq, owing to its pre-training, which endows it with language knowledge and enhanced adaptability to changes while preserving semantics (e.g., synonyms).
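For reference, the answer-level accuracy and F1 reported in these tables (defined in Section IV-C) can be computed from the answer sets returned by the gold and predicted queries roughly as follows; this is our minimal sketch, with our own assumptions about the handling of empty answer sets, not the authors' evaluation script.

```python
def answer_scores(gold_answers: set, predicted_answers: set) -> dict:
    """Answer-set accuracy and F1 for one entry: accuracy is exact set match,
    F1 is the harmonic mean of precision and recall over the returned bindings."""
    accuracy = float(gold_answers == predicted_answers)
    if not predicted_answers or not gold_answers:
        return {"accuracy": accuracy, "f1": accuracy}
    tp = len(gold_answers & predicted_answers)
    precision = tp / len(predicted_answers)
    recall = tp / len(gold_answers)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "f1": f1}

print(answer_scores({"dbr:Ike_Clanton", "dbr:Billy_Clanton"}, {"dbr:Ike_Clanton"}))
# {'accuracy': 0.0, 'f1': 0.666...}
```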

TABLE 10. Performances after training on original questions and testing on the test set reformulated questions.
TABLE 11. Performances after training on original questions and testing on the train set reformulated questions.

VII. DISCUSSION
Overall, the PLMs outperform the NPLMs and the LLMs in both LC-QuAD 1.0 and LC-QuAD 2.0, even though we obtained pretty good results with the ConvSeq2Seq model with copy. Counter-intuitively, we obtained low performance with the LLMs after trying various kinds of training (standard fine-tuning, instruction tuning) with different subsets of the train set. This low performance is partly justified by the small amount of data related to SPARQL in the datasets used to pretrain these models. For example, in Code Llama [12], most data relate to Python, C++, Java, PHP, TS, C#, and Bash. Given the high costs and suboptimal performance of LLMs, their use for this task is questionable in their current form. Alternative models should be developed that do not depend on parametric memories for URI generation.

A. THE IMPACT OF THE TAGGING AND COPY MECHANISMS
Question annotations always positively impact the performance of the generation for non-pretrained and pretrained language models. From Tables 14-17, we observe that in less complex datasets like LC-QuAD 1.0 and DBNQA, the results are similar across models, even though T5 seems to be consistently the top performer. With LC-QuAD 2.0, the PLMs outperformed the other models by a large margin. However, the pretraining in T5 does not solve the problem of identifying the correct URIs, which is why question annotation is important. However, there is a drop in performance when reformulated or raw questions are used at test time with models trained on tagged questions, due to lexical variations. Nonetheless, the generalization capabilities of these models are better than those trained on raw questions or without the copy mechanism.
LLMs obtained a lower performance for SPARQL query generation, even when pre-trained on code generation. The importance of question annotation for improved performance shows the necessity of having better question semantic representations adaptable to knowledge base schemas and resources. In future work, we plan to investigate how to learn objective functions that target annotation and generation at the same time.

1) DIFFICULTY OF THE ''TAG-END'' SETTING FOR THE COPY MECHANISM
We noticed that many models struggle with the ''tag-end'' questions when using the copy mechanism; this mostly occurs with some Transformer-based models, namely our non-pre-trained Transformer and T5. This is probably due to the fact
TABLE 12. Performances after training on reformulated questions and testing on the test set reformulated questions.

This is probably due to the fact that placing KB elements in the query is conditioned by the adequate location of the corresponding URI in the question. In the ‘‘tag-within’’ setting, the cross-attention heads easily learn to map the position in the question to the right position in the query. However, with ‘‘tag-end’’ questions, the model has to associate the KB element with its label and then map this label to the corresponding natural language mention in the original question. These additional steps might be the source of the challenge observed with some Transformer models. We can suppose that the convolution operation is better suited to these steps, maybe because of the proximity between the URI and its label at the end of the annotated questions. Additionally, BART also seems to overcome these difficulties. We can suggest that its pre-training tasks based on denoising might have led to better short-context associations.

2) LIMITATIONS OF THE COPY MECHANISM
The main limitation of this mechanism is that it mainly relies on the structure of the question-query pairs and on the positional mapping of the URIs between questions and queries. It is possible that these models only focus on the task of question-template-to-query-template mapping and placeholder filling. This might pose problems when we expose the model to a question-query pair that is generated by a template not associated with enough (or any) training entries, or, as we saw, to reformulated questions that do not follow a known template.

Another limitation of our copy-based architectures is that they are oblivious to the URIs’ semantics. The encoder-decoder does not see these tokens since they are masked, and the copy mechanism only uses positions to copy the tokens. This might cause problems when the expected query is independent of the question structure. When we mask the URIs, the copy layer calculates the probability distribution of the elements to copy from a list of candidate URIs that occur in the query. This causes errors at the level of the URIs because, as we have seen in the error analysis, a substantial proportion of mistakes is due to the wrong choice of URIs to copy among the candidates. For example, from Table 13, we can see that 27% of the incorrect queries generated by the BART model with copy are caused by a wrong choice among the candidate URIs. By considering the masking that comes with the copy, our hypothesis is that we increase the difficulty of electing a candidate URI for the copy. This masking also accentuates the tendency of the models to choose the most frequent query template for questions having the same question template. These limitations could be addressed by unmasking URIs in the case of pre-trained models. Yet we would then face a ‘‘spelling’’ problem, that is, the problem that occurs when the tokenizer splits the URIs into fragments and does not properly merge them back at generation. Other types of copy mechanisms could be explored that would not require masking the URIs and would keep them as single tokens without adding them to the model’s vocabulary. This will be investigated in future work.

B. ERROR TYPES
1) COMPARISON OF MODELS IN TERMS OF PERFORMANCE AND DISTRIBUTION OF THE ERROR TYPES
Table 13 lists the overall performance of our different models for all configurations and the distribution of error types in percentage. To fully understand this table, the reader first needs to look at the metrics that give the overall performance (BLEU score, Accuracy, or F1 score) before looking at the distribution of errors, because a configuration can have a high error rate for a given error type while maintaining good overall performance. All values in the ‘‘Error Distribution’’ columns represent the share of incorrect queries exhibiting the corresponding error type. For example, in the first line of Table 13, we see that the BART model has a BLEU, Accuracy, and F1 of 84%, 72%, and 75%, respectively, and that 67% of incorrect queries are due to incorrect URIs, with 52% being URIs that exist in the knowledge base but are incorrectly placed in the SPARQL query and 15% being ‘‘Fake URIs.’’
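To make the distinction between these two URI error types concrete, the following Python sketch shows one possible way to separate URIs that exist in the knowledge base but are misplaced from ‘‘Fake URIs’’ that do not exist at all. It is an illustrative sketch under simplifying assumptions (the regular expression, the toy KB set, and the function names are not part of our evaluation code).

```python
import re

# Hypothetical KB vocabulary, for illustration only.
KB_URIS = {
    "http://dbpedia.org/resource/Barack_Obama",
    "http://dbpedia.org/ontology/birthPlace",
}

URI_PATTERN = re.compile(r"<([^<>\s]+)>")  # URIs written as <...> in the query

def classify_uri_errors(generated_query: str, expected_query: str, kb_uris=KB_URIS):
    """Split wrong URIs into 'misplaced' (exist in the KB) and 'fake' (do not exist)."""
    generated = set(URI_PATTERN.findall(generated_query))
    expected = set(URI_PATTERN.findall(expected_query))
    wrong = generated - expected                      # URIs that should not be there
    fake = {u for u in wrong if u not in kb_uris}     # hallucinated identifiers
    misplaced = wrong - fake                          # real KB URIs, wrong choice/position
    return misplaced, fake

misplaced, fake = classify_uri_errors(
    "select ?x where { <http://dbpedia.org/resource/Barack_Obma> "
    "<http://dbpedia.org/ontology/birthPlace> ?x }",
    "select ?x where { <http://dbpedia.org/resource/Barack_Obama> "
    "<http://dbpedia.org/ontology/birthPlace> ?x }",
)
print(misplaced, fake)  # -> set() {'http://dbpedia.org/resource/Barack_Obma'}
```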


Considering both LC-QuAD 1.0 and LC-QuAD 2.0 with the best annotation method, which is ‘‘tag-within,’’ we obtain an average of 28% of incorrect URIs and 4% of ‘‘Fake URIs’’ for the T5 model without the copy mechanism. Conversely, for the ConvSeq2Seq model, we have 27.5% of incorrect URIs and 0% of ‘‘Fake URIs.’’ In addition, there is 1% more ‘‘Fake URIs’’ in the ‘‘tag-end’’ setting compared to the ‘‘tag-within’’ setting.

To go deeper into the analysis of the errors for the best models (T5 and ConvSeq2Seq Copy), we also include Tables 14–17, which analyze the errors by showing which token types are generated in place of the expected ones. These analyses reveal that, often, incorrect URIs are predicted by all models instead of the expected URIs. While copy models go wrong by making wrong choices between candidates, base models occasionally generate URIs that do not exist, as we can see in Table 13. There are more ‘‘Fake URIs’’ with LC-QuAD 1.0 than LC-QuAD 2.0 because the morphology of URIs in DBpedia is more sensitive to errors. On the other hand, with Wikidata, an error can lead to another URI which, even if not the expected one, exists in the knowledge base. Finally, we find a relatively high frequency of SPARQL vocabulary (SVocab) or variables generated in place of URIs.

2) SYNTACTIC AND NON-SYNTACTIC QUERIES
We also divide the error types based on how they impact query functionality, that is, syntactic and non-syntactic errors. Syntactic errors involve issues like unbalanced parentheses, braces, or typos in keywords, and lead to a non-executable query. Non-syntactic errors occur in queries that are syntactically correct but refer to non-existent identifiers in the knowledge base. For this latter type, the query is executable but leads to a wrong answer. We conducted a statistical analysis to quantify these errors, calculating the percentages of executable and non-executable queries. The results showed that for LC-QuAD 1.0 and LC-QuAD 2.0, 30.2% and 21.25% of the incorrect queries, respectively, were non-executable due to syntactic errors, while 69.8% and 78.75% had non-syntactic errors. This confirms that most errors stem from issues with knowledge base identifiers. Future research should focus on addressing these errors.
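To illustrate how such a split can be computed, the sketch below checks whether a generated query parses as valid SPARQL and, if it does, whether all of its URIs exist in the knowledge base vocabulary. This is a simplified illustration that assumes the rdflib library is available; it is not the exact script used for our statistics, and the helper names are hypothetical.

```python
import re
from rdflib.plugins.sparql import prepareQuery

URI_PATTERN = re.compile(r"<([^<>\s]+)>")

def error_category(generated_query: str, kb_uris: set) -> str:
    """Classify an incorrect query as syntactic or non-syntactic (executable)."""
    try:
        prepareQuery(generated_query)              # raises on malformed SPARQL
    except Exception:
        return "syntactic"                         # unbalanced braces, keyword typos, ...
    if any(u not in kb_uris for u in URI_PATTERN.findall(generated_query)):
        return "non-syntactic (unknown identifier)"
    return "non-syntactic (wrong structure or URI choice)"

kb = {"http://dbpedia.org/ontology/birthPlace"}
print(error_category(
    "select ?x where { ?x <http://dbpedia.org/ontology/birthPlace> ?y", kb))
# -> 'syntactic' (the closing brace is missing)
```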
C. THE GENERALIZATION CAPABILITIES OF THE MODELS
As shown in Section VI-C, after conducting various experiments to test generalization capabilities, including training on original questions and testing on reformulated questions, the results revealed that models face challenges when the question structure differs at test time, resulting in a critical drop in performance. Models trained and tested on reformulated questions also exhibited lower performance, highlighting the importance of templates for aligning questions with SPARQL queries. Considering the best models for the experiments on original questions with the ‘‘tag-end’’ setting, the average F1 score is 93% and 72% for T5 and ConvSeq2Seq Copy, respectively. On the other hand, for reformulated questions, the average F1 score is 33% and 17.5%, respectively, for T5 and ConvSeq2Seq Copy. Thus, we observe an overall drop in performance of 60% for T5 and a drop of 54.5% for NPLMs compared to the results with original questions. Pre-trained language models are therefore more resilient to question reformulation than non-pre-trained language models. These generalization results confirm the real-world representativeness of the chosen datasets: LC-QuAD 1.0 and LC-QuAD 2.0 effectively represent real-world scenarios for SPARQL query generation, covering a wide range of topics and complexities commonly encountered in knowledge graph querying, despite the limitation of having only one paraphrase per question. Future work should focus on generating multiple reformulations for each question to enhance robustness and performance under lexical variations.

D. OTHER CONSIDERATIONS
1) LC-QUAD 2.0
Most results on LC-QuAD 2.0 are low (compared to results on other datasets), in particular with non-pre-trained models. This is most probably because of the large number of unknown words in the test set (5,751 in LC-QuAD 2.0 versus 368 in LC-QuAD 1.0, as shown in Table 2) and the URI format in Wikidata, which consists of identifiers that cannot be mapped to known semantic structures, compared to DBpedia URIs, which use words. While LC-QuAD 1.0 and DBNQA have easily ‘‘understandable’’ URIs, such as ‘‘http://dbpedia.org/resource/Barack_Obama’’ for Barack Obama in LC-QuAD 1.0, the URIs in LC-QuAD 2.0 are challenging for the models due to their format. For instance, the entity Barack Obama has the URI ‘‘http://www.wikidata.org/entity/Q76’’ in Wikidata, which is a simple concatenation of letters and numbers, making it more difficult for the models to interpret. We also have, to a lesser extent, the presence of literals (for example, specific titles of movies). Additionally, LC-QuAD 2.0 includes much more difficult query structures, as suggested by the large SPARQL vocabulary size reported in Table 2. Another challenge is that some templates of LC-QuAD 2.0 share a question template. This means that for the same question template, we can have different query templates. Since our copy mechanism is based on masking the KB vocabulary, two sentences that share the same question template will be considered exactly the same sentence by the encoder-decoder block, and only the copy block will be able to ‘‘see’’ the words that differentiate them. Therefore, questions that share a question template are generated by the models following the most frequent query template associated with this question template. For instance, the question structure ‘‘what is the <mask> for <mask> of <mask>’’ can be generated by two global templates. The first template generates entries such as: Question: ‘‘what is the country for head of state of mahmoud abbas’’ / Query: select distinct ?sbj where { ?sbj wdt:P35 wd:Q127998. ?sbj wdt:P31 wd:Q6256 }, and appears 1431 times in the train set. The second template generates entries such as: Question: ‘‘what is the medication for significant drug interaction of erythromycin’’ / Query: select distinct ?obj where { wd:Q213511 wdt:P769 ?obj. ?obj wdt:P31 wd:Q12140 }, and appears 1344 times in the train set. We then observe that when the model faces a question with a similar structure in the test set, it will always generate a query following the most frequent query template seen at training time. This implies that all expected queries that do not match this query template will not be properly generated.
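The following short Python sketch illustrates this effect: once KB mentions are masked, the two questions above become indistinguishable for the encoder-decoder, and only the copy block receives the original mentions. The masking function is a simplified stand-in for our actual annotation pipeline, and the mention list is assumed for the example.

```python
# Simplified illustration: masking KB mentions collapses different questions
# that share a question template into the same encoder input.
KB_MENTIONS = {
    "country", "head of state", "mahmoud abbas",
    "medication", "significant drug interaction", "erythromycin",
}

def mask_question(question: str) -> str:
    masked = question
    # Replace longer mentions first so that overlapping spans are handled.
    for mention in sorted(KB_MENTIONS, key=len, reverse=True):
        masked = masked.replace(mention, "<mask>")
    return masked

q1 = "what is the country for head of state of mahmoud abbas"
q2 = "what is the medication for significant drug interaction of erythromycin"

print(mask_question(q1))                       # what is the <mask> for <mask> of <mask>
print(mask_question(q2))                       # what is the <mask> for <mask> of <mask>
print(mask_question(q1) == mask_question(q2))  # True: identical encoder-decoder input
```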


TABLE 13. Models Comparison in terms of performance (based on BLEU, Accuracy and F1-score) and error type distribution.


TABLE 14. Error distribution in % for T5 LCQ 1 ‘‘tag-within’’.
TABLE 15. Error distribution in % for T5 LCQ 2 ‘‘tag-within’’.
TABLE 16. Error distribution in % for ConvSeq2Seq Copy LCQ 1 ‘‘tag-within’’.
TABLE 17. Error distribution in % for ConvSeq2Seq Copy LCQ 2 ‘‘tag-within’’.
TABLE 18. Fine-Tuning Llama v2 7B with 25% of ‘‘tag-end’’ questions of the train set.
TABLE 19. Fine-Tuning Llama v2 7B with 50% of ‘‘tag-end’’ questions of the train set.
TABLE 20. Instruction-Fine-Tuning Llama v2 7B with 50% of ‘‘tag-end’’ questions of the train set.
TABLE 21. Instruction-Fine-Tuning Llama v2 7B with 100% of ‘‘tag-end’’ questions of the train set.
TABLE 22. Fine-Tuning Code Llama v2 7B with 25% of ‘‘tag-end’’ questions of the train set.


TABLE 23. Fine-Tuning Code Llama v2 7B with 50% of ‘‘tag-end’’ questions of the train set.
TABLE 24. Instruction-Fine-Tuning Code Llama v2 7B with 50% of ‘‘tag-end’’ questions of the train set.
TABLE 25. Instruction-Fine-Tuning Code Llama v2 7B with 100% of ‘‘tag-end’’ questions of the train set.
TABLE 26. Hyperparameters for our models.
TABLE 27. Number of epochs and batch sizes per configuration.

We can also note that all our copy models with ‘‘tag-within’’ questions have very similar performance, around 73% accuracy. This performance cap does not occur for ‘‘tag-end’’ questions. In the ‘‘tag-end’’ setting, the complete natural language question is passed to the models. Even though the KB elements are masked, the semantics of the question can be identified from its natural language formulation. We can see, for instance, that BART manages to exceed the 70% accuracy cap with the copy mechanism on ‘‘tag-end’’ questions.

2) OTHER MODELS
In this study, we used NMT techniques based on
encoder-decoders that are fully available to train and test. Moreover, we only use BART and T5 for PLMs and ConvSeq2Seq and Transformer for NPLMs, since we compare our results to current state-of-the-art approaches. We also tested four of the best LLMs for code generation, but there are others, such as Codex [38] and InstructGPT [39], that could be tested in the future.

FIGURE 3. Experimental diagram of our work.

VIII. CONCLUSION AND FUTURE WORK
We presented a set of experiments to compare and expand upon the state-of-the-art approaches for NMT-based SPARQL query generation and examined the impact of a


copy block. We compared non-pre-trained and pre-trained models. In the case of pre-trained models (BART, T5), we are the first to evaluate adding a copy layer for this task. Given the lack of homogeneous evaluation metrics in the state of the art, we also compare three datasets using the BLEU score, the accuracy, and the F1-score computed on non-empty answers returned by the generated queries. We also show the impact of question annotation on non-pre-trained and pre-trained models.

Our results demonstrate that the copy mechanism improves the performance of non-pre-trained models by a significant margin, including in the ‘‘tag-within’’ setting. We also show that the copy mechanism can improve pre-trained models’ performance in some cases (BART) and lower it in others (T5). We also provide a detailed analysis of the errors made by all models. In copy models, errors in generating URIs are due to a wrong choice among the hidden tokens (URIs) in the input. Finally, we note that even the best PLM and NPLM are not robust to question reformulation and, thus, do not have adequate generalization capabilities.

Our future plans include leveraging paraphrases and appropriate methods to learn question representations, utilizing models with both parametric and non-parametric memories to prevent incorrect URI generation, and decomposing tasks into smaller steps with LLMs.

A. ERROR ANALYSIS FOR LLMS
The following tables show the detailed error analysis, type by type, for the Llama v2 7B and Code Llama v2 7B models and settings. All experiments are conducted on LC-QuAD 2.0.

B. TECHNICAL DETAILS OF THE EXPERIMENTS
1) PRE-PROCESSING AND SEGMENTATION
Non-pretrained models need a fixed vocabulary created from the training corpus. Our pre-processing transforms sentences into sequences of lowercase tokens separated by spaces and standardizes SPARQL terms. We convert questions and queries into indices in this vocabulary before model input. In annotated questions, tagged terms (tokens between ≪ and ≫) are treated as single tokens. Pretrained models use their own vocabulary and tokenizer, which divides words into subwords. To compare with non-pretrained models, we apply the same pre-processing and add spaces before punctuation. Post-processing adjusts the segmenter outputs to ensure correct spacing. Following [40], we use special tokens for the SPARQL vocabulary: sentinel tokens for T5 and new tokens for BART. To reduce URI segmentation errors, we add URI prefixes to these tokens and rely on a post-processing step to recover correctly formatted queries, e.g., replacing ‘‘wd:’’ with ‘‘<http://www.wikidata.org/entity/>’’ and ‘‘dbo:’’ with ‘‘<http://dbpedia.org/ontology/>’’. Inputs are batched for model processing, with the tokenizer converting sentences into token sequences that are then replaced by indices in the vocabulary. Padding tokens <pad> ensure uniform sequence lengths in each batch. Even if the generated and expected queries differ in word count, padding ensures equal token numbers.
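As an illustration of this pre- and post-processing, the sketch below registers SPARQL keywords as dedicated tokens and expands prefixed identifiers back into full URIs after generation. It is a simplified sketch, not our exact pipeline: the Hugging Face tokenizer calls are standard, but the prefix table, token list, and function names are assumptions.

```python
import re
from transformers import AutoTokenizer

# Assumed prefix table; the real pipeline may cover more namespaces.
PREFIXES = {
    "wd:": "http://www.wikidata.org/entity/",
    "dbo:": "http://dbpedia.org/ontology/",
}

# SPARQL keywords registered as dedicated tokens so the tokenizer does not split them.
SPARQL_TOKENS = ["select", "distinct", "where", "filter", "{", "}"]

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
tokenizer.add_tokens(SPARQL_TOKENS)
# After adding tokens, the embedding matrix must be resized once:
# model.resize_token_embeddings(len(tokenizer))

def expand_prefixes(query: str) -> str:
    """Post-processing: turn prefixed identifiers back into full, bracketed URIs."""
    for short, base in PREFIXES.items():
        query = re.sub(short + r"(\w+)", "<" + base + r"\g<1>>", query)
    return query

print(expand_prefixes("select distinct ?sbj where { ?sbj ?p wd:Q76 }"))
# -> select distinct ?sbj where { ?sbj ?p <http://www.wikidata.org/entity/Q76> }
```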
2) ADAPTATION OF THE COPY MECHANISM TO MODELS
Non-pretrained models struggle with out-of-vocabulary words, particularly URIs from knowledge bases, leading to incorrect queries. Even when URIs appear during training, they are often rare, making them hard to generate correctly. The copy mechanism helps non-pretrained models address this issue. Pretrained models, though designed to transfer input to output, still face challenges in generating correct URIs, especially for Wikidata. Adapting the copy mechanism to pretrained models involves using the decoder’s outputs and the input sentence before masking. The challenge is masking URIs in the question and query. This requires adding the necessary tokens to the pretrained model’s tokenizer vocabulary, initializing new embeddings if needed, and incorporating the KB vocabulary without altering the model. All tokens with an index higher than the last index of the constructed vocabulary are masked, as they represent URIs or literals that should be copied directly from the input to the output.
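The sketch below gives one possible shape for such a copy layer on top of a pre-trained encoder-decoder: it scores encoder positions with the current decoder state and restricts the distribution to the positions that hold masked KB elements. It is a minimal, hypothetical PyTorch module, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class PositionalCopyLayer(nn.Module):
    """Scores input positions and copies the URI at the most probable one."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, decoder_state, encoder_states, candidate_mask):
        # decoder_state:  (batch, hidden)        state at the step producing a URI slot
        # encoder_states: (batch, src_len, hidden)
        # candidate_mask: (batch, src_len) bool, True where a masked KB element sits
        scores = torch.einsum("bh,bsh->bs",
                              self.query_proj(decoder_state), encoder_states)
        scores = scores.masked_fill(~candidate_mask, float("-inf"))
        copy_probs = torch.softmax(scores, dim=-1)   # distribution over candidate positions
        return copy_probs                            # argmax gives the position to copy from

# Toy usage: 2 candidate URI positions among 5 source tokens.
layer = PositionalCopyLayer(hidden_size=8)
probs = layer(torch.randn(1, 8), torch.randn(1, 5, 8),
              torch.tensor([[False, True, False, True, False]]))
print(probs)  # non-zero mass only on positions 1 and 3
```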
3) SELECTION OF HYPERPARAMETERS
In our experiments, we did not focus on hyperparameter optimization but rather on comparing approaches. Therefore, to maintain the validity of the comparisons, the hyperparameters are based on the suggestions of state-of-the-art approaches [1], [8], [16] and the limits of the hardware resources available to us. For the Transformer and ConvSeq2Seq architectures, the hyperparameters are inspired by [16]. For the BART and T5 models, the hyperparameters follow the recommendations of [8]. Table 26 summarizes the main hyperparameters of each architecture. Only T5-small was used, as [8] reports better performance with this version on LC-QuAD 1.0 and similar performance between T5-small and T5-large on LC-QuAD 2.0. BART-base is used to compare results with those of [8].

Table 27 shows the number of epochs and batch sizes used. Adjustments were made to fit the machine RAM, notably reducing both for DBNQA. Lowering the batch size for DBNQA with the Transformer architecture significantly impacted performance, so the batch size was maximized within hardware limits.

4) TRAINING AND VALIDATION
For each run, the model is trained for the indicated number of epochs, and the one with the best validation loss is kept for testing. The loss function used is cross-entropy. The training-validation-evaluation process is repeated three times with different random seeds. The reported results are the averages of the three outcomes. Training and validation are performed using the teacher forcing method since the expected query is known. The entire expected sequence is given to the decoder with a causal attention mask, ensuring each token is influenced only by the previous tokens. Thus, the output at each step i is the next predicted token given the correct sequence of tokens up to step i. This method significantly speeds up training as it requires only one pass through the decoder for each data point. Therefore, no query generation occurs during training or validation. It is worth noting that batch sampling was done differently depending on the experiments. For PLMs without copying, batches are randomly sampled independently, whereas for the other experiments, the training data is randomly partitioned to obtain batches (each data point is seen once per epoch). The first version is inspired by [8] and the second by [1]. The choice to use two different methods was made due to the superior results of the first method in the case of base PLMs.
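A minimal sketch of such a teacher-forcing training step is shown below. It is illustrative only: the model is assumed to be a callable returning next-token logits, and with Hugging Face seq2seq models the same shifted cross-entropy is computed internally when labels are passed.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_step(model, optimizer, src_ids, tgt_ids, pad_id):
    """One training step: the gold query is fed to the decoder, shifted by one."""
    decoder_input = tgt_ids[:, :-1]          # <bos> q1 q2 ... q_{n-1}
    labels = tgt_ids[:, 1:]                  # q1 q2 ... q_n
    logits = model(src_ids, decoder_input)   # (batch, len-1, vocab), causal mask inside
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,                 # padding positions do not contribute
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```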


TABLE 28. Examples of errors on the generated SPARQL queries (1).


TABLE 29. Examples of errors on the generated SPARQL queries (2).

5) GENERATION ON THE TEST SET
The goal of generation is to produce the sequence of tokens with the highest product of probabilities generated by the decoder. This is a combinatorial problem that cannot be solved in a reasonable time: given maximum sequence lengths of l tokens and output vocabularies of about 10^4 tokens, we would need to evaluate the probabilities of approximately 10^(4×l) sequences. We therefore used a greedy search heuristic, meaning that only one sequence is generated at a time, consisting of the most probable token at each generation step.
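An illustrative greedy decoding loop is sketched below, under the same assumption as above that the model is a callable returning next-token logits; eos_id and max_len are hypothetical parameters.

```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=128):
    """Generate one sequence by taking the most probable token at each step."""
    generated = [bos_id]
    for _ in range(max_len):
        decoder_input = torch.tensor([generated])
        logits = model(src_ids, decoder_input)    # (1, cur_len, vocab)
        next_token = int(logits[0, -1].argmax())  # most probable next token
        generated.append(next_token)
        if next_token == eos_id:
            break
    return generated[1:]  # drop the <bos> marker
```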
C. EXPERIMENTAL DIAGRAM
Figure 3 summarizes our experiments. We generated two tagged versions of the raw questions from the datasets. Using these three formats (raw, tag-within, tag-end), we trained non-pretrained models, pre-trained models, and large language models (LLMs) for SPARQL query generation. For non-pretrained and pre-trained models, we experiment with both base models and models with copy mechanisms. For LLMs, we used base models with standard and instruction fine-tuning. Finally, we tested the generalization capabilities of the models with reformulated questions.
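For reference, the snippet below sketches the difference between the three question formats on a toy example. The exact annotation format in our datasets may differ slightly; the ≪ ≫ delimiters follow the convention described above, and the URI and wording are illustrative only.

```python
# Illustrative only: conveys the difference between the tagging strategies.
raw_question = "where was barack obama born"

# tag-within: the KB element is inserted next to its mention, inside the question.
tag_within = ("where was ≪ http://dbpedia.org/resource/Barack_Obama ≫ "
              "barack obama born")

# tag-end: the question is left intact and the KB element is appended at the end,
# next to its label.
tag_end = ("where was barack obama born "
           "≪ http://dbpedia.org/resource/Barack_Obama ≫ barack obama")

for name, text in [("raw", raw_question), ("tag-within", tag_within), ("tag-end", tag_end)]:
    print(f"{name:11s}{text}")
```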


D. EXAMPLES OF ERRORS MADE BY THE MODELS DURING SPARQL QUERY GENERATION
We randomly selected models and chose five error-containing SPARQL queries. We considered different configurations: base models, models with the copy mechanism, and various question versions (raw_question, tag_within, tag_end). The examples are shown in Tables 28 and 29.

ACKNOWLEDGMENT
The authors would like to express their gratitude to Compute Canada (Calcul Quebec) for providing computational resources.

REFERENCES
[1] R. Hirigoyen, A. Zouaq, and S. Reyd, ‘‘A copy mechanism for handling knowledge base elements in SPARQL neural machine translation,’’ in Proc. Findings Assoc. Comput. Linguistics (AACL-IJCNLP), Nov. 2022, pp. 226–236. [Online]. Available: https://aclanthology.org/2022.findings-aacl.22
[2] J.-H. Lin and E. J.-L. Lu, ‘‘SPARQL generation with an NMT-based approach,’’ J. Web Eng., vol. 21, pp. 1471–1490, Jul. 2022.
[3] P. A. K. K. Diallo, S. Reyd, and A. Zouaq, ‘‘A comprehensive evaluation of neural SPARQL query generation from natural language questions,’’ 2023, arXiv:2304.07772.
[4] H. Tran, L. Phan, J. Anibal, B. T. Nguyen, and T.-S. Nguyen, ‘‘SPBERT: An efficient pre-training BERT on SPARQL queries for question answering over knowledge graphs,’’ in Proc. 28th Int. Conf. Neural Inf. Process. (ICONIP), Bali, Indonesia. Cham, Switzerland: Springer, Dec. 2021, pp. 512–523.
[5] X. Huang, J.-J. Kim, and B. Zou, ‘‘Unseen entity handling in complex question answering over knowledge base via language generation,’’ in Proc. Findings Assoc. Comput. Linguistics (EMNLP), Nov. 2021, pp. 547–557. [Online]. Available: https://aclanthology.org/2021.findings-emnlp.50
[6] B. B. Naik, T. J. V. R. Reddy, K. R. V. Karthik, and P. Kuila, ‘‘An SQL query generator for cross-domain human language based questions based on NLP model,’’ Multimedia Tools Appl., vol. 83, no. 4, pp. 11861–11884, Jan. 2024.
[7] J. Lehmann, P. Gattogi, D. Bhandiwad, S. Ferré, and S. Vahdati, ‘‘Language models as controlled natural language semantic parsers for knowledge graph question answering,’’ in Proc. Eur. Conf. Artif. Intell. (ECAI), vol. 372, 2023, pp. 1348–1356.
[8] D. Banerjee, P. A. Nair, J. N. Kaur, R. Usbeck, and C. Biemann, ‘‘Modern baselines for SPARQL semantic parsing,’’ in Proc. 45th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Madrid, Spain, E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, and G. Kazai, Eds., Jul. 2022, pp. 2260–2265, doi: 10.1145/3477495.3531841.
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, ‘‘BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,’’ in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds., Jul. 2020, pp. 7871–7880, doi: 10.18653/v1/2020.acl-main.703.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, ‘‘Exploring the limits of transfer learning with a unified text-to-text transformer,’’ J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-074.html
[11] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, ‘‘LLaMA: Open and efficient foundation language models,’’ 2023, arXiv:2302.13971.
[12] B. Rozière et al., ‘‘Code Llama: Open foundation models for code,’’ 2023, arXiv:2308.12950.
[13] F. F. Luz and M. Finger, ‘‘Semantic parsing natural language into SPARQL: Improving target language representation with neural attention,’’ 2018, arXiv:1803.04329.
[14] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[15] T. Soru, E. Marx, D. Moussallem, G. Publio, A. Valdestilhas, D. Esteves, and C. B. Neto, ‘‘SPARQL as a foreign language,’’ in Proc. Posters Demos Track 13th Int. Conf. Semantic Syst. (SEMANTiCS), vol. 2044, Amsterdam, The Netherlands, J. D. Fernández and S. Hellmann, Eds., Sep. 2017, pp. 1–7. [Online]. Available: http://ceur-ws.org/Vol-2044/paper14/
[16] X. Yin, D. Gromann, and S. Rudolph, ‘‘Neural machine translating from natural language to SPARQL,’’ Future Gener. Comput. Syst., vol. 117, pp. 510–519, Apr. 2021, doi: 10.1016/j.future.2020.12.013.
[17] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, ‘‘Convolutional sequence to sequence learning,’’ in Proc. 34th Int. Conf. Mach. Learn., in Proceedings of Machine Learning Research, vol. 70, D. Precup and Y. W. Teh, Eds., Aug. 2017, pp. 1243–1252. [Online]. Available: http://proceedings.mlr.press/v70/gehring17a.html
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., Annu. Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, Dec. 2017, pp. 5998–6008. [Online]. Available: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[19] P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann, ‘‘LC-QuAD: A corpus for complex question answering over knowledge graphs,’’ in Proc. Int. Semantic Web Conf., in Lecture Notes in Computer Science, Vienna, Austria, C. d’Amato, M. Fernández, V. A. M. Tamma, F. Lécué, P. Cudré-Mauroux, J. F. Sequeda, C. Lange, and J. Heflin, Eds., Cham, Switzerland: Springer, 2017, pp. 210–218, doi: 10.1007/978-3-319-68204-4_22.
[20] A.-K. Hartmann, T. Soru, and E. Marx. (Apr. 2018). Generating a Large Dataset for Neural Question Answering Over the DBpedia Knowledge Base. [Online]. Available: https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ‘‘BLEU: A method for automatic evaluation of machine translation,’’ in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040/
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Minneapolis, MI, USA, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, ‘‘Language models are unsupervised multitask learners,’’ OpenAI Blog, vol. 1, no. 8, pp. 2–6, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
[24] M. R. A. H. Rony, U. Kumar, R. Teucher, L. Kovriguina, and J. Lehmann, ‘‘SGPT: A generative approach for SPARQL query generation from natural language questions,’’ IEEE Access, vol. 10, pp. 70712–70723, 2022, doi: 10.1109/ACCESS.2022.3188714.
[25] A. See, P. J. Liu, and C. D. Manning, ‘‘Get to the point: Summarization with pointer-generator networks,’’ in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, Jul. 2017, pp. 1073–1083, doi: 10.18653/v1/p17-1099.
[26] M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann, ‘‘LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia,’’ in Proc. 18th Int. Semantic Web Conf. (ISWC), in Lecture Notes in Computer Science, vol. 11779, Auckland, New Zealand, C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. F. Cruz, A. Hogan, J. Song, M. Lefrançois, and F. Gandon, Eds., Cham, Switzerland: Springer, Oct. 2019, pp. 69–78, doi: 10.1007/978-3-030-30796-7_5.
[27] J. Gu, Z. Lu, H. Li, and V. O. K. Li, ‘‘Incorporating copying mechanism in sequence-to-sequence learning,’’ in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, Berlin, Germany, Aug. 2016, pp. 1631–1640, doi: 10.18653/v1/p16-1154.
[28] N. Muennighoff, ‘‘SGPT: GPT sentence embeddings for semantic search,’’ 2022, arXiv:2202.08904.
[29] S. Yang, M. Teng, X. Dong, and F. Bo, ‘‘LLM-based SPARQL generation with selected schema from large scale knowledge base,’’ in Proc. China Conf. Knowl. Graph Semantic Comput. Cham, Switzerland: Springer, 2023, pp. 304–316.
[30] L. Kovriguina, R. Teucher, D. Radyush, and D. Mouromtsev, ‘‘SPARQLGEN: One-shot prompt-based approach for SPARQL query generation,’’ in Proc. Int. Conf. Semantic Syst., 2023, pp. 3–4. [Online]. Available: https://api.semanticscholar.org/CorpusID:265309659
[31] H. Luo, H. E, Z. Tang, S. Peng, Y. Guo, W. Zhang, C. Ma, G. Dong, M. Song, W. Lin, Y. Zhu, and L. A. Tuan, ‘‘ChatKBQA: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models,’’ 2023, arXiv:2310.08975.
[32] T. B. Brown et al., ‘‘Language models are few-shot learners,’’ in Proc. Adv. Neural Inf. Process. Syst., Annu. Conf. Neural Inf. Process. Syst. (NeurIPS), H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds. Red Hook, NY, USA: Curran Associates, Dec. 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[33] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, ‘‘Mistral 7B,’’ 2023, arXiv:2310.06825.
[34] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, ‘‘Judging LLM-as-a-judge with MT-bench and chatbot arena,’’ 2023, arXiv:2306.05685.
[35] Y. Zhou, X. Geng, T. Shen, W. Zhang, and D. Jiang, ‘‘Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., 2021, pp. 5822–5834.
[36] S. Purkayastha, S. Dana, D. Garg, D. Khandelwal, and G. P. S. Bhargav, ‘‘A deep neural approach to KGQA via SPARQL silhouette generation,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2022, pp. 1–8.
[37] J. Ding, W. Hu, Q. Xu, and Y. Qu, ‘‘Leveraging frequent query substructures to generate formal queries for complex question answering,’’ 2019, arXiv:1908.11053.
[38] M. Chen et al., ‘‘Evaluating large language models trained on code,’’ 2021, arXiv:2107.03374.
[39] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, ‘‘Training language models to follow instructions with human feedback,’’ 2022, arXiv:2203.02155.
[40] B. M. Lake, ‘‘Compositional generalization through meta sequence-to-sequence learning,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–11.

PAPA ABDOU KARIM KAROU DIALLO received the master's degree from the School of Control and Computer Engineering, North China Electric Power University, Beijing, in 2021. He is currently pursuing the Ph.D. degree with Polytechnique Montreal. His current research interests include neural machine translation, question answering, language models, knowledge bases, information retrieval, and the semantic web.

SAMUEL REYD received the master's degree from Polytechnique Montreal, where he studied SPARQL query generation with neural machine translation methods. He is currently pursuing the Ph.D. degree with Telecom Paris. He focused on the comparison of state-of-the-art approaches, including neural architectures and annotation methods, the development of a copy mechanism for pre-trained models, and the study of the generalization capabilities of these techniques. His research interests include cognitive explainability methods for intricate cyber-physical systems.

AMAL ZOUAQ is a Full Professor at Polytechnique Montreal and an Associate Academic Member at MILA (Quebec Artificial Intelligence Institute). She holds an FRQS (Dual) Chair in AI and Digital Health. She is also an IVADO Professor, a member of the CLIQ-AI consortium (Computational Linguistics in Québec), and an Adjunct Professor at the University of Ottawa. Her research interests include artificial intelligence, natural language processing, and the semantic web. She is the Director of the LAMA-WeST Research Laboratory (http://www.labowest.ca/), which focuses on challenges related to representation learning, natural language interfaces and question answering, automated reasoning, knowledge base learning and alignment, ontology learning and modeling, and information extraction and generation.