Typological transformation | Template: 'capital of' (1:1, P36) | Template: 'diplomatic relation' (M:N, P530)
simple | The capital of [S] is [O]. | [S] maintains diplomatic relations with [O].
compound | [S] is a country and it's capital is [O]. | [S] maintains diplomatic relations with countries and [O] is one of them.
complex | [S] is the country, who's capital is [O]. | [S] is a country that maintains diplomatic relations with [O].
compound-complex | [O] is a city and it is the city that is the capital of [S]. | [S] is a country that maintains diplomatic relations with [O].

Table 1: Templates for 'capital of' (1:1) and 'diplomatic relation' (M:N).
present initial experiments to study this hypothesis and share manually created prompts with the community (https://github.com/Thrasolt/ContextualKnowledgeOfLMs). We have extended T-REx to incorporate different grammatical structures alongside the relations already provided. The key findings of our work are that a simple sentence structure performs better for relational knowledge extraction than complex grammatical constructions. However, the impact of sentence structure is negligible for simpler relations (1:1). Moreover, these relations are easier to extract than complex relations (N:M).

Overall, this paper is organised as follows: First (Related Work), we briefly cover the state of the research relevant to this paper. The main section (Preliminary Experiments) is divided into four subsections, three of which describe the methodology (Data, Task & Metrics, Prompt Engineering), while the final subsection (Results) presents the performance of different models in our experimental setting. We conclude our work with a discussion and an outlook (Conclusion and Future Work).
2 RELATED WORK

Since the proposal of transformer-based LLMs which learn representations through Masked Language Modelling (the MLM task) [2], two research fields have emerged: (1) understanding the knowledge inherent in LLMs [25], and (2) enhancing the LLMs' inherent knowledge [33].
Understanding the knowledge inherent in LLMs:
Current research generally proposes two different methods to test the self-taught knowledge of LLMs. (a) Prompts that pose knowledge-related tasks in a cloze-text format. This research direction is heavily influenced by the LAMA-probe proposed by [23], a cloze-text dataset that encodes simple relational facts about real-world entities; e.g., the prompt 'Where was Dante born [MASK]?' is paired with 'Florence'. Using BERT [2] to predict the missing tokens, the authors show that BERT already carries a surprisingly high amount of relational knowledge. Following Petroni et al.'s [15, 23] findings, Heinzerling et al. [8] focus on entity representations, storage capacity, and paraphrased queries; however, they draw a more critical picture of the storage and query capabilities of these models. Moreover, Roberts et al. [24] investigate how much knowledge can be stored in model parameters; to approximate the storage capacity, they over-fit the model on knowledge triples. Since then, many probing suites have been published to understand the impact of memorization and knowledge types (KMIR [6], KAMEL [13]). Furthermore, the performance improvements achieved through fine-tuning LLMs on the provided prompts have been investigated; for this, LPAQA [12] was created, an archive with different prompts as well as train, validation, and test splits for the T-REx subset of the LAMA-probe.
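To make this cloze-style querying concrete, the following is a minimal sketch using the fill-mask pipeline of the HuggingFace transformers library. The model name, the declarative prompt variant, and the top-k value are illustrative assumptions on our part, not the exact setup of the cited works.

    # Minimal sketch of LAMA-style cloze probing. Assumes the HuggingFace
    # `transformers` package; model and prompt are illustrative choices.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-cased")

    # The model ranks vocabulary tokens for the [MASK] slot; for this
    # fact, the gold answer is 'Florence'.
    for pred in fill("Dante was born in [MASK].", top_k=5):
        print(f"{pred['token_str']:>12}  {pred['score']:.3f}")

A probing suite then scores the fact as known if the gold token ranks first (top-1) or among the k highest-scoring tokens (top-k).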
(b) In addition to prompts, probing tasks are often used to investigate the knowledge encoded in LLMs. This method uses auxiliary classifiers with features derived from the frozen network to understand the inherent information. For transformer-based language models, probing tasks can be solved by using the output representations [29], the attention information [1], or the information change across the different layers [11, 29]. The information derivable from those features has been used to understand several aspects of the contextualization of the representations [29], the syntactic truthfulness of the attention mechanism [1], and the workflow of the layer-wise processing [11].
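To illustrate method (b), the sketch below derives features from a frozen encoder and fits a small auxiliary classifier on top of them. The model name, the layer index, and the toy two-sentence task are our own assumptions; the point is only the mechanics of probing frozen representations.

    # Sketch of a probing classifier on frozen encoder features.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    enc = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
    enc.eval()  # the encoder stays frozen; only the probe is trained

    def features(texts, layer=8):
        """[CLS] representation from a chosen layer for each sentence."""
        with torch.no_grad():
            batch = tok(texts, padding=True, return_tensors="pt")
            hidden = enc(**batch).hidden_states[layer]  # (batch, seq, dim)
        return hidden[:, 0, :].numpy()

    # Toy probe task: separate two (hypothetical) syntactic classes.
    texts = ["The capital of France is Paris.",
             "France is the country, whose capital is Paris."]
    labels = [0, 1]  # 0 = simple, 1 = complex
    probe = LogisticRegression().fit(features(texts), labels)

How well such a probe generalizes to held-out sentences is then taken as evidence for what the frozen features encode.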
Enhancing the LLMs' inherent knowledge:
Various types of information are used to enhance the model's inherent knowledge. Approaches range from enhancing lexical word relations [18], in-context semantic abstractions [19], sentiment sensitivity [17, 30], and entity-centred information [5, 21] to improving any knowledge type [20, 31]. Knowledge-enhancement approaches also differ in their infusion technique. Proposals that stay closest to pure language modelling only change the probabilities of the corruption task in a way that teaches stance [16] or entity knowledge [27]. Another infusion strategy enhances the model by simultaneously teaching a secondary learning objective; this has been applied to entity [32], sentiment [30], and general linguistic knowledge [20].

In this paper, we focus on understanding the knowledge inherent in LLMs. In particular, we aim to study the impact of syntactical differences while treating LLMs as a black-box model. In comparison to Heinzerling et al. [8], we test paraphrasing motivated by linguistics. Additionally, we open the field for new probing tasks [29], i.e. how sentence processing [11] impacts knowledge inference. Thus, we gain insight into information encoding and potential directions for knowledge-enhancement strategies.

3 PRELIMINARY EXPERIMENTS

3.1 Data

In this work, we propose that utilizing cloze-text prompts offers a direct means of studying the impact of syntactic features on knowledge retrieval in language models. Knowledge capturing ...
A simple sentence is a sentence that contains only one main clause (the LAMA-probe templates). A sentence that includes two or more independent clauses is known as a compound sentence, while a sentence that contains an independent clause and one or more dependent clauses is known as a complex sentence. Lastly, a sentence that includes two or more independent clauses and at least one dependent clause is known as a compound-complex sentence. Table 1 shows an example of the four templates for the 1:1 relation P36, describing the predicate 'capital of', and the M:N relation P530, describing the 'diplomatic relation with'.
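The sketch below shows how such templates can be instantiated for cloze querying. The dictionary reproduces the 'capital of' templates from Table 1 verbatim; the helper function and the [MASK] convention for the object slot are our own illustration, not the released dataset format.

    # The four typological templates for relation P36 ('capital of'),
    # copied from Table 1. [S] marks the subject slot, [O] the object slot.
    TEMPLATES_P36 = {
        "simple":           "The capital of [S] is [O].",
        "compound":         "[S] is a country and it's capital is [O].",
        "complex":          "[S] is the country, who's capital is [O].",
        "compound-complex": "[O] is a city and it is the city that is the capital of [S].",
    }

    def instantiate(template, subject, obj="[MASK]"):
        """Fill the subject slot and mask the object slot for querying."""
        return template.replace("[S]", subject).replace("[O]", obj)

    for typology, template in TEMPLATES_P36.items():
        print(f"{typology:>16}: {instantiate(template, 'France')}")

Each relation thus yields four semantically equivalent prompts that differ only in their syntactic typology.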
3.4 Results

We applied these template-based prompts to three BERT variants: BERT-large, BERT-base, and BERT-base-multilingual. We include multilingual BERT to understand the impact of named-entity mentions in different languages. Our experiments show that all investigated LLMs perform best on the simple sentence type. Additionally, we discover that the cased models outperform the uncased models by a large margin.

Table 2 shows the top-1 accuracy of each model, in percent, for all four sentence types. We report slightly worse results for the top-1 accuracy of the simple sentence type (BERT-base-cased -3.0, BERT-large-cased -1.1) on the LAMA-probe than in the original paper [23]. In contrast to Petroni et al. [23], we consistently evaluated over the whole vocabulary, which had a notable influence on the reliability of the results for the N:M relations; specifically, Petroni et al. [23] exclude all other valid entities except for the one they test. Nonetheless, our results are reasonably close, given the differing results reported on the same data in other works [34]. Generally, the values for the correct tokens are surprisingly high: the best model was able to predict one-third of the masked tokens correctly, although most comparable results achieved by the cased models are around 15 to 20 percent accuracy. Most importantly, the average top-1 accuracy varies significantly between the different sentence types, indicating that grammatical structure influences a model's ability to retrieve relational knowledge. This is true for all models under investigation.
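For clarity, the sketch below shows what evaluating over the whole vocabulary means for the top-k metric: every vocabulary token competes for the [MASK] position, and a prediction counts as a hit if the gold object token is among the k highest-scoring tokens. The model name and the single-token simplification are our assumptions.

    # Sketch of top-k accuracy over the full vocabulary (no candidate
    # filtering). Multi-token objects are ignored for simplicity.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-large-cased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-large-cased")
    mlm.eval()

    def topk_hit(prompt, gold_token, k=10):
        """True if gold_token ranks among the k best tokens for [MASK]."""
        inputs = tok(prompt, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
        with torch.no_grad():
            logits = mlm(**inputs).logits[0, mask_pos]  # full-vocab scores
        top_ids = logits.topk(k).indices.tolist()
        return tok.convert_tokens_to_ids(gold_token) in top_ids

    print(topk_hit("The capital of France is [MASK].", "Paris", k=10))

Excluding other valid entities from the ranking, as in [23], makes the task easier; scoring against the unfiltered vocabulary is stricter but closer to open-ended retrieval.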
From this, we draw four conclusions. First, the BERT-large-cased model outperforms all other models on all four sentence types by at least two and at most four percentage points. Second, there is a chasm in performance between cased and uncased models, as the accuracy of the uncased models is comparatively low. Third, every model has a higher prediction accuracy when queried with the simple sentence type than with the other three types. Finally, the differences in scores among the non-simple sentence types are significantly lower than the variations within the simple sentence type. These observations also apply to the results based on the top-10 accuracy, albeit with the expected higher accuracy values; see Table 3.

Table 5 and Table 6 show the average accuracy results of the four sentence types for each relation cardinality, for top-1 and top-10 respectively, for the BERT-large-cased model. Both sets of results show that the simple sentence type enables a higher accuracy for all three cardinalities. Additionally, in both sets of results the performance decreases with increasing cardinality, which is intuitive, as the difficulty increases with the number of possible subjects and objects. For N:M relations, top-1 is an inappropriate metric, as only one guess is allowed per subject.

The results are closest for the cardinality 1:1 and furthest apart for N:M, implying that relation extraction works best for simple sentence types and simple relations (1:1). The performance noticeably decreases when either sentence or relation complexity increases. Additionally, the sentence structure (typology) has close to no influence on the top-10 performance for the simple relations (1:1). However, relations with less mutual information between subject and object co-occurrence (N:1, M:N) show a large decrease in performance under changes in sentence typology. Hence, the MLM task does not incorporate the rules of syntactical change while preserving semantic equivalence.

4 CONCLUSION AND FUTURE WORK

In this paper, we investigate the impact of prompt syntax on the knowledge-retrieval performance of LLMs. To achieve this, we expand the well-known and commonly used T-REx subset of the LAMA-probe to support different syntactical structures of prompts. Our preliminary results show that the impact of syntax is only marginal for simple relations (1:1). In general, simple prompts should be the preferred way of querying.
Most importantly, we show that LLMs indeed struggle to generalise knowledge across grammatical structures. This finding highlights the importance of the relationship between syntax and semantics within LLMs as a crossroads of human and machine language representation. Consequently, we will focus on a deeper analysis of the disparities in information coding for typologically different templates. These disparities may be reflected in the attention mechanism [1], the predicted token distribution [3], or the differences in mask representation among the various typologies per relation.
REFERENCES

[1] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look At? An Analysis of BERT's Attention. arXiv preprint arXiv:1906.04341 (2019).
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[3] Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals. Transactions of the Association for Computational Linguistics 9 (2021), 160–175.
[4] Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
[5] Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. 2020. Entities as Experts: Sparse Memory Access with Entity Supervision. arXiv preprint arXiv:2004.07202 (2020).
[6] Daniel Gao, Yantao Jia, Lei Li, Chengzhen Fu, Zhicheng Dou, Hao Jiang, Xinyu Zhang, Lei Chen, and Zhao Cao. 2022. KMIR: A Benchmark for Evaluating Knowledge Memorization, Identification and Reasoning Abilities of Language Models. arXiv preprint arXiv:2202.13529 (2022).
[7] Yoav Goldberg. 2019. Assessing BERT's Syntactic Abilities. arXiv:1901.05287 [cs.CL]
[8] Benjamin Heinzerling and Kentaro Inui. 2020. Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries. arXiv preprint arXiv:2008.09036 (2020).
[9] Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger Levy. 2020. A Systematic Assessment of Syntactic Generalization in Neural Language Models. In Proceedings of ACL. Association for Computational Linguistics.
[10] Rodney Huddleston. 1984. Introduction to the Grammar of English. Cambridge University Press.
[11] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What Does BERT Learn About the Structure of Language? In ACL 2019, 57th Annual Meeting of the Association for Computational Linguistics.
[12] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How Can We Know What Language Models Know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
[13] Jan-Christoph Kalo and Leandra Fichtel. 2022. KAMEL: Knowledge Analysis with Multitoken Entities in Language Models. In Proceedings of the Conference on Automated Knowledge Base Construction.
[14] Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are Pretrained Language Models Symbolic Reasoners Over Knowledge? arXiv preprint arXiv:2006.10413 (2020).
[15] Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Pre-trained Language Models as Symbolic Reasoners over Knowledge? In Proceedings of the 24th Conference on Computational Natural Language Learning.
[16] Kornraphop Kawintiranon and Lisa Singh. 2021. Knowledge Enhanced Masked Language Model for Stance Detection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4725–4735.
[17] Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. 2019. SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge. arXiv preprint arXiv:1911.02493 (2019).
[18] Anne Lauscher, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš. 2019. Specializing Unsupervised Pretraining Models for Word-Level Semantic Similarity. arXiv:1909.02339 [cs.CL]
[19] Yoav Levine, Barak Lenz, Or Dagan, Ori Ram, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving Some Sense into BERT. arXiv preprint arXiv:1908.05646 (2019).
[20] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-Task Deep Neural Networks for Natural Language Understanding. arXiv preprint arXiv:1901.11504 (2019).
[21] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. arXiv preprint arXiv:1909.04164 (2019).
[22] Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How Context Affects Language Models' Factual Predictions. arXiv:2005.04611 [cs.CL]
[23] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). ACL, Hong Kong, China, 2463–2473. https://doi.org/10.18653/v1/D19-1250
[24] Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model? arXiv preprint arXiv:2002.08910 (2020).
[25] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics 8 (2021), 842–866.
[26] Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 5027–5038. https://doi.org/10.18653/v1/D18-1548
[27] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced Representation through Knowledge Integration. arXiv preprint arXiv:1904.09223 (2019).
[28] Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shijing Si, Dinghan Shen, Dong Wang, and Lawrence Carin. 2021. Syntactic Knowledge-Infused Transformer and BERT Models. In CEUR Workshop Proceedings, Vol. 3052.
[29] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2019. What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations. arXiv preprint arXiv:1905.06316 (2019).
[30] Hao Tian, Can Gao, Xinyan Xiao, Hao Liu, Bolei He, Hua Wu, Haifeng Wang, and Feng Wu. 2020. SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis. arXiv preprint arXiv:2005.05635 (2020).
[31] Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-Adapter: Infusing Knowledge into Pre-trained Models with Adapters. arXiv preprint arXiv:2002.01808 (2020).
[32] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021. KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. Transactions of the Association for Computational Linguistics 9 (2021), 176–194.
[33] Chaoqi Zhen, Yanlei Shang, Xiangyu Liu, Yifei Li, Yong Chen, and Dell Zhang. 2022. A Survey on Knowledge-Enhanced Pre-trained Language Models. arXiv preprint arXiv:2212.13428 (2022).
[34] Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual Probing Is [MASK]: Learning vs. Learning to Recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).