[Figure 2 shows two panels: (a) standard fine-tuning for RE, with a CE loss over relation probabilities (e.g., no_relation, per:employee_of, org:founded_by, per:date_of_birth, per:stateorprovinces_of_residence); and (b) KnowPrompt, with an MLM head, a relation embedding head, knowledge injection of virtual type words (e.g., person, organization, date) and virtual answer words, a structured loss, and synergistic optimization. Example input: "[CLS] Steve Jobs, co-founder of Apple. [SEP] Apple [MASK] Steve Jobs [SEP]".]
Figure 2: Model architectures of fine-tuning for RE (a) and our proposed KnowPrompt approach (b) (best viewed in color). The answer words described in the paper refer to the virtual answer words we propose.
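To make the prompt construction in Figure 2(b) concrete, here is a minimal sketch; the marker and separator strings are plain-text stand-ins for the learned special tokens, not the paper's exact implementation.

```python
# A minimal sketch of building the KnowPrompt input shown in Figure 2(b): the
# sentence followed by a template "subject [MASK] object", with the entities
# wrapped by the virtual type words [sub]/[obj]. The marker strings here are
# illustrative; in the model they are learnable special tokens.
def build_prompt(sentence: str, subject: str, obj: str) -> str:
    template = f"[sub] {subject} [sub] [MASK] [obj] {obj} [obj]"
    return f"{sentence} [SEP] {template} [SEP]"

# Mirrors the example in Figure 2 ([CLS] is prepended by the tokenizer).
print(build_prompt("Steve Jobs, co-founder of Apple.", "Apple", "Steve Jobs"))
```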
... layer of L. Since the virtual type words are designed based on the prior knowledge within relation labels, they can initially perceive the range of entity types and can be further optimized according to the context to express semantic information close to the actual entity type, playing a role similar to the Typed Marker.

Relation Knowledge Injection. Previous studies on prompt-tuning usually form a one-to-one mapping between a label word in the vocabulary and a task label by automatic generation, which incurs a large computational cost in the search process and fails to leverage the abundant semantic knowledge in relation labels for RE. To this end, we assume that there exists a virtual answer word 𝑣′ ∈ V′ in the vocabulary space of PLMs which can represent the implicit semantics of the relation. From this perspective, we expand the MLM head layer of L with extra learnable relation embeddings, serving as the virtual answer word set V′, to completely represent the corresponding relation labels Y. Thus, we can reformalize 𝑝(𝑦|𝑥) as the probability distribution over V′ at the masked position. We propose to encode the semantic knowledge of the labels and thereby facilitate RE. Concretely, we set 𝜙_R = [𝜙_{r_1}, 𝜙_{r_2}, ..., 𝜙_{r_m}] and C_R = [C_{r_1}, C_{r_2}, ..., C_{r_m}], where 𝜙_r represents the probability distribution over the candidate set C_r of semantic words obtained by disassembling the relation label r, and m is the number of relation labels. Furthermore, we adopt a weighted average with 𝜙_r over the embeddings of the words in C_r to initialize these relation embeddings, which injects the semantic knowledge of relations. The specific decomposition process is shown in Table 1, and the learnable relation embedding of the virtual answer word 𝑣′ = M(𝑦) is initialized as follows:

$$\hat{e}_{[rel]}(v') = \phi_r \cdot \mathbf{e}(C_r), \qquad (3)$$

where ê_{[rel]}(𝑣′) is the embedding of the virtual answer word 𝑣′ and e represents the word-embedding layer of L. Note that this knowledgeable initialization of virtual answer words can be regarded as a strong anchor; we can further optimize them based on context to express optimal semantic information, leading to better performance.

4.2 Synergistic Optimization with Knowledge Constraints

Since there exist close interactions and connections between entity types and relation labels, and the virtual type words as well as the answer words should be associated with the surrounding context, we further introduce a synergistic optimization method with implicit structural constraints over the parameter set {ê_{[sub]}, ê_{[obj]}, ê_{[rel]}(V′)} of virtual type words and virtual answer words.

Context-aware Prompt Calibration. Although our virtual type and answer words are initialized based on knowledge, they may not be optimal in the latent variable space, and they should be associated with the surrounding context. Thus, further optimization is necessary to calibrate their representations by perceiving the context. Given the probability distribution 𝑝(𝑦|𝑥) = 𝑝([MASK] = V′ | 𝑥_prompt) over V′ at the masked position, we optimize the virtual type words as well as the answer words with the loss computed as the cross-entropy between y and 𝑝(𝑦|𝑥):

$$\mathcal{J}_{[\mathrm{MASK}]} = -\frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \mathbf{y} \log p(y|x), \qquad (4)$$

where |X| represents the number of instances in the training set X. The learnable words may adaptively obtain optimal representations for prompt-tuning through this synergistic type and answer optimization.
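As an illustration of Eq. (4), the following minimal sketch restricts the scores at the [MASK] position to the virtual answer words and applies cross-entropy; all tensors are random stand-ins for the LM outputs and the expanded head.

```python
# A minimal sketch of the calibration loss in Eq. (4): restrict the MLM scores
# at the [MASK] position to the virtual answer words V' and apply cross-entropy.
import torch
import torch.nn.functional as F

batch, d, num_rel = 4, 768, 42
hidden_at_mask = torch.randn(batch, d)        # LM hidden states at [MASK]
relation_embedding = torch.randn(num_rel, d)  # ê_[rel](V'), the expanded MLM head
labels = torch.randint(num_rel, (batch,))     # gold relation ids y

logits = hidden_at_mask @ relation_embedding.T  # p([MASK] = V' | x_prompt) scores
j_mask = F.cross_entropy(logits, labels)        # J_[MASK] of Eq. (4)
```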
Table 1: Examples of some relations from the TACREV dataset, and the relation-specific C_sub, C_obj and C_r.

| Relation Labels | C_sub | C_obj | C_r (Disassembled Relation Prepared for Virtual Answer Words) |
| per:country_of_birth | person | country | {"country", "of", "birth"} |
| per:date_of_death | person | date | {"date", "of", "death"} |
| per:schools_attended | person | organization | {"schools", "attended"} |
| org:alternate_names | organization | organization | {"alternate", "names"} |
| org:city_of_headquarters | organization | city | {"city", "of", "headquarters"} |
| org:number_of_employees/members | organization | number | {"number", "of", "employees", "members"} |
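As an illustration of the initialization in Eq. (3) over decompositions like those in Table 1, here is a minimal sketch assuming a RoBERTa-style MLM from HuggingFace transformers; the decomposition dict and the uniform weights phi_r are illustrative stand-ins.

```python
# A minimal sketch of the knowledgeable initialization in Eq. (3): average the
# word embeddings of each relation label's decomposed words to seed the
# learnable relation embeddings (virtual answer words V').
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
word_emb = model.get_input_embeddings().weight  # e(.), the word-embedding layer of L

decomposition = {  # C_r for two relations (cf. Table 1); illustrative subset
    "per:country_of_birth": ["country", "of", "birth"],
    "org:alternate_names": ["alternate", "names"],
}

with torch.no_grad():
    rows = []
    for relation, words in decomposition.items():
        ids = tokenizer(" ".join(words), add_special_tokens=False)["input_ids"]
        phi_r = torch.full((len(ids),), 1.0 / len(ids))   # uniform weights over C_r
        rows.append(phi_r @ word_emb[torch.tensor(ids)])  # ê_[rel](v') = phi_r · e(C_r)

# Learnable relation embeddings extending the MLM head as virtual answer words.
relation_embedding = torch.nn.Parameter(torch.stack(rows))
```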
Implicit Structured Constraints. To integrate structural knowledge into KnowPrompt, we adopt additional structured constraints to optimize the prompts. Specifically, we use a triplet (𝑠, 𝑟, 𝑜) to describe a relational fact, where 𝑠 and 𝑜 represent the virtual type words of the subject and object entities, respectively, and 𝑟 is the relation label within the predefined set of answer words V′. In KnowPrompt, instead of using pre-trained knowledge graph embeddings (which are heterogeneous compared with pre-trained language model embeddings), we directly leverage the output embeddings of the virtual type words and virtual answer words from the LM in the computation. We define the loss J_structured of the implicit structured constraints as follows:

$$\mathcal{J}_{\mathrm{structured}} = -\log \sigma\big(\gamma - d_r(\mathbf{s}, \mathbf{o})\big) - \sum_{i=1}^{n} \frac{1}{n} \log \sigma\big(d_r(\mathbf{s}'_i, \mathbf{o}'_i) - \gamma\big), \qquad (5)$$

$$d_r(\mathbf{s}, \mathbf{o}) = \lVert \mathbf{s} + \mathbf{r} - \mathbf{o} \rVert_2, \qquad (6)$$

where (𝑠′_i, 𝑟, 𝑜′_i) are negative samples, 𝛾 is the margin, 𝜎 is the sigmoid function, and 𝑑_r is the scoring function. For negative sampling, we assign the correct virtual answer word at the [MASK] position and randomly replace the subject or object entity with an irrelevant one to construct corrupted triples, in which the entity has an impossible type for the current relation.
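The following sketch makes Eqs. (5)-(6) concrete. It assumes s, r, o are the LM output embeddings of [sub], the answer word at [MASK], and [obj], with n corrupted triples as negatives; shapes and names are illustrative.

```python
# A minimal sketch of the implicit structured constraint in Eqs. (5)-(6).
import torch
import torch.nn.functional as F

def d_r(s, r, o):
    """TransE-style score of Eq. (6): d_r(s, o) = ||s + r - o||_2."""
    return torch.norm(s + r - o, p=2, dim=-1)

def structured_loss(s, r, o, s_neg, o_neg, gamma=1.0):
    """J_structured of Eq. (5) for one true triple and n negative samples."""
    pos = -F.logsigmoid(gamma - d_r(s, r, o))                 # keep the true triple within the margin
    neg = -F.logsigmoid(d_r(s_neg, r, o_neg) - gamma).mean()  # push corrupted triples beyond it
    return pos + neg

d = 768
s, r, o = torch.randn(d), torch.randn(d), torch.randn(d)
s_neg, o_neg = torch.randn(5, d), torch.randn(5, d)  # n = 5 corrupted triples
loss = structured_loss(s, r, o, s_neg, o_neg)
```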
4.3 Training Details

Our approach has a two-stage optimization procedure. First, we synergistically optimize the parameter set {ê_{[sub]}, ê_{[obj]}, ê_{[rel]}(V′)} of virtual type words and virtual answer words with a large learning rate 𝑙𝑟_1 to obtain the optimal prompt:

$$\mathcal{J} = \mathcal{J}_{[\mathrm{MASK}]} + \lambda \mathcal{J}_{\mathrm{structured}}, \qquad (7)$$

where 𝜆 is a hyperparameter, and J_structured and J_[MASK] are the losses for the knowledge constraints and the [MASK] prediction, respectively. Second, based on the optimized virtual type words and answer words, we utilize the objective function J_[MASK] to tune the parameters of the PLM together with the prompt (optimizing all parameters) with a small learning rate 𝑙𝑟_2. For more experimental details, please refer to the Appendix.
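A minimal, self-contained sketch of this two-stage schedule follows; the tiny linear "PLM" and the two loss surrogates are stand-ins, and the learning rates are illustrative — only the optimization pattern (prompt-only at large lr_1 with Eq. (7), then everything at small lr_2 with J_[MASK] alone) reflects the method.

```python
# A minimal sketch of the two-stage optimization in Section 4.3.
import torch

d, lam = 16, 0.1
plm = torch.nn.Linear(d, d)                     # stand-in for the PLM
prompt = torch.nn.Parameter(torch.randn(3, d))  # {ê_[sub], ê_[obj], ê_[rel](V')}
x, y = torch.randn(8, d), torch.randn(8, d)

def j_mask():        # surrogate for the cross-entropy of Eq. (4)
    return torch.nn.functional.mse_loss(plm(x) + prompt.sum(0), y)

def j_structured():  # surrogate for the constraint of Eq. (5)
    return prompt.norm()

# Stage 1: only the virtual words, joint loss J = J_[MASK] + lambda * J_structured (Eq. 7), large lr_1.
opt1 = torch.optim.AdamW([prompt], lr=3e-4)
for _ in range(10):
    loss = j_mask() + lam * j_structured()
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: all parameters, J_[MASK] only, small lr_2.
opt2 = torch.optim.AdamW([prompt, *plm.parameters()], lr=3e-5)
for _ in range(10):
    loss = j_mask()
    opt2.zero_grad(); loss.backward(); opt2.step()
```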
5 EXPERIMENTS

5.1 Datasets

For comprehensive evaluation, we carry out experiments on five RE datasets: SemEval 2010 Task 8 (SemEval) [26], DialogRE [54], TACRED [63], TACRED-Revisit [1], and Re-TACRED [47]. Statistical details are provided in Table 2 and Appendix A.

Table 2: Statistics of the RE datasets used in the paper, including the numbers of relations and instances in the different splits. For dialogue-level DialogRE, instance refers to the number of documents.

| Dataset | # Train | # Val | # Test | # Rel |
| SemEval | 6,507 | 1,493 | 2,717 | 19 |
| DialogRE | 5,963 | 1,928 | 1,858 | 36 |
| TACRED | 68,124 | 22,631 | 15,509 | 42 |
| TACRED-Revisit | 68,124 | 22,631 | 15,509 | 42 |
| Re-TACRED | 58,465 | 19,584 | 13,418 | 40 |

5.2 Experimental Settings

For fine-tuning vanilla PLMs and our KnowPrompt, we utilize RoBERTa_large in all experiments to make a fair comparison (except for DialogRE, where we adopt RoBERTa_base to compare with previous methods). For test metrics, we use the micro F1 score of RE as the primary metric, considering that F1 assesses the overall trade-off between precision and recall. We use different settings for the standard and low-resource experiments. All detailed settings for our KnowPrompt, Fine-tuning, and PTR can be found in Appendices B, C, and D.

Standard Setting. In the standard setting, we utilize the full D_train for fine-tuning. Considering that entity information is essential for models to understand relational semantics, a series of knowledge-enhanced PLMs have been explored that use knowledge graphs as additional information to enhance PLMs. Specifically, we select SpanBERT [30], KnowBERT [38], LUKE [52], and MTB [3] as strong baselines; these are typical models that use external knowledge to enhance learning objectives, input features, model architectures, or pre-training strategies. We also compare against several SOTA models on DialogRE, where one challenge is that each entity pair can have more than one relation.

Low-Resource Setting. We conduct 8-, 16-, and 32-shot experiments following LM-BFF [15, 22] and measure the average performance across five different randomly sampled splits, each drawn with a fixed seed from the set S_seed. Specifically, we sample 𝑘 instances of each class from the initial training and validation sets to form the few-shot training and validation sets.
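The K-shot sampling protocol might look like the sketch below; the toy dataset and the seed values are illustrative stand-ins.

```python
# A minimal sketch of K-shot sampling: k instances per relation class, drawn
# with fixed seeds so the five few-shot splits are reproducible.
import random
from collections import defaultdict

def sample_k_shot(dataset, k, seed):
    """Sample k instances per relation label with a fixed random seed."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append((text, label))
    subset = []
    for label, examples in sorted(by_label.items()):
        subset.extend(rng.sample(examples, min(k, len(examples))))
    return subset

toy_train = [(f"sentence {i}", f"rel_{i % 3}") for i in range(60)]
splits = [sample_k_shot(toy_train, k=8, seed=s) for s in (1, 2, 3, 4, 5)]  # S_seed
```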
Table 3: Standard RE performance in F1 scores (%) on different test sets. "w/o" means that no additional data is used for pre-training and fine-tuning, while "w/" means the model uses extra data. Note that "†" indicates that we additionally rerun the code of KnowPrompt and PTR with RoBERTa_base for a fair comparison with current SOTA models on DialogRE. Subscripts in red represent the advantage of KnowPrompt over the best baseline results. Best results are in bold.
5.3 Main Results

Standard Result. As shown in Table 3, the knowledge-enhanced PLMs yield better performance than vanilla Fine-tuning. This result illustrates that it is practical to inject task-specific knowledge to enhance models, and indicates that simple fine-tuning cannot fully exploit the knowledge acquired during pre-training.

Note that our KnowPrompt achieves improvements over all baselines, even outperforming the knowledge-enhanced models that use knowledge for data augmentation or architecture enhancement during fine-tuning. On the other hand, even when task-specific knowledge is already contained in knowledge-enhanced PLMs such as LUKE, KnowBERT, SpanBERT, and MTB, it is difficult for fine-tuning to stimulate that knowledge for downstream tasks. Overall, we believe that the development of prompt-tuning is imperative, and KnowPrompt is a simple and effective prompt-tuning paradigm for RE.

Low-Resource Result. From Table 4, KnowPrompt appears to be even more beneficial in low-resource settings. We find that KnowPrompt consistently outperforms the baseline methods Fine-tuning, GDPNet, and PTR on all datasets, especially in the 8-shot and 16-shot experiments. Specifically, our model obtains up to 22.4% and 13.2% absolute improvement on average compared with fine-tuning. As 𝐾 increases from 8 to 32, the improvement of KnowPrompt over the other three methods decreases gradually. For 32-shot, we think the number of labeled instances becomes sufficient, so the rich semantic knowledge injected by our approach induces smaller gains. We also observe that GDPNet performs even worse than Fine-tuning for 8-shot, which reveals that a complex SOTA model from the standard supervised setting may fall off the altar when data are extremely scarce.

Comparison between KnowPrompt and Prompt-Tuning Methods. Typical prompt-tuning methods such as LM-BFF perform outstandingly on text classification tasks (e.g., sentiment analysis and NLI), but they do not target RE, so we cannot directly rerun their code on RE tasks. To the best of our knowledge, PTR is the only method that uses prompts for RE; it is an excellent piece of work from the same period as our KnowPrompt. Thus, we make a comprehensive comparative analysis between KnowPrompt and PTR, summarized in Table 7. The specific analysis is as follows:

Firstly, PTR adopts a fixed number of multi-token answers and LM-BFF leverages actual label words as single-token answers, while KnowPrompt proposes virtual answer words with a single-token answer form. Thus, PTR needs manually formulated rules, which is more labor-intensive, while LM-BFF requires an expensive label search whose cost grows exponentially with the number of categories.

Secondly, essentially owing to this difference in answer form, our KnowPrompt and LM-BFF are model-agnostic and can be plugged into different kinds of PLMs (as shown in Figure 3, our method can be adopted on GPT-2), while PTR fails to generalize to generative LMs due to its multiple discontinuous [MASK] predictions.

Thirdly, the above experiments demonstrate that KnowPrompt is comprehensively comparable to PTR and performs better in low-resource scenarios. Especially for DialogRE, a multi-label classification task, our method exceeds PTR by approximately 5.4 points in the standard supervised setting. This may be attributed to the rule method used by PTR: forcing multiple mask predictions can confuse multi-label predictions.

In a nutshell, the above analysis shows that KnowPrompt is more flexible and widely applicable; meanwhile, it can be aware of knowledge and stimulate it to better serve downstream tasks.

5.4 Ablation Study on KnowPrompt

Effect of Virtual Answer Words Modules: To prove the effect of the virtual answer words and their knowledge injection, we conduct an ablation study, with results shown in Table 5. For -VAW, we adopt one specific token in the relation label as the label word without optimization, and for -Knowledge Injection for VAW, we randomly initialize the virtual answer words and then optimize them. Removing the knowledge injection for virtual answer words has the most significant effect, causing the relation F1 score to drop from 74.3% to 52.5% in the 8-shot setting. This reveals that injecting the semantic knowledge maintained in relation labels is critical for relation extraction, especially in few-shot scenarios.
Table 4: Low-resource RE performance in F1 scores (%) on different test sets. We use K = 8, 16, 32 (# examples per class) for few-shot experiments. Subscripts in red represent the advantage of KnowPrompt over Fine-tuning.

| Split | Methods | SemEval | DialogRE† | TACRED | TACRED-Revisit | Re-TACRED | Average |
| K=8 | Fine-tuning | 41.3 | 29.8 | 12.2 | 13.5 | 28.5 | 25.1 |
| K=8 | GDPNet | 42.0 | 28.6 | 11.8 | 12.3 | 29.0 | 24.7 |
| K=8 | PTR | 70.5 | 35.5 | 28.1 | 28.7 | 51.5 | 42.9 |
| K=8 | KnowPrompt | 74.3 (+33.0) | 43.8 (+14.0) | 32.0 (+19.8) | 32.1 (+18.6) | 55.3 (+26.8) | 47.5 (+22.4) |
| K=16 | Fine-tuning | 65.2 | 40.8 | 21.5 | 22.3 | 49.5 | 39.9 |
| K=16 | GDPNet | 67.5 | 42.5 | 22.5 | 23.8 | 50.0 | 41.3 |
| K=16 | PTR | 81.3 | 43.5 | 30.7 | 31.4 | 56.2 | 48.6 |
| K=16 | KnowPrompt | 82.9 (+17.7) | 50.8 (+10.0) | 35.4 (+13.9) | 33.1 (+10.8) | 63.3 (+13.8) | 53.1 (+13.2) |
| K=32 | Fine-tuning | 80.1 | 49.7 | 28.0 | 28.2 | 56.0 | 48.4 |
| K=32 | GDPNet | 81.2 | 50.2 | 28.8 | 29.1 | 56.5 | 49.2 |
| K=32 | PTR | 84.2 | 49.5 | 32.1 | 32.4 | 62.1 | 52.1 |
| K=32 | KnowPrompt | 84.8 (+4.7) | 55.3 (+3.6) | 36.5 (+8.5) | 34.7 (+6.5) | 65.0 (+9.0) | 55.3 (+6.9) |
Table 5: Ablation study on SemEval. VAW and VTW refer to virtual answer words and virtual type words, respectively.

| Method | K=8 | K=16 | K=32 | Full |
| KnowPrompt | 74.3 | 82.9 | 84.8 | 90.2 |
| -VAW | 68.2 | 72.7 | 75.9 | 85.2 |
| -Knowledge Injection for VAW | 52.5 | 78.0 | 80.2 | 88.0 |
| -VTW | 72.8 | 80.3 | 82.9 | 88.7 |
| -Knowledge Injection for VTW | 68.8 | 79.5 | 81.6 | 88.5 |
| -Structured Constraints | 73.5 | 81.2 | 83.6 | 89.3 |

Effect of Virtual Type Words Modules: We also conduct an ablation study to validate the effectiveness of the design of virtual type words. For -VTW, we directly remove the virtual type words, and for -Knowledge Injection for VTW, we randomly initialize the virtual type words and then optimize them. In the 8-shot setting, directly removing the virtual type words drops performance from 74.3 to 72.8, while randomly initialized virtual type words decrease performance to 68.8, which is much lower than 72.8. This phenomenon may be related to the noise introduced by random initialization; as the number of instances increases, the impact of this noise diminishes.

6 ANALYSIS AND DISCUSSION

6.1 Can KnowPrompt Be Applied to Other LMs?

Since we focus on MLMs (e.g., RoBERTa) in the main experiments, we further extend our KnowPrompt to autoregressive LMs like GPT-2. Specifically, we directly append the prompt template, with [MASK] at the end of the input sequence, for GPT-2. We further apply the relation embedding head by extending the word-embedding layer of the PLM; thus, GPT-2 can generate virtual answer words. We first notice that fine-tuning leads to poor performance with high variance in the low-resource setting, while KnowPrompt based on RoBERTa or GPT-2 achieves impressive improvements with low variance compared with Fine-tuning. As shown in Figure 3, KnowPrompt based on GPT-2 obtains results on par with the RoBERTa-large model, which reveals that our method can unearth the potential of GPT-2 to perform well on natural language understanding tasks such as RE. This finding also indicates that our method is model-agnostic and can be plugged into different kinds of PLMs.

[Figure 3: F1 scores over different K on TACRED-Revisit (y-axis: F1 score).]
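The GPT-2 extension might look like the following sketch: the template is appended so the masked slot is the final position, and that position's hidden state is scored against the relation embedding head. The input string and head initialization are illustrative stand-ins.

```python
# A minimal sketch of the GPT-2 extension in Section 6.1.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

num_rel = 42
relation_embedding = torch.nn.Parameter(torch.randn(num_rel, model.config.n_embd))

text = "Steve Jobs, co-founder of Apple. Apple [MASK] Steve Jobs"
inputs = tokenizer(text, return_tensors="pt")
hidden = model(**inputs).last_hidden_state[:, -1]  # autoregressive LM: use the last position
logits = hidden @ relation_embedding.T             # scores over the virtual answer words V'
```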
Table 7: Comparative statistics between KnowPrompt and PTR, including (1) the answer form of the prompt; (2) labor intensity; (3) MA: whether the method is model-agnostic; (4) ML: the ability of multi-label learning; and (5) CC: the computational complexity.

| Method | Answer Form | Labor | MA | ML | CC |
| LM-BFF | single-token | normal | yes | - | high |
| PTR | multi-token | normal | no | normal | normal |
| Ours | single-token | small | yes | better | normal |

Table 6: Input examples of our KnowPrompt with the top-3 words around [sub] and [obj].

| Input Example of our KnowPrompt | Top-3 words around [sub] | Top-3 words around [obj] |
| x: [CLS] It sold [E1] ALICO [/E1] to [E2] MetLife Inc [/E2] for $162 billion. [SEP] [sub] ALICO [sub] [MASK] [obj] MetLife Inc [obj]. [SEP]  y: "org:member_of" | organization, group, corporation | company, plc, organization |
| x: [CLS] [E1] Ismael Rukwago [/E1], a senior [E2] ADF [/E2] commander, denied any involvement. [SEP] [sub] Ismael Rukwago [sub] [MASK] [obj] ADF [obj]. [SEP]  y: "per:employee_of" | person, commander, colonel | intelligence, organization, command |

We further investigate what the learned virtual answer words express and whether the virtual type words can adaptively reflect the entity types based on context, as shown in Table 6. Specifically, we apply the MLM head over the positions of the virtual type words to obtain their output representations, and retrieve the top-3 vocabulary words nearest each virtual type word according to the L2 distance between the embeddings of the virtual type words and the other words. We observe that, thanks to the synergistic optimization with knowledge constraints, the learned virtual type words can dynamically adjust according to context and play a reminder role for RE.
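The nearest-word probing described above might look like this sketch; random tensors stand in for the PLM's word-embedding table and the learned representation of [sub].

```python
# A minimal sketch of the probing: rank vocabulary words by L2 distance to a
# learned virtual type word and keep the 3 nearest (cf. Table 6).
import torch

vocab_size, d = 50265, 768
word_emb = torch.randn(vocab_size, d)  # e(.), vocabulary embedding table
virtual_sub = torch.randn(d)           # learned representation of [sub]

dists = torch.cdist(virtual_sub[None], word_emb).squeeze(0)  # L2 distance to every word
top3_ids = dists.topk(3, largest=False).indices              # ids of the 3 nearest words
```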