CAT: A Contextualized Conceptualization and Instantiation Framework For Commonsense Reasoning
Figure 2: Overview of our CAT framework. A running example that conceptualizes the triple (PersonX is on
vacation, xIntent, have fun) is presented in the figure, where the head is conceptualized first, and the model needs
to determine whether the conceptualized triple still holds after the event conceptualization.
its linked concept c from a concept taxonomy C. The task objective for generative event conceptualization is to generate h_a directly from h_o with text generation models. For the triple conceptualization task, the objective is to distinguish whether a conceptualized triple (h_a, r, t), representing abstract commonsense knowledge, is plausible or not.

Dataset. To study conceptualization over CSKBs, we use the AbstractATOMIC dataset provided by He et al. (2022) as the benchmark. In AbstractATOMIC, ATOMIC is used as the original CSKB, and event conceptualization is performed discriminatively: a syntactic parsing schema identifies the components i in h_o that are heuristically linked to the concept taxonomies Probase (Wu et al., 2012) and WordNet (Miller, 1995) to form the conceptualized h_a. This heuristic produces over 32 times more candidate conceptualized head events and over 10 times more conceptualized triples than the original ATOMIC, since the number of concepts retrieved from the concept taxonomy C can be manually controlled to acquire a large number of conceptualizations. Triple conceptualization is defined as predicting the plausibility of triples whose head has been conceptualized. Only 131K (26%) conceptualizations of 7K (45%) ATOMIC head events and 81K (1.3%) conceptualized triples are manually annotated, forming D_h^l and D_t^l, while the others remain unlabeled as D_h^u and D_t^u. The train/dev/test partition follows the same split as the original ATOMIC. Statistics and more detailed explanations of AbstractATOMIC are given in Table 1 and Appendix A.

4 CAT Framework

This section introduces our proposed Contextualized ConceptualizAtion and InsTantiation (CAT) framework for conceptualizing commonsense knowledge bases and acquiring abstract commonsense knowledge. An overview is presented in Figure 2. Our motivation is two-fold. First, adding instantiation after conceptualization to form a cycle can strongly benefit both conceptualization tasks simultaneously: on the one hand, instantiating a conceptualized triple relies on the correctness of the event conceptualization; on the other hand, properly conceptualized triples can benefit event conceptualization via instantiation by providing the additional context brought by (r, t). Second, to address the lack of annotations, we resort to pseudo labeling, a typical semi-supervised learning approach that automatically assigns pseudo labels to the vast majority of unlabeled data using a teacher model.

Following He et al. (2022), we study the retrieval-based discriminative paradigm of event conceptualization and leave the generative paradigm as an intrinsic evaluation.
In CAT, we unify event conceptualization and triple conceptualization into one cycle and make them mutually benefit each other through instantiation and conceptualization. Our framework can be summarized in four steps:
(1) Train teacher models for event conceptualization and triple conceptualization on the labeled datasets D_h^l and D_t^l, respectively, and use the two teachers to assign pseudo labels to the unlabeled datasets.
(2) Conduct alternative conceptualization or instantiation on the labeled and pseudo-labeled data.
(3) Bootstrap (aggregate) the alternative concepts and instances from the second step using natural-language prompt templates, and train student models on both labeled and pseudo-labeled data.
(4) Use the student models to refine the pseudo labels and then re-train the student models.
4.1 Teacher Model Training

Two teacher models, one for event conceptualization and one for triple conceptualization, are trained separately on the labeled datasets D_h^l and D_t^l. As both tasks are inherently text/triple classification, we adopt KG-BERT (Yao et al., 2019) as the skeleton of our models. The event conceptualization model determines whether h_a is a valid conceptualization of h_o, and the triple conceptualization model determines whether a conceptualized triple (h_a, r, t) is plausible. The two models θ are trained on annotated examples x_i with a cross-entropy loss (Eq. 1) and are then used to provide pseudo labels for instances from the unlabeled datasets D_h^u and D_t^u. Two thresholds, T^+ and T^-, determine the pseudo labels of unlabeled examples with high confidence: examples with a pseudo-labeled score higher than T^+ are labeled y_i = 1, those with a score lower than T^- are labeled y_i = 0, and the rest are discarded.

L(x_i, \theta) = -\sum_{i=1}^{|x|} y_i \log(\theta(x_i))    (1)
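As an illustrative sketch of this labeling rule (not our released implementation; the threshold values, the assumption that scores lie in [0, 1], and the function name are placeholders), the assignment of pseudo labels from teacher scores can be written as:

```python
from typing import List, Optional

def assign_pseudo_labels(scores: List[float],
                         t_pos: float = 0.9,
                         t_neg: float = 0.1) -> List[Optional[int]]:
    """Assign pseudo labels from teacher plausibility scores.

    Scores above T+ become positive (1), scores below T- become
    negative (0), and everything in between is discarded (None).
    The thresholds here are placeholders, not the paper's values.
    """
    labels = []
    for s in scores:
        if s > t_pos:
            labels.append(1)      # confident positive
        elif s < t_neg:
            labels.append(0)      # confident negative
        else:
            labels.append(None)   # low confidence: drop the example
    return labels

# Example: keep only the confidently labeled unlabeled examples.
teacher_scores = [0.97, 0.55, 0.03]
print(assign_pseudo_labels(teacher_scores))  # [1, None, 0]
```

The same thresholding pattern is reused in Section 4.4, where the student models' scores replace the teacher's.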
4.2 Alternative Conceptualization and Instantiation

According to Murphy (2004), when humans learn a new concept, we pre-extract similar known concepts in our minds and infer possibly equivalent unknown concepts on the fly. Inspired by this theory, we retrieve additional abstract concepts or instantiated events to help discriminate conceptualizations and abstract commonsense knowledge. For event conceptualization, we retrieve alternative possible conceptualizations of h_o to accompany the learning of h_a. Additional conceptualizations of h_o from both labeled and pseudo-labeled examples are scored again by the teacher model and ranked by their predicted plausibility, and the top m conceptualizations are retrieved, with m a hyperparameter controlling the number of retrievals. For triple conceptualization, we perform instantiation in cascade, instantiating c into concrete instances to assist the learning process. Possible instantiations of c are extracted from annotated and pseudo-labeled event conceptualizations by searching for conceptualized events h_a' ∈ H_a other than h_a that have c as their concept and extracting their corresponding instances i ⊂ h_a'. Similarly, the instances are scored by the teacher model, and the top n of them are retrieved. Intuitively, alternative event conceptualizations serve as hints for discriminating the correctness of the target conceptualization, and instantiations carry additional contextualized information that helps verify the plausibility of a conceptualized triple, which meets the objective of deriving abstract commonsense knowledge that is context-sensitive.
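A schematic sketch of the cascaded instantiation step is given below; the (instance, concept) pairs, plausibility scores, and function name are illustrative stand-ins for the teacher-scored event conceptualizations, not values from AbstractATOMIC.

```python
from typing import Dict, List, Tuple

def retrieve_instances(concept: str,
                       conceptualizations: List[Tuple[str, str]],
                       scores: Dict[str, float],
                       n: int = 3) -> List[str]:
    """Instantiate a concept c: collect the instances of other
    conceptualized events that use c as their concept, then keep the
    top n by the teacher's plausibility score (n is a hyperparameter).
    """
    instances = {inst for inst, con in conceptualizations if con == concept}
    return sorted(instances, key=lambda i: scores.get(i, 0.0), reverse=True)[:n]

# Toy pairs loosely following the Figure 2 example; scores are invented.
pairs = [("go on a holiday", "relaxing event"),
         ("take a break", "relaxing event"),
         ("have fun", "relaxing event")]
toy_scores = {"go on a holiday": 0.95, "take a break": 0.90, "have fun": 0.85}
print(retrieve_instances("relaxing event", pairs, toy_scores, n=2))
# ['go on a holiday', 'take a break']
```

Retrieving the top-m alternative conceptualizations of an event follows the same rank-and-truncate pattern, only with candidate concepts in place of instances.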
4.3 Prompt Aggregation

We then bootstrap the retrieved alternative conceptualizations/instantiations via natural language prompts. Here, bootstrapping (Carey, 2004) can be understood as binding the alternative retrievals and the target concept/triple together to strengthen the discrimination of the target concept/triple. As shown in step (3) of Figure 2, the initially given input and the retrieved concepts/instances are concatenated via human-defined prompts for both conceptualization tasks, with the alternative concepts/instances sorted by their plausibility-score ranking. Two student models, S_h and S_t, one for each task, are trained with the prompt-augmented text as input; they are expected to learn the bootstrapping connection between the target and the additional retrievals we provide. More detail on the prompt design is given in Appendix B.
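The concatenation pattern can be sketched as follows, mirroring the "relaxing event [SEP] traveling, break, holiday" example shown in Figure 2; the actual templates are defined in Appendix B, so this simplified form is only illustrative.

```python
from typing import List

def aggregate_prompt(target: str, alternatives: List[str], sep: str = "[SEP]") -> str:
    """Concatenate the target concept (or verbalized triple) with its
    retrieved alternatives, which are assumed to be pre-sorted by
    plausibility ranking. Mirrors the pattern visible in Figure 2.
    """
    return f"{target} {sep} " + ", ".join(alternatives)

print(aggregate_prompt("relaxing event", ["traveling", "break", "holiday"]))
# relaxing event [SEP] traveling, break, holiday
```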
4.4 Pseudo-Label Refinement

All pseudo labels, initially derived by a teacher model trained on the original labeled dataset, are re-labeled according to the plausibility scores predicted by our newly enhanced student models S_h and S_t. Similar to the teacher model, two thresholds, T^+ and T^-, are applied to distinguish positive and negative examples for both tasks. In addition, negative labels are assigned to triples whose conceptualized head events are predicted to be wrong conceptualizations by S_h, since wrong conceptualizations cannot yield plausible abstract commonsense knowledge.
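A sketch of this refinement rule for abstract triples is given below; it reuses the thresholding pattern from Section 4.1, and `student_score` and `head_is_valid` are illustrative stand-ins for S_t's plausibility score and S_h's judgment of the conceptualized head.

```python
from typing import Optional

def refine_triple_label(student_score: float,
                        head_is_valid: bool,
                        t_pos: float = 0.9,
                        t_neg: float = 0.1) -> Optional[int]:
    """Re-label a pseudo-labeled abstract triple with the student models.

    A triple is forced negative when S_h judges its conceptualized head
    to be a wrong conceptualization; otherwise the usual T+/T- rule on
    S_t's plausibility score applies. Threshold values are placeholders.
    """
    if not head_is_valid:
        return 0                      # wrong head conceptualization
    if student_score > t_pos:
        return 1
    if student_score < t_neg:
        return 0
    return None                       # low confidence: discard

print(refine_triple_label(0.95, head_is_valid=False))  # 0
```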
Framework            Backbone PTLM / Method            Event Conceptualization      Triple Conceptualization
                                                       Validation    Testing        Validation    Testing
Supervised Learning  BERT-base 110M                    82.4±0.05     82.5±0.31      71.2±0.58     72.6±0.71
                     BERT-large 340M                   82.8±0.48     83.1±0.80      72.4±0.01     73.7±0.00
                     BART-base 139M                    83.8±0.28     84.4±0.32      72.0±0.09     72.6±0.15
                     BART-large 406M                   85.0±0.13     85.2±0.22      74.5±0.13     76.2±0.19
                     RoBERTa-base 110M                 84.1±0.04     84.5±0.19      72.2±0.00     74.1±0.00
                     RoBERTa-large 340M                85.2±0.24     85.5±0.02      75.3±0.00     76.9±0.01
                     DeBERTa-v3-base 214M              85.1±0.08     85.8±0.07      73.9±0.10     75.9±0.04
                     DeBERTa-v3-large 435M             85.8±0.05     86.2±0.15      76.9±0.03     78.0±0.02
                     ELECTRA-base 110M                 85.4±0.05     85.8±0.02      74.3±0.27     76.2±0.12
                     ELECTRA-large 340M                84.7±0.47     85.3±0.38      75.6±0.01     77.9±0.06
                     GPT2-base 117M                    60.0±0.06     59.1±0.14      52.8±0.14     55.9±0.11
                     GPT2-medium 345M                  61.2±0.11     60.3±0.08      54.6±0.17     57.4±0.09
                     GPT2-large 774M                   64.1±0.05     62.7±0.08      60.5±0.11     59.8±0.06
                     GPT2-XL 1558M                     64.2±0.19     63.6±0.22      62.2±0.08     61.5±0.10
Semi-Supervised      UDA (TF-IDF)                      83.6±0.29     83.6±0.24      75.8±1.26     76.8±1.34
Learning             UDA (back-trans.)                 83.4±0.27     83.6±0.24      75.8±1.25     76.8±1.34
                     Noisy-Student                     86.4±0.05     86.5±0.09      75.4±0.64     76.7±0.59
                     PseudoReasoner (BERT-base)        83.3±0.11     84.0±0.24      73.0±0.14     74.1±0.33
                     PseudoReasoner (RoBERTa-large)    86.6±0.25     86.7±0.33      76.3±0.12     77.2±0.21
CAT                  BERT-base 110M                    87.1±0.06     87.4±0.11      74.3±0.26     76.3±0.38
(Semi-Supervised)    BERT-large 340M                   87.7±0.16     88.0±0.19      75.8±0.23     77.8±0.36
                     BART-base 139M                    88.2±0.09     88.2±0.09      75.7±0.09     78.0±0.14
                     BART-large 406M                   88.6±0.07     88.7±0.10      77.2±0.12     79.0±0.14
                     RoBERTa-base 110M                 88.4±0.12     88.3±0.08      76.9±0.16     78.0±0.19
                     RoBERTa-large 340M                89.0±0.15     88.8±0.20      78.2±0.08     79.4±0.14
                     DeBERTa-v3-base 214M              88.8±0.12     88.9±0.08      77.5±0.10     79.9±0.07
                     DeBERTa-v3-large 435M             89.1±0.05     89.2±0.14      78.7±0.16     80.0±0.33
                     ELECTRA-base 110M                 88.7±0.10     88.9±0.10      74.9±0.15     75.5±0.40
                     ELECTRA-large 340M                88.6±0.77     88.5±0.70      74.9±0.15     75.5±0.40
Table 2: Performance (%) by our CAT framework on the discriminative event conceptualization and triple concep-
tualization tasks. We report the average AUC score and standard deviation across experiments with three random
seeds. The best performances within each framework are underlined, and the best among all models are bold-faced.
4.5 Application and Evaluation of CAT

The resulting models of CAT include an event conceptualization model and a triple conceptualization model, both fine-tuned on the refined pseudo labels and the labeled data. These two models can be used to conceptualize ATOMIC into a larger and more abstract commonsense knowledge base. We further conduct intrinsic evaluations of the acquired event conceptualization model under a generative event conceptualization paradigm, and extrinsic evaluations of the resulting conceptualized CSKB with the commonsense inference modeling task (COMET; Bosselut et al., 2019), in Section 5. We select COMET as the representative because it is a general commonsense model that can be applied to various downstream commonsense reasoning tasks such as SocialIQA (Sap et al., 2019b), self-talk (Shwartz et al., 2020), and CSKB completion (Malaviya et al., 2020). Meanwhile, generative event conceptualization makes automatic conceptualization scalable. Both are important applications and evaluations of CAT.

5 Experiments

We conduct conceptualization experiments with CAT in Section 5.1 and generative experiments as evaluations in Section 5.2. These experiments demonstrate that CAT is strongly capable of conceptualizing CSKBs, and that better conceptualization modeling helps populate more novel and diverse commonsense knowledge, which in turn helps commonsense modeling (COMET).
Training Data      BLEU-1        BLEU-2        METEOR        ROUGE-L       CIDEr         Human
                   Dev    Test   Dev    Test   Dev    Test   Dev    Test   Dev    Test   Dev    Test
D_h^l + D_0.95^u   73.0   71.1   70.2   63.0   48.1   47.1   71.4   70.7   63.6   66.9   92.8   93.3
D_h^l + D_0.9^u    71.3   71.9   65.2   63.8   45.7   46.7   69.8   71.3   63.4   67.9   90.5   91.0
D_h^l + D_0.8^u    68.2   68.4   65.9   64.0   44.8   44.0   66.6   66.7   60.0   62.0   86.0   85.7
D_h^l + D_0.7^u    66.5   67.2   57.2   62.6   43.0   43.4   65.9   65.8   60.4   61.2   79.0   80.3
D_h^l + D_0.5^u    64.9   62.4   58.3   51.1   41.2   40.9   63.8   63.0   58.2   59.4   74.5   79.0
D_h^l              67.6   65.3   56.8   53.1   43.5   43.1   65.7   66.6   60.2   60.9   70.0   81.5
Zero-Shot          20.2   17.0   6.80   4.11   5.80   4.70   3.80   3.00   1.90   1.60   15.0   11.5

Table 3: Performance (%) of GPT2 (XL) on the generative event conceptualization task. D_h^l stands for the annotated labeled data, and D^u stands for the data acquired by CAT; the subscript value is the threshold for selecting plausible pseudo labels. The best performances are bold-faced, and the second-best ones are underlined.
5.1 CSKB Conceptualization

Baselines. We introduce the baselines for the event and triple conceptualization tasks together, as both are inherently classification tasks, and we use AUC as the evaluation metric. Under a supervised learning setting, we apply the KG-BERT (Yao et al., 2019) model with BERT (Devlin et al., 2019), BART (Lewis et al., 2020), RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021, 2023), and ELECTRA (Clark et al., 2020) as backbone language models. We also leverage supervised generative language models as baselines: GPT2 (Radford et al., 2019) models are trained with a text generation objective on positive examples only, and we use perplexity as the prediction score when calculating AUC. For the semi-supervised learning baselines, we leverage UDA (Xie et al., 2020a), NoisyStudent (Xie et al., 2020b), and PseudoReasoner (Fang et al., 2022), with RoBERTa-large as the backbone model. Additional explanations can be found in Appendix C.1.1.
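As an illustrative sketch of the GPT2 baseline's scoring step (the base `gpt2` checkpoint stands in for the fine-tuned models, and the verbalized examples are toy strings rather than the actual input format), perplexity can be converted into a ranking score for AUC as follows:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.metrics import roc_auc_score

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # placeholder checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # would be the fine-tuned model

@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-token perplexity of a verbalized example under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss        # mean cross-entropy over tokens
    return torch.exp(loss).item()

# Lower perplexity should mean "more plausible", so negate it to obtain a
# score that is higher for positives before computing AUC.
examples = ["PersonX is on a relaxing event. xIntent: have fun.",   # toy verbalizations
            "PersonX is on an expense. xIntent: have fun."]
labels = [1, 0]
scores = [-perplexity(t) for t in examples]
print(roc_auc_score(labels, scores))
```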
Discriminative Results. The results for both tasks are presented in Table 2. Under a supervised learning setting, the KG-BERT family mostly performs better than GPT2 on both tasks, because GPT2 is fine-tuned only on positive examples and thus cannot learn from negative examples that contain wrong conceptualizations and implausible abstract commonsense knowledge. In the semi-supervised learning setting, previous SSL baselines bring rather limited improvements over supervised learning: the best PseudoReasoner improves the test-set results by only 0.5% and 0.3% on the two tasks compared with supervised RoBERTa-large models. In contrast, models trained with CAT outperform all other training methodologies. Compared with PseudoReasoner on the test sets, small backbone models (BERT-base) improve by 3.4% and 2.2%, and large models (RoBERTa-large) improve by 2.1% and 2.2%. This shows that pipelining the two conceptualization steps into a loop and leveraging our proposed bootstrapping-based method yields a larger performance gain than simply applying a semi-supervised learning strategy. Due to limited space, ablation studies on the framework components and the semi-supervised learning paradigm of CAT are reported in Appendix C.1.4; for example, they indicate that bootstrapping alternative conceptualizations and instantiations plays the most important role among all components of CAT in assisting the learning of conceptualization. Additional results and a computational cost study can be found in Appendix C.1.3 and Appendix D.

5.2 Application and Evaluation of CAT

As CAT is a framework for acquiring conceptualized commonsense knowledge, including both conceptualized head events (from h_o to h_a) and abstract commonsense triples (h_a, r, t), we assess these pseudo-labeled outcomes via two generative tasks, with varying selection thresholds, as evaluations.

Generative Event Conceptualization. To intrinsically evaluate the effectiveness of CAT's event conceptualization, we use the acquired conceptualized head events as training data to learn a generative event conceptualizer. Specifically, the models are trained on instance-conceptualization pairs in the format "<instance> is an instance of <concept>". At evaluation time, the model is prompted with "<instance> is an instance of [GEN]", where <instance> is the instance to be conceptualized and [GEN] is the generation token. We then retrieve the top-1 generation and compare it against the target set from the evaluation dataset to compute four NLG metrics, as listed in Appendix C.2.1. These scores can be regarded as automatic measures of how well the generated conceptualizations match the annotated targets; Table 3 also reports a human-evaluated plausible rate. We additionally vary the threshold T^+ that selects plausible conceptualized heads, where higher thresholds indicate higher plausibility as regarded by CAT.
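A minimal sketch of this input/output format is shown below; the base `gpt2` checkpoint and greedy decoding are placeholders for the trained conceptualizer and its actual decoding setup, and the outputs are meaningful only after fine-tuning on the conceptualization pairs.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder, not the trained conceptualizer
tokenizer.add_special_tokens({"additional_special_tokens": ["[GEN]"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

def training_text(instance: str, concept: str) -> str:
    # Format used for the instance-conceptualization training pairs.
    return f"{instance} is an instance of {concept}"

def top1_concept(instance: str, max_new_tokens: int = 5) -> str:
    # Prompt with the [GEN] token and take the top-1 (greedy) continuation.
    prompt = f"{instance} is an instance of [GEN]"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

print(training_text("PersonX is on vacation", "relaxing event"))
print(top1_concept("PersonX is on vacation"))  # meaningful only after fine-tuning
```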
Training Data               BLEU-1       BLEU-2       BLEU-3       BLEU-4       METEOR       ROUGE-L      CIDEr
                            Dev   Test   Dev   Test   Dev   Test   Dev   Test   Dev   Test   Dev   Test   Dev   Test
Zero-Shot                   5.42  4.89   1.84  1.51   0.65  0.52   0.26  0.21   6.50  5.70   6.40  5.90   1.60  1.20
ATOMIC (subset)             38.1  38.1   25.4  25.7   18.7  18.8   15.5  15.7   14.9  14.9   33.0  33.2   27.6  27.8
  + D_t^l                   38.1  38.5   24.8  25.5   17.8  18.4   14.7  15.2   15.3  15.6   33.1  33.7   26.8  27.3
  + Finetune                38.6  39.0   25.8  26.6   18.9  19.7   15.7  16.4   15.1  15.4   33.6  34.4   28.8  30.0
  + D_Abs.ATM.^u            40.0  40.3   27.1  27.8   20.0  20.8   16.5  17.5   16.1  16.3   35.3  35.7   31.6  31.7
  + Finetune                40.1  40.5   27.1  27.8   20.1  20.8   16.7  17.4   16.2  16.4   35.4  35.9   31.8  31.7
  + D_t^l + D_Abs.ATM.^u    40.2  40.6   26.2  27.4   19.0  20.4   15.1  16.8   16.3  16.5   35.0  35.4   31.0  31.3
  + Finetune                40.0  40.4   26.0  26.9   18.7  19.7   15.0  16.1   16.3  16.4   35.0  35.4   30.3  30.7
  + D_CAT^u                 41.2  41.9   28.1  29.0   20.7  21.5   16.5  17.8   16.6  16.9   35.9  36.5   33.4  33.7
  + Finetune                41.1  42.0   28.0  29.0   20.4  21.5   16.4  17.6   16.6  17.0   36.0  36.8   33.2  33.8
  + D_t^l + D_CAT^u         39.9  40.5   26.2  27.4   19.3  20.6   16.0  17.4   16.0  16.2   35.0  35.4   30.8  31.3
  + Finetune                40.4  41.0   26.6  27.6   19.5  20.7   16.1  17.1   16.2  16.5   35.4  35.8   31.3  31.5

Table 4: Performances (%) of GPT2 (XL) on the commonsense inference modeling task (COMET). D_t^l stands for the annotated abstract triples, and D_CAT^u stands for abstract triples acquired by CAT. D_Abs.ATM.^u contains triples that are pseudo-labeled by a supervised RoBERTa discriminator, as done by He et al. (2022). The best performances are bold-faced. Finetune refers to fine-tuning back on the ATOMIC subset.
The results are presented in Table 3. With a relatively high threshold, generators trained on a mixture of CAT's pseudo-labeled data and annotated concepts significantly outperform the baselines on every automated metric. A plausible rate of up to 93.3% is achieved on the test set, 11.8% higher than the baseline. Gradually lowering the threshold also lowers performance, indicating that abstract heads with lower plausibility scores tend to be of poorer quality. These results indicate that CAT can produce high-quality event conceptualizations from which generative models can learn better conceptualizers, without requiring a large amount of annotated data.

Commonsense Inference Modeling (COMET). To extrinsically evaluate the acquired abstract commonsense knowledge, we train COMET models following Bosselut et al. (2019), where GPT2 (XL) is fine-tuned on the ATOMIC subset, the annotated abstract triples D_t^l, the abstract knowledge verified by CAT, or their combinations. The commonsense generation results are presented in Table 4. As in COMET (Bosselut et al., 2019), all models are evaluated on the original ATOMIC's full validation and testing sets. The best result is achieved using a mixture of the ATOMIC subset and the abstract triples pseudo-labeled by our framework, with 0.95 as the threshold for selecting plausible triples. This indicates that high-quality abstract commonsense triples can indeed provide a more general view of the original commonsense knowledge and thus help commonsense inference. Additionally, training with our pseudo-labeled examples outperforms training with the annotated triples in AbstractATOMIC, which further validates the effectiveness of our model in leveraging a large amount of unlabeled data.
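As a sketch of how the best-performing training mixture in Table 4 could be assembled (the triples, scores, threshold handling, and function name are illustrative; the actual data come from ATOMIC and CAT's pseudo-labeled pool):

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def comet_training_mixture(atomic_subset: List[Triple],
                           cat_scored: Dict[Triple, float],
                           threshold: float = 0.95) -> List[Triple]:
    """Combine the ATOMIC subset with abstract triples whose CAT
    plausibility score passes the selection threshold (0.95 in Table 4)."""
    verified = [t for t, s in cat_scored.items() if s >= threshold]
    return atomic_subset + verified

# Toy data; real triples come from ATOMIC and CAT's pseudo-labeled pool.
atomic = [("PersonX is on vacation", "xIntent", "have fun")]
abstract = {("PersonX is on a relaxing event", "xIntent", "have fun"): 0.97}
print(len(comet_training_mixture(atomic, abstract)))  # 2
```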