CAT: A Contextualized Conceptualization and Instantiation Framework
for Commonsense Reasoning

Weiqi Wang1∗, Tianqing Fang1∗, Baixuan Xu1, Chun Yi Louis Bo1, Yangqiu Song1, Lei Chen1,2
1 Department of Computer Science and Engineering, HKUST, Hong Kong SAR, China
2 Information Hub, HKUST (GZ), Guangzhou, China
{wwangbw, tfangaa}@cse.ust.hk, [email protected]
{cybo, yqsong}@cse.ust.hk, [email protected]

arXiv:2305.04808v2 [cs.CL] 10 May 2023

Abstract

Commonsense reasoning, aiming at endowing machines with a human-like ability to make situational presumptions, is extremely challenging to generalize. Someone who barely knows about meditation, while being knowledgeable about singing, can still infer that meditation makes people relaxed from the existing knowledge that singing makes people relaxed, by first conceptualizing singing as a relaxing event and then instantiating that event to meditation. This process, known as conceptual induction and deduction, is fundamental to commonsense reasoning, yet it lacks both labeled data and methodologies to enhance commonsense modeling. To fill such a research gap, we propose CAT (Contextualized ConceptuAlization and InsTantiation), a semi-supervised learning framework that integrates event conceptualization and instantiation to conceptualize commonsense knowledge bases at scale. Extensive experiments show that our framework achieves state-of-the-art performance on two conceptualization tasks, and the acquired abstract commonsense knowledge can significantly improve commonsense inference modeling. Our code, data, and fine-tuned models are publicly available at https://github.com/HKUST-KnowComp/CAT.

Figure 1: A demonstration of commonsense reasoning on an unknown situation, PersonX plays with his dog, with the aid of abstract commonsense knowledge. Decontextualized conceptualization, such as observe, may yield wrong abstract commonsense knowledge that cannot be instantiated within the corresponding context.

1 Introduction

"Concepts are the glue that holds our mental world together." – Murphy (2004)

Commonsense reasoning is a crucial ability for machines to make situational presumptions and draw inferences from knowledge that reflects humans' understanding of situations and common facts (Davis, 1990; Davis and Marcus, 2015). It has gained increasing popularity in the Natural Language Processing (NLP) community with the emergence of CommonSense Knowledge Bases (CSKBs) (Sap et al., 2019a; Speer et al., 2017; Hwang et al., 2021) and large language models (Bosselut et al., 2019; Rajani et al., 2019; Liu et al., 2022b; Su et al., 2022; Yu et al., 2022b). However, when encountering situations beyond the given data, more abstract background knowledge must be acquired and generalized to assist the reasoning (Tenenbaum et al., 2011), and language models trained with an autoregressive language modeling objective do not explicitly leverage such abstract knowledge during inference.

Instead, humans rely on conceptual induction and deduction (Murphy, 2004) to make inferences about novel situations without the need to memorize all special cases. As shown in Figure 1, humans can derive conceptualizations from the assertion "PersonX watches a football game, as a result, he feels relaxed" to infer that "relaxing events can make someone feel relaxed," where the acquired abstract commonsense knowledge can further serve as general knowledge for reasoning about similar or associated situations. A new piece of commonsense knowledge, "PersonX plays with his dog, as a result, he feels happy and relaxed," can then be deduced by instantiating relaxing events to playing with his dog.

∗ Equal Contribution
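The induction-deduction cycle described above can be illustrated with a toy sketch. The taxonomy, event strings, and function names below are invented for illustration only and are not part of the CAT implementation, which learns both directions with neural models rather than dictionary lookups:

```python
# Toy illustration of conceptual induction (conceptualization) and
# deduction (instantiation). The taxonomy here is hand-written and
# purely hypothetical.
TAXONOMY = {
    "relaxing event": [
        "singing", "meditation",
        "watching a football game", "playing with his dog",
    ],
}

def conceptualize(event: str) -> list[str]:
    """Induction: map a concrete event to the abstract concepts covering it."""
    return [c for c, instances in TAXONOMY.items() if event in instances]

def instantiate(concept: str) -> list[str]:
    """Deduction: expand an abstract concept back into concrete events."""
    return TAXONOMY.get(concept, [])

# "singing makes people relaxed" generalizes to the relaxing-event concept,
# which can then be instantiated to transfer the inference to meditation.
assert conceptualize("singing") == ["relaxing event"]
assert "meditation" in instantiate("relaxing event")
```

In CAT, the plausibility of each such abstraction or instantiation is additionally judged in context, rather than assumed from taxonomy membership alone.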
As the cornerstone of generalizable commonsense reasoning, such a process is extremely challenging for machines to replicate due to the absence of contextualized conceptualizations and abstract commonsense knowledge in CSKBs, and a lack of relevant methodologies.

Yet, existing works address the processes of induction and deduction separately, via conceptualization and instantiation. Several conceptualization methods have been proposed with a specific focus on entity-level (Durme et al., 2009; Song et al., 2011; Gong et al., 2016; He et al., 2020; Peng et al., 2022; Song et al., 2015) and event-level (Chen et al., 2020; He et al., 2022) semantics. Instantiation (Allaway et al., 2023), as the process that simulates conceptual deduction, is tackled separately and not leveraged by these methods. Though abstract commonsense knowledge can be derived by using existing conceptualization methods to abstract a certain instance from factual commonsense knowledge, several limitations still exist.

First, the plausibility of abstract commonsense knowledge depends on both the correctness of conceptualization and proper contextualization under specific assertions. The latter, an essential step for the deduction of abstract knowledge, is missing from current methodologies. Taking Figure 1 as an example, the concept observe will not necessarily lead to the result of "feeling relaxed," as observe omits the entertaining property of the original instance as a cost of abstraction. Second, instantiating abstract commonsense knowledge can yield far more numerous and diverse concrete commonsense knowledge that can serve as an augmentation of the training dataset, while current methods undervalue such a process and focus only on conceptualization. Finally, the complex contextualization and conceptualization of commonsense knowledge can easily bring more than two orders of magnitude of data on top of the original dataset. This makes current labeled data scarce and makes it infeasible for practitioners to annotate all of it, leaving a large amount of unlabeled data.

To fill in these research gaps, we propose CAT (Contextualized ConceptuAlization and InsTantiation), a semi-supervised learning framework that unites event conceptualization and instantiation in cascade to conceptualize CSKBs and acquire abstract commonsense knowledge to aid commonsense reasoning. Inspired by how humans learn with concepts (Carey, 2004), we design a novel bootstrapping1 method to enhance conceptualization and abstract commonsense knowledge verification with the help of similar conceptualizations and instantiations as a reference. We demonstrate the effectiveness of CAT by using the acquired abstract commonsense knowledge to train COMET (Bosselut et al., 2019), a commonsense inference language model that generates if-then commonsense knowledge, and showing that our derived abstract commonsense knowledge can significantly improve commonsense inference modeling.

Our contributions are three-fold: (1) We introduce a semi-supervised learning framework, CAT, to conceptualize CSKBs with the assistance of progressively bootstrapping similar abstract concepts or instantiations in the conceptualization process. (2) We use CAT to acquire abstract commonsense knowledge at scale with high quality, which can be used for commonsense inference modeling. (3) We demonstrate the effectiveness of our framework by achieving state-of-the-art performance on two CSKB conceptualization tasks and remarkably improving commonsense inference modeling with our derived abstract commonsense knowledge.

2 Related Works

Conceptualization and Instantiation. Many existing works have studied conceptualization and instantiation separately. Durme et al. (2009) first attempted to derive more general knowledge by abstracting over large sets of factoids obtained from WordNet (Miller, 1995) synsets. Song et al. (2011, 2015) and Gong et al. (2016) proposed to turn instances in a sentence into concepts via weight matching from Probase (Wu et al., 2012). Recently, Liu et al. (2022c) proposed a taxonomy-guided induction method to mine verb-oriented commonsense knowledge from verb phrases. Peng et al. (2022) constructed a conceptual knowledge benchmark to evaluate language models with three zero-shot probing tasks. While these works focus on the conceptualization of entities, He et al. (2022) constructed an event conceptualization benchmark based on ATOMIC (Sap et al., 2019a) by combining syntactic parsing, semantically heuristic matching, and human annotation. Besides, the line of works focusing on ultra-fine entity typing (Choi et al., 2018; Dai et al., 2021; Li et al., 2022) shared

1 Bootstrapping refers to the linguistics term in language acquisition: humans learn new knowledge by recognizing its semantic elements and connecting them with known knowledge (Pinker and MacWhinney, 1987).
similar objectives of typing named entities, nominal nouns, and pronouns into a set of free-form phrases. Instantiation was attempted by Allaway et al. (2023), who proposed a controllable generative framework to automatically probe valid instantiations for abstract knowledge. Though Porada et al. (2021) and Peng et al. (2022) both showed that existing pretrained language models lack conceptual knowledge, none of the existing works explicitly combine both techniques to derive abstract knowledge that is context-sensitive and generalizable.

Data   Type      Train      Dev      Test
Dl     #event    107,384    12,117   11,503
       #triple   65,386     8,403    7,408
Du     #event    304,983    36,023   31,578
       #triple   4,851,272  499,523  570,400

Table 1: Statistics of labeled data Dl and unlabeled data Du in AbstractATOMIC.

Commonsense Reasoning. Endowing NLP systems with the ability to perform commonsense reasoning is an elusive goal of artificial intelligence (Sap et al., 2020). A diverse collection of commonsense reasoning tasks has been proposed as evaluation benchmarks (Talmor et al., 2019; Omura et al., 2020; Ponti et al., 2020; Fang et al., 2021a). Among them, Bosselut et al. (2019) proposed a generative model, COMET, which learns to produce if-then commonsense knowledge as an effective approach toward modeling commonsense inference, applicable to various commonsense reasoning tasks (Talmor et al., 2019).

Semi-Supervised Learning. Semi-supervised learning (SSL) aims at taking advantage of unlabeled data to equip models with stronger generalization ability (van Engelen and Hoos, 2020). The most common approach is using pseudo labels (Iscen et al., 2019; Wang et al., 2022) to expose more unseen data to the student model. It has been applied in various machine learning tasks such as image classification (Liu et al., 2022a; Hu et al., 2021), text classification (Li et al., 2021; Meng et al., 2019; Xiao et al., 2019), commonsense knowledge base population (Fang et al., 2022), and named entity recognition (Liu et al., 2021; Chen et al., 2021).

3 Problem Definition

Definition. Conceptualizing an event-centric CSKB to derive abstract commonsense knowledge comprises two steps (He et al., 2022): event conceptualization and triple conceptualization.

Denote the triples in the original CSKB as Do = {(ho, r, t) | ho ∈ Ho, r ∈ R, t ∈ T}, where Ho, R, and T are the sets of heads, relations, and tails in the original CSKB. The first step operates only on head events, without considering the context in r and t. The goal of event conceptualization is to produce a conceptualized head event ha from the original head ho to represent an abstraction of ho. In the second step, the task is to verify whether the conceptualized head ha still makes sense in the context of r and t, as r and t further restrict the level of abstractness in ha. As shown in Figure 1, conceptualizing watch football game to observe is wrong within a context that has feel relaxed as a result. Plausible (ha, r, t) triples are considered valid abstract commonsense knowledge.

Specifically, in the first step, there are two ways of conceptualizing head events alone: a retrieval-based discriminative way and a generative way. The retrieval-based discriminative paradigm identifies and links a component i in ho to a concept c in a concept taxonomy C to form a conceptualization ha by replacing i with c. The model needs to verify whether ha is a valid conceptualization of ho. The generative paradigm aims to generate ha directly given ho and the designated component i in ho.

Formally, denote the annotated dataset in the first step, event conceptualization, as Dhl = {(ho, ha, y) | ho ∈ Ho, ha ∈ Ha, y ∈ {0, 1}}, where ho is an original head event without conceptualization, ha is a corresponding conceptualization of ho, and y is the human-annotated label indicating whether such a conceptualization is plausible. The labeled dataset in the second step, triple conceptualization, is denoted as Dtl = {(h, r, t, y) | h ∈ Ha, r ∈ R, t ∈ T, y ∈ {0, 1}}, where h is a conceptualized head event from the first step, r and t are a relation and a tail from the original CSKB accompanying the corresponding original head ho, and y is the human-annotated label indicating whether such abstract commonsense knowledge, in the form of a conceptualized triple, is plausible. Besides the labeled datasets, unlabeled datasets Dhu and Dtu are defined similarly, with the only difference that the labels y are missing. Thus, the task objective for discriminative event conceptualization is to determine whether a ho can be properly abstracted using ha, where ha is derived by replacing a component i ⊂ ho with
Figure 2: Overview of our CAT framework. A running example that conceptualizes the triple (PersonX is on vacation, xIntent, have fun) is presented in the figure, where the head is conceptualized first, and the model needs to determine whether the conceptualized triple still holds after the event conceptualization.
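As a rough illustration of the prompt aggregation step (step (3) in the figure), the target and its top-ranked retrievals are joined into one textual input. This simplified "[SEP]"-joined template only mirrors the example shown in the figure; the exact prompt templates are given in the paper's Appendix B:

```python
def aggregate_prompt(target: str, retrievals: list[str]) -> str:
    """Concatenate a target concept/event with its alternative retrievals,
    which are assumed to be sorted upstream by teacher plausibility score."""
    return target + " [SEP] " + ", ".join(retrievals)

# Instantiations retrieved for the concept in the running example.
prompt = aggregate_prompt("relaxing event", ["traveling", "break", "holiday"])
# prompt == "relaxing event [SEP] traveling, break, holiday"
```

The aggregated string, rather than the bare target, is what the student models take as input.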

its linked concept c from a concept taxonomy C. The task objective for generative event conceptualization is to generate ha directly from ho with text generation models. For the triple conceptualization task, the objective is to distinguish whether a conceptualized triple (ha, r, t), representing abstract commonsense knowledge, is plausible or not.

Dataset. To study conceptualization over CSKBs, we use the AbstractATOMIC dataset provided by He et al. (2022) as the benchmark. In AbstractATOMIC, ATOMIC is used as the original CSKB, and event conceptualization adopts a discriminative way, where a syntactic parsing schema is defined to identify the components i in ho to be heuristically linked to the concept taxonomies Probase (Wu et al., 2012) and WordNet (Miller, 1995) to form conceptualized ha. Such a heuristic can produce over 32 times more candidate conceptualized head events and over 10 times more conceptualized triples compared with the original ATOMIC, as the number of retrieved concepts from the concept taxonomy C can be manually controlled to acquire a large number of conceptualizations. Triple conceptualization is defined as predicting the plausibility of triples whose head is conceptualized. Only 131K (26%) conceptualizations of 7K (45%) ATOMIC head events and 81K (1.3%) conceptualized triples are manually annotated, forming Dhl and Dtl, while the others remain unlabeled as Dhu and Dtu. The train/dev/test partition follows the same split as the original ATOMIC. Statistics and more detailed explanations of AbstractATOMIC are shown in Table 1 and Appendix A.

4 CAT Framework

This section introduces our proposed Contextualized ConceptuAlization and InsTantiation (CAT) framework for conceptualizing commonsense knowledge bases and acquiring abstract commonsense knowledge. An overview is presented in Figure 2. Our motivation is two-fold. First, adding instantiation after conceptualization to form a cycle can strongly benefit both conceptualization tasks simultaneously: on the one hand, instantiating a conceptualized triple relies on the correctness of event conceptualization; on the other hand, properly conceptualized triples can benefit event conceptualization via instantiation by providing the additional context brought by (r, t). Second, to address the lack of annotations, we resort to pseudo labeling, a typical semi-supervised learning approach, to automatically assign pseudo labels to the vast majority of unlabeled data using a teacher model.

Following He et al. (2022), we study the retrieval-based discriminative paradigm of event conceptualization and leave the generative
paradigm as an intrinsic evaluation. In CAT, we unify event conceptualization and triple conceptualization into one cycle and make them mutually benefit each other through instantiation and conceptualization. Our framework can be summarized into four steps:

(1) Train teacher models for both event conceptualization and triple conceptualization on the labeled datasets Dhl and Dtl, respectively, and use the two teachers to assign pseudo labels to the unlabeled datasets.
(2) Conduct alternative conceptualization or instantiation on labeled and pseudo-labeled data.
(3) Bootstrap (aggregate) the alternative concepts and instances from the second step using natural language prompt templates, and train student models on both labeled and pseudo-labeled data.
(4) Use the student models to refine the pseudo labels and then re-train the student models.

4.1 Teacher Model Training

Two teacher models, one for each of the event and triple conceptualization tasks, are trained separately on the labeled datasets Dhl and Dtl. As both tasks are inherently text/triple classification, we adopt KG-BERT (Yao et al., 2019) as the skeleton of our models. The event conceptualization model determines whether ha is a valid conceptualization of ho, and the triple conceptualization model determines whether a conceptualized triple (ha, r, t) is plausible. The two models θ are trained on annotated examples xi with a cross-entropy loss (Eq. 1) and used to provide pseudo labels to instances from the unlabeled datasets Dhu and Dtu. Two thresholds, T+ and T−, are set to determine the pseudo labels of unlabeled examples with high confidence. Examples with a pseudo-labeled score higher than T+ are labeled yi = 1, and those lower than T− are labeled yi = 0. The rest are discarded.

L(xi, θ) = − Σ_{i=1}^{|x|} yi log(θ(xi))    (1)

4.2 Alternative Conceptualization and Instantiation

According to Murphy (2004), when humans learn a new concept, we pre-extract similar known concepts in our minds and infer possibly equivalent unknown concepts on the fly. Inspired by this theory, we retrieve additional abstract concepts or instantiated events to help discriminate conceptualizations and abstract commonsense knowledge. For event conceptualization, we retrieve alternative possible conceptualizations of ho to accompany the learning of ha. Additional conceptualizations of ho from both labeled and pseudo-labeled examples are scored again by the teacher model and ranked according to their predicted plausibility, and the top m conceptualizations are retrieved, with m being a hyperparameter that controls the number of retrievals. For triple conceptualization, we perform instantiation in cascade to instantiate c into concrete instances to assist the learning process. Possible instantiations of c are extracted from annotated and pseudo-labeled event conceptualizations by searching for conceptualized events h′a ∈ Ha other than ha with c as the concept and extracting their corresponding instances i ⊂ h′a. Similarly, the instances are then scored by the teacher model, and the top n of them are retrieved. Intuitively, alternative event conceptualizations can serve as hints for discriminating the correctness of the target conceptualization, and instantiations can carry additional contextualized information to help verify the plausibility of a conceptualized triple, which meets the objective of deriving abstract commonsense knowledge that is context-sensitive.

4.3 Prompt Aggregation

We then bootstrap the retrieved alternative conceptualizations/instantiations via natural language prompts. Here, bootstrap (Carey, 2004) can be understood as binding the alternative retrievals and the target concept/triple together to strengthen the discrimination of the target concept/triple. As shown in step (3) of Figure 2, the initially given input and the retrieved concepts/instances are concatenated via human-defined prompts for both conceptualization tasks. Alternative concepts/instances are sorted in the order of their plausibility score ranking. Two student models, Sh and St, are trained for the two tasks using the modified text with such prompts as inputs. They are expected to learn the bootstrapping connectionism between the target and the additional retrievals we provide. More details about the prompt design are in Appendix B.

4.4 Pseudo-Label Refinement

All pseudo labels, initially derived by a teacher model trained on the original labeled dataset, are re-labeled according to the plausibility scores predicted by our newly enhanced student models Sh and St. As with the teacher models, two thresholds, T+ and T−, are applied to distinguish positive and negative examples for both tasks. In addition, negative labels are assigned to triples whose conceptualized head events are predicted as wrong conceptualizations by Sh, as wrong conceptualizations will not yield plausible abstract commonsense knowledge.

PTLM / Method                     Event Conceptualization     Triple Conceptualization
                                  Validation / Testing        Validation / Testing

Supervised Learning
BERT-base 110M                    82.4±0.05 / 82.5±0.31       71.2±0.58 / 72.6±0.71
BERT-large 340M                   82.8±0.48 / 83.1±0.80       72.4±0.01 / 73.7±0.00
BART-base 139M                    83.8±0.28 / 84.4±0.32       72.0±0.09 / 72.6±0.15
BART-large 406M                   85.0±0.13 / 85.2±0.22       74.5±0.13 / 76.2±0.19
RoBERTa-base 110M                 84.1±0.04 / 84.5±0.19       72.2±0.00 / 74.1±0.00
RoBERTa-large 340M                85.2±0.24 / 85.5±0.02       75.3±0.00 / 76.9±0.01
DeBERTa-v3-base 214M              85.1±0.08 / 85.8±0.07       73.9±0.10 / 75.9±0.04
DeBERTa-v3-large 435M             85.8±0.05 / 86.2±0.15       76.9±0.03 / 78.0±0.02
ELECTRA-base 110M                 85.4±0.05 / 85.8±0.02       74.3±0.27 / 76.2±0.12
ELECTRA-large 340M                84.7±0.47 / 85.3±0.38       75.6±0.01 / 77.9±0.06
GPT2-base 117M                    60.0±0.06 / 59.1±0.14       52.8±0.14 / 55.9±0.11
GPT2-medium 345M                  61.2±0.11 / 60.3±0.08       54.6±0.17 / 57.4±0.09
GPT2-large 774M                   64.1±0.05 / 62.7±0.08       60.5±0.11 / 59.8±0.06
GPT2-XL 1558M                     64.2±0.19 / 63.6±0.22       62.2±0.08 / 61.5±0.10

Semi-Supervised Learning
UDA (TF-IDF)                      83.6±0.29 / 83.6±0.24       75.8±1.26 / 76.8±1.34
UDA (back-trans.)                 83.4±0.27 / 83.6±0.24       75.8±1.25 / 76.8±1.34
Noisy-Student                     86.4±0.05 / 86.5±0.09       75.4±0.64 / 76.7±0.59
PseudoReasoner (BERT-base)        83.3±0.11 / 84.0±0.24       73.0±0.14 / 74.1±0.33
PseudoReasoner (RoBERTa-large)    86.6±0.25 / 86.7±0.33       76.3±0.12 / 77.2±0.21

CAT (Semi-Supervised)
BERT-base 110M                    87.1±0.06 / 87.4±0.11       74.3±0.26 / 76.3±0.38
BERT-large 340M                   87.7±0.16 / 88.0±0.19       75.8±0.23 / 77.8±0.36
BART-base 139M                    88.2±0.09 / 88.2±0.09       75.7±0.09 / 78.0±0.14
BART-large 406M                   88.6±0.07 / 88.7±0.10       77.2±0.12 / 79.0±0.14
RoBERTa-base 110M                 88.4±0.12 / 88.3±0.08       76.9±0.16 / 78.0±0.19
RoBERTa-large 340M                89.0±0.15 / 88.8±0.20       78.2±0.08 / 79.4±0.14
DeBERTa-v3-base 214M              88.8±0.12 / 88.9±0.08       77.5±0.10 / 79.9±0.07
DeBERTa-v3-large 435M             89.1±0.05 / 89.2±0.14       78.7±0.16 / 80.0±0.33
ELECTRA-base 110M                 88.7±0.10 / 88.9±0.10       74.9±0.15 / 75.5±0.40
ELECTRA-large 340M                88.6±0.77 / 88.5±0.70       74.9±0.15 / 75.5±0.40

Table 2: Performance (%) of our CAT framework on the discriminative event conceptualization and triple conceptualization tasks. We report the average AUC score and standard deviation across experiments with three random seeds. The best performances within each framework are underlined, and the best among all models are bold-faced.

4.5 Application and Evaluation of CAT

The resulting models of CAT include an event conceptualization model and a triple conceptualization model, both fine-tuned on the refined pseudo labels and the labeled data. These two models can be used to conceptualize ATOMIC into a larger commonsense knowledge base at a more abstract level. We further conduct intrinsic evaluations of the acquired event conceptualization model under a generative event conceptualization paradigm, and extrinsic evaluations of the resulting conceptualized CSKB with the commonsense inference modeling task (COMET; Bosselut et al. (2019)) in Section 5. Here we select COMET as the representative because it is a general commonsense model that can be applied to various downstream commonsense reasoning tasks such as SocialIQA (Sap et al., 2019b), self-talk (Shwartz et al., 2020), and CSKB completion (Malaviya et al., 2020). Meanwhile, generative event conceptualization enables performing automatic conceptualization scalably. Both are important applications and evaluations of CAT.

5 Experiments

We conduct conceptualization experiments using CAT in Section 5.1 and generative experiments as evaluations in Section 5.2. These experiments demonstrate that CAT has a strong capability for conceptualizing CSKBs, and that better conceptualization modeling can help populate more novel and diverse commonsense knowledge and thus aid commonsense modeling (COMET).

5.1 CSKB Conceptualization

Baselines. We collectively introduce the baselines for both event and triple conceptualization
Training Data     BLEU-1        BLEU-2        METEOR        ROUGE-L       CIDEr         Human
                  Dev / Test    Dev / Test    Dev / Test    Dev / Test    Dev / Test    Dev / Test
Dhl + Du(0.95)    73.0 / 71.1   70.2 / 63.0   48.1 / 47.1   71.4 / 70.7   63.6 / 66.9   92.8 / 93.3
Dhl + Du(0.9)     71.3 / 71.9   65.2 / 63.8   45.7 / 46.7   69.8 / 71.3   63.4 / 67.9   90.5 / 91.0
Dhl + Du(0.8)     68.2 / 68.4   65.9 / 64.0   44.8 / 44.0   66.6 / 66.7   60.0 / 62.0   86.0 / 85.7
Dhl + Du(0.7)     66.5 / 67.2   57.2 / 62.6   43.0 / 43.4   65.9 / 65.8   60.4 / 61.2   79.0 / 80.3
Dhl + Du(0.5)     64.9 / 62.4   58.3 / 51.1   41.2 / 40.9   63.8 / 63.0   58.2 / 59.4   74.5 / 79.0
Dhl               67.6 / 65.3   56.8 / 53.1   43.5 / 43.1   65.7 / 66.6   60.2 / 60.9   70.0 / 81.5
Zero-Shot         20.2 / 17.0   6.80 / 4.11   5.80 / 4.70   3.80 / 3.00   1.90 / 1.60   15.0 / 11.5

Table 3: Performance (%) of GPT2 (XL) on the generative event conceptualization task. Dhl stands for annotated labeled data, and Du stands for the data acquired by CAT; the value in parentheses indicates the threshold for selecting plausible pseudo labels. The best performances are bold-faced, and the second-best ones are underlined.
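The threshold-based selection used to build the Du subsets above, together with the two-threshold (T+, T−) scheme of Section 4.1, can be sketched as follows; function names and scores are illustrative only:

```python
def assign_pseudo_labels(scored_examples, thr_pos, thr_neg):
    """Two-threshold pseudo-labeling: scores above thr_pos become confident
    positives, scores below thr_neg become confident negatives, and
    everything in between is discarded."""
    kept = []
    for example, score in scored_examples:
        if score > thr_pos:
            kept.append((example, 1))
        elif score < thr_neg:
            kept.append((example, 0))
    return kept

scored = [("event A", 0.97), ("event B", 0.60), ("event C", 0.05)]
assert assign_pseudo_labels(scored, 0.95, 0.10) == [("event A", 1), ("event C", 0)]
```

A single cut-off, as in the Du(0.95) through Du(0.5) rows, corresponds to keeping only the positive branch of this scheme.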

tasks, as they are inherently classification tasks. AUC is used as the evaluation metric. Under a supervised learning setting, we apply the KG-BERT (Yao et al., 2019) model with BERT (Devlin et al., 2019), BART (Lewis et al., 2020), RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021, 2023), and ELECTRA (Clark et al., 2020) as the backbone language models. We also leverage supervised generative language models as baselines: GPT2 (Radford et al., 2019) models are trained with a text generation objective only on positive examples, and we use perplexity as the prediction score to calculate AUC. For the semi-supervised learning baselines, we leverage UDA (Xie et al., 2020a), NoisyStudent (Xie et al., 2020b), and PseudoReasoner (Fang et al., 2022), with RoBERTa-large as the backbone model. Additional explanations can be found in Appendix C.1.1.

Discriminative Results. The results for both tasks are presented in Table 2. Under a supervised learning setting, the KG-BERT family mostly performs better on both tasks than GPT2, since GPT2 is fine-tuned only on positive examples and thus cannot learn from negative examples that contain wrong conceptualizations and implausible abstract commonsense knowledge. As for the semi-supervised learning setting, previous SSL baselines are rather limited in improving performance over supervised learning: the best PseudoReasoner improves by only 0.5% and 0.3% on the test sets of the two tasks compared with supervised RoBERTa-large models. Instead, models trained with CAT outperform all other training methodologies. Comparing test set performance with PseudoReasoner, small backbone models (BERT-base) improve by 3.4% and 2.2%, and large models (RoBERTa-large) improve by 2.1% and 2.2%. This shows that pipelining the two conceptualization steps into a loop and leveraging our proposed bootstrapping-based method yields a larger performance gain than simply applying a semi-supervised learning strategy. Due to limited space, ablation studies on framework components and the semi-supervised learning paradigm of CAT are conducted in Appendix C.1.4. For example, the results indicate that bootstrapping alternative conceptualizations and instantiations plays the most important role among all components of CAT in assisting the learning of conceptualization. Additional results and a computational cost study can be found in Appendix C.1.3 and Appendix D.

5.2 Application and Evaluation of CAT

As CAT is a framework for acquiring conceptualized commonsense knowledge, including both conceptualized head events (from ho to ha) and abstract commonsense triples (ha, r, t), we assess these pseudo-labeled outcomes via two generative tasks under various thresholds as evaluations.

Generative Event Conceptualization. To intrinsically evaluate the effectiveness of CAT's event conceptualization, we use the acquired conceptualized head events as training data to learn a generative event conceptualizer. Specifically, the models are trained with instance-conceptualization pairs in the format "<instance> is an instance of <concept>". At the evaluation phase, the model is prompted with "<instance> is an instance of [GEN]", where <instance> is the instance to be conceptualized and [GEN] is the generation token. We then retrieve the top-1 generation and compare it against the target set from the evaluation dataset to compute four NLG metrics, as listed in Appendix C.2.1. These scores can be regarded as
| Training Data | BLEU-1 (Dev/Test) | BLEU-2 (Dev/Test) | BLEU-3 (Dev/Test) | BLEU-4 (Dev/Test) | METEOR (Dev/Test) | ROUGE-L (Dev/Test) | CIDEr (Dev/Test) |
|---|---|---|---|---|---|---|---|
| Zero-Shot | 5.42 / 4.89 | 1.84 / 1.51 | 0.65 / 0.52 | 0.26 / 0.21 | 6.50 / 5.70 | 6.40 / 5.90 | 1.60 / 1.20 |
| ATOMIC (subset) | 38.1 / 38.1 | 25.4 / 25.7 | 18.7 / 18.8 | 15.5 / 15.7 | 14.9 / 14.9 | 33.0 / 33.2 | 27.6 / 27.8 |
| +D^l_t | 38.1 / 38.5 | 24.8 / 25.5 | 17.8 / 18.4 | 14.7 / 15.2 | 15.3 / 15.6 | 33.1 / 33.7 | 26.8 / 27.3 |
| +D^l_t, +Finetune | 38.6 / 39.0 | 25.8 / 26.6 | 18.9 / 19.7 | 15.7 / 16.4 | 15.1 / 15.4 | 33.6 / 34.4 | 28.8 / 30.0 |
| +D^u_Abs.ATM. | 40.0 / 40.3 | 27.1 / 27.8 | 20.0 / 20.8 | 16.5 / 17.5 | 16.1 / 16.3 | 35.3 / 35.7 | 31.6 / 31.7 |
| +D^u_Abs.ATM., +Finetune | 40.1 / 40.5 | 27.1 / 27.8 | 20.1 / 20.8 | 16.7 / 17.4 | 16.2 / 16.4 | 35.4 / 35.9 | 31.8 / 31.7 |
| +D^l_t + D^u_Abs.ATM. | 40.2 / 40.6 | 26.2 / 27.4 | 19.0 / 20.4 | 15.1 / 16.8 | 16.3 / 16.5 | 35.0 / 35.4 | 31.0 / 31.3 |
| +D^l_t + D^u_Abs.ATM., +Finetune | 40.0 / 40.4 | 26.0 / 26.9 | 18.7 / 19.7 | 15.0 / 16.1 | 16.3 / 16.4 | 35.0 / 35.4 | 30.3 / 30.7 |
| +D^u_CAT | 41.2 / 41.9 | 28.1 / 29.0 | 20.7 / 21.5 | 16.5 / 17.8 | 16.6 / 16.9 | 35.9 / 36.5 | 33.4 / 33.7 |
| +D^u_CAT, +Finetune | 41.1 / 42.0 | 28.0 / 29.0 | 20.4 / 21.5 | 16.4 / 17.6 | 16.6 / 17.0 | 36.0 / 36.8 | 33.2 / 33.8 |
| +D^l_t + D^u_CAT | 39.9 / 40.5 | 26.2 / 27.4 | 19.3 / 20.6 | 16.0 / 17.4 | 16.0 / 16.2 | 35.0 / 35.4 | 30.8 / 31.3 |
| +D^l_t + D^u_CAT, +Finetune | 40.4 / 41.0 | 26.6 / 27.6 | 19.5 / 20.7 | 16.1 / 17.1 | 16.2 / 16.5 | 35.4 / 35.8 | 31.3 / 31.5 |

Table 4: Performances (%) of GPT2 (XL) on the commonsense inference modeling task (COMET). D^l_t stands for annotated abstract triples, and D^u_CAT stands for abstract triples acquired by CAT. D^u_Abs.ATM. contains triples that are pseudo-labeled by a supervised RoBERTa discriminator, as done by He et al. (2022). The best performances are bold-faced. +Finetune refers to fine-tuning back on the ATOMIC subset.

an approximation of the top-1 generations' recall. Additionally, we uniformly sample 500 generations from each evaluation split and conduct expert annotations on the plausibility of each conceptualization to ensure that out-of-domain concepts can be properly evaluated. The experts are asked to determine whether each top-1 generation is indeed a plausible conceptualization, such that the top-1 generations' precision is reflected. Thus, the current evaluation measures jointly evaluate the top-1 generations' precision and recall, which makes them robust and not easily affected by repetition problems (Li et al., 2020). Zero-shot GPT2 and GPT2 fine-tuned on the originally labeled event conceptualizations in D^l_h are used as baselines. We also study the effect of the threshold T+ that selects plausible conceptualized heads, where higher thresholds indicate higher plausibility regarded by CAT. The results are presented in Table 3. With a relatively high threshold, generators trained on a mixture of pseudo-labeled data by CAT and annotated concepts significantly outperform the baselines in every automated metric. A plausible rate of 93.3% is maximally achieved on the test set, which is 11.8% higher than the baseline. Gradually reducing the threshold also decreases the performance, indicating that abstract heads with lower plausibility scores can be of poorer quality. Such results indicate that CAT can produce high-quality event conceptualizations for generative models to learn better conceptualizers without the need to annotate a large amount of data.

Commonsense Inference Modeling (COMET). The second component of CAT produces triple-level abstract commonsense knowledge. We evaluate these abstract commonsense triples with a commonsense inference task that generates commonsense tails given heads and relations as inputs, as in COMET (Bosselut et al., 2019). Following He et al. (2022), we apply the same training and evaluation process to the models. The base training data we use are a subset of ATOMIC triples corresponding to the annotated abstract triples in D^l_t, which contains 17K (3.7%) of the original ATOMIC. We derive abstract commonsense knowledge using CAT from a subset of D^u_t whose heads correspond to those in the ATOMIC subset to ensure no data leakage, denoted as D^u_CAT. GPT2 is fine-tuned on the ATOMIC subset, the annotated abstract triples D^l_t, the abstract knowledge verified by CAT, or their combinations. The commonsense generation results are presented in Table 4. Similar to COMET (Bosselut et al., 2019), all models are evaluated on the original ATOMIC's full validation and testing sets. The best result is achieved using a mixture of the ATOMIC subset and abstract triples pseudo-labeled by our framework, with 0.95 as the threshold for selecting plausible triples. This indicates that high-quality abstract commonsense triples can indeed provide a more general view of the original commonsense knowledge, thus helping commonsense inference. Additionally, training with our pseudo-labeled examples outperforms training with the annotated triples in AbstractATOMIC, which further validates the effectiveness of our model in leveraging a large amount of unlabeled data.
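Both generative evaluations above linearize knowledge into text-to-text sequences: the conceptualizer is trained on "<instance> is an instance of <concept>" pairs and prompted with the [GEN] token, while COMET generates a tail given a head and a relation. A minimal sketch of this linearization — the helper names and example strings are ours, not from the paper, and exact delimiter tokens vary by implementation:

```python
from typing import Optional

def conceptualization_prompt(instance: str, concept: Optional[str] = None) -> str:
    # Training pair: "<instance> is an instance of <concept>".
    # At inference, the concept is omitted and the model continues after [GEN].
    if concept is None:
        return f"{instance} is an instance of [GEN]"
    return f"{instance} is an instance of {concept}"

def comet_sequence(head: str, relation: str, tail: Optional[str] = None) -> str:
    # COMET-style sequence: the model learns to generate the tail
    # given the head event and the relation token.
    if tail is None:
        return f"{head} {relation} [GEN]"
    return f"{head} {relation} [GEN] {tail}"
```

During fine-tuning, the loss is typically computed only on the tokens after [GEN], so the model learns to complete the prompt rather than reproduce it.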
[Figure 3 plot: test AUC against the retrieval number, with curves for Event Conceptualization AUC and Triple Conceptualization AUC.]

Figure 3: Ablation study on the number of retrieved conceptualizations/instantiations for the CAT framework.

[Figure 4 plot: scores on BLEU, CIDEr, ROUGE-L, and BERTScore for the "Difficult" and "Easy" test groups.]

Figure 4: Comparison of performance improvement by the GPT2 generator trained on the conceptualization-aided ATOMIC subset for two groups of testing head events.
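The "Difficult"/"Easy" groups in Figure 4 come from the split described in Section 5.4: each test entry is scored against the training set and the test set is cut in half by that score. A minimal sketch, assuming a pluggable similarity function and max-aggregation over the training set (both are our assumptions — the paper uses BERTScore and does not pin down the aggregation here):

```python
from typing import Callable, List, Tuple

def split_by_similarity(
    test_entries: List[str],
    train_entries: List[str],
    sim: Callable[[str, str], float],
) -> Tuple[List[str], List[str]]:
    """Return the (difficult, easy) halves of the test set.

    Each test entry is scored by its best similarity against the whole
    training set; the lower-scoring half exhibits the larger semantic
    shift and forms the "Difficult" split.
    """
    scored = [(max(sim(t, tr) for tr in train_entries), t) for t in test_entries]
    scored.sort(key=lambda pair: pair[0])  # ascending: least similar first
    mid = len(scored) // 2
    difficult = [entry for _, entry in scored[:mid]]
    easy = [entry for _, entry in scored[mid:]]
    return difficult, easy
```

To reproduce the paper's setup, `sim` would be BERTScore (e.g., F1 from the `bert-score` package) rather than a lexical-overlap measure.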
To further investigate how conceptual knowledge improves commonsense inference modeling, we conduct more empirical analysis in Section 5.4. Additional experiment results with other thresholds and case studies can be found in Appendix C.2.3 and Appendix E, respectively.

5.3 Number of Retrieved Alternative Conceptualizations and Instantiations

We then study the ablation of bootstrapping different numbers of alternative conceptualizations/instantiations (denoted as #retrieval) in our CAT framework. For simplicity, when tuning the #retrieval for one task, the #retrieval of the other task is fixed at the best value we acquired. We plot the test AUC score with #retrieval from 0 to 11 using BERT-base as the backbone model in Figure 3. #retrieval=0 refers to training with a simple student-teacher framework without bootstrapping alternative conceptualizations and instantiations. For event conceptualization, the performance generally correlates positively with the number of retrievals, while it starts dropping after 9. A reversed trend is observed for triple conceptualization, where using only two instances achieves the best performance. One possible reason is that in triple conceptualization, the retrieved instances are events, which are much longer than the concepts retrieved in event conceptualization, and aggregating various alternative events for a triple causes language models to be less sensitive to the semantics of the original triple (Holtzman et al., 2020).

5.4 The Effect of Abstract Knowledge

We finally study the effect of abstract commonsense knowledge acquired by CAT by examining the semantic overlap between training and testing data. We sort the test set by the BERTScore (Zhang et al., 2020b) of each individual testing entry against the whole training set in the original ATOMIC and split it in half to acquire two test groups. Testing entries with a lower BERTScore against the training set indicate a larger semantic shift from the training set (Deutsch and Roth, 2021), which is also harder for models to discriminate (Hsu et al., 2020). We denote the testing group with a lower BERTScore as "Difficult" and the other half as "Easy". The performance gain on the two test set splits between the best conceptualization-aided COMET and the COMET trained on the ATOMIC subset only is reported in Figure 4. We observe that training COMET with abstract commonsense knowledge leads to a larger improvement on harder test examples dissimilar from the original training set, indicating that introducing extra abstract commonsense knowledge can help COMET generalize better to harder test sets.

6 Conclusion

In conclusion, this paper proposes CAT, a semi-supervised learning framework for commonsense reasoning that leverages the power of abstract commonsense knowledge. By achieving state-of-the-art performances on CSKB conceptualization tasks, we remarkably improve commonsense inference modeling, an important cornerstone of many commonsense reasoning tasks. Our analysis also demonstrates that high-quality abstract commonsense knowledge can benefit commonsense inference modeling by providing more generalizability on hard commonsense knowledge. We hope this work can draw insights toward commonsense reasoning from a conceptualization perspective.
Limitations

Our framework manually sets thresholds T+ and T− in pseudo labeling by observations of data quality and hyperparameter searching. Dynamic threshold tuning (Xu et al., 2021) or meta pseudo labels (Pham et al., 2021; Li et al., 2021) can be implemented to better filter pseudo-labeled examples, and the thresholds for different tasks can be tuned separately to improve the models' generalizability.

Recently, large generative language models such as GPT3.5 (Brown et al., 2020) and ChatGPT (https://chat.openai.com/) (Ouyang et al., 2022; Gao et al., 2022) have demonstrated their strong potential on various NLP tasks, including probing abstract commonsense knowledge with in-context learning (Brown et al., 2020; Xie et al., 2022). Due to our limited access, we did not conduct fully-scaled experiments in our paper. A short discussion with case studies is provided in Appendix E.3.

While our framework only operates on AbstractATOMIC as the conceptualization of ATOMIC, it is also worth verifying our framework on other CSKBs such as ATOMIC2020 (Hwang et al., 2021), GLUCOSE (Mostafazadeh et al., 2020), ATOMIC10X (West et al., 2022), and FolkScope (Yu et al., 2022a), on eventuality CSKBs such as ASER (Zhang et al., 2020a, 2022), and on constructing large conceptualized CSKB benchmarks. In addition, we only evaluated the power of the acquired abstract commonsense knowledge on the commonsense knowledge generation task (COMET), while other commonsense reasoning tasks remain future work, such as CommonsenseQA (Talmor et al., 2019, 2021), SocialIQA (Sap et al., 2019b), the Winograd Schema Challenge (Levesque et al., 2012), PIQA (Bisk et al., 2020), Abductive Commonsense Reasoning (Bhagavatula et al., 2020), and Winogrande (Sakaguchi et al., 2020).

Ethics Statement

This paper introduces CAT, a framework for commonsense reasoning via conceptualizing CSKBs to acquire abstract commonsense knowledge. The experiments are conducted on publicly available and well-established datasets that are shared via open-access licenses. The usage of these datasets in our paper is only for research purposes and is consistent with the datasets' intended usage. The primary dataset, AbstractATOMIC, largely shares the content with another CSKB, ATOMIC, which is anonymized and desensitized (Sap et al., 2019a). Thus, no data privacy issue is involved.

The potential risks of CAT are relatively low. Since CAT is trained on AbstractATOMIC, a conceptualization benchmark based on a popular CSKB, ATOMIC, and two concept taxonomies, Probase and WordNet, it is expected that CAT does not contain any private, offensive, biased, or sensitive information, or social and political issues. The studied tasks all focus on conceptualization or CSKBs, which are not likely to generate harmful content, as shown in the case studies in Appendix E. Thus, we believe that CAT does not yield additional risks.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments. The authors of this paper are supported by the NSFC Fund (U20B2053) from the NSFC of China, the RIF (R6020-19 and R6021-20), and the GRF (16211520 and 16205322) from RGC of Hong Kong, the MHKJFS (MHP/001/19) from ITC of Hong Kong and the National Key R&D Program of China (2019YFE0198200), with special thanks to HKMAAC and CUSBLT. We also thank the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21, RMGS23CR05, RMGS23EG08).

References

Emily Allaway, Jena D. Hwang, Chandra Bhagavatula, Kathleen Mckeown, Doug Downey, and Yejin Choi. 2023. Penguins don't fly: Reasoning about generics through instantiations and exceptions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2610–2627, Dubrovnik, Croatia. Association for Computational Linguistics.

Mostafa M. Amin, Erik Cambria, and Björn W. Schuller. 2023. Will affective computing emerge from foundation models and general AI? A first evaluation on ChatGPT. CoRR, abs/2303.03186.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia. OpenReview.net.

Ning Bian, Xianpei Han, Le Sun, Hongyu Lin, Yaojie Lu, and Ben He. 2023. ChatGPT is a knowledgeable but inexperienced solver: An investigation of commonsense problem in large language models. CoRR, abs/2303.16421.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, pages 7432–7439. AAAI Press.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, pages 4762–4779. Association for Computational Linguistics.

Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit., 30(7):1145–1159.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, virtual.

Susan Carey. 2004. Bootstrapping & the origin of concepts. Daedalus, 133(1):59–68.

Chunkit Chan, Jiayang Cheng, Weiqi Wang, Yuxin Jiang, Tianqing Fang, Xin Liu, and Yangqiu Song. 2023. ChatGPT evaluation on sentence level relations: A focus on temporal, causal, and discourse relations. CoRR, abs/2304.14827.

Muhao Chen, Hongming Zhang, Haoyu Wang, and Dan Roth. 2020. What are you trying to do? Semantic typing of event processes. In Proceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, pages 531–542. Association for Computational Linguistics.

Weile Chen, Huiqiang Jiang, Qianhui Wu, Börje Karlsson, and Yi Guan. 2021. AdvPicker: Effectively leveraging unlabeled data via adversarial discriminator for cross-lingual NER. In Proceedings of ACL/IJCNLP 2021 (Volume 1: Long Papers), pages 743–753. Association for Computational Linguistics.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, Volume 1: Long Papers, pages 87–96. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL/IJCNLP 2021 (Volume 1: Long Papers), pages 1790–1799. Association for Computational Linguistics.

Ernest Davis. 1990. Representations of Commonsense Knowledge. The Morgan Kaufmann series in representation and reasoning. Morgan Kaufmann.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58(9):92–103.

Daniel Deutsch and Dan Roth. 2021. Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning, CoNLL 2021, Online, pages 300–309. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Li Du, Xiao Ding, Ting Liu, and Zhongyang Li. 2019. Modeling event background for if-then commonsense reasoning using context-aware variational autoencoder. In Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, pages 2682–2691. Association for Computational Linguistics.

Benjamin Van Durme, Phillip Michalak, and Lenhart K. Schubert. 2009. Deriving generalized knowledge from corpora using WordNet abstraction. In EACL 2009, 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, pages 808–816. The Association for Computer Linguistics.

Tianqing Fang, Quyet V. Do, Sehyun Choi, Weiqi Wang, and Yangqiu Song. 2023. CKBP v2: An expert-annotated evaluation set for commonsense knowledge base population. CoRR, abs/2304.10392.

Tianqing Fang, Quyet V. Do, Hongming Zhang, Yangqiu Song, Ginny Y. Wong, and Simon See. 2022. PseudoReasoner: Leveraging pseudo labels for commonsense knowledge base population. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, pages 3379–3394. Association for Computational Linguistics.

Tianqing Fang, Weiqi Wang, Sehyun Choi, Shibo Hao, Hongming Zhang, Yangqiu Song, and Bin He. 2021a. Benchmarking commonsense knowledge base population with an effective evaluation dataset. In Proceedings of EMNLP 2021, pages 8949–8964. Association for Computational Linguistics.

Tianqing Fang, Hongming Zhang, Weiqi Wang, Yangqiu Song, and Bin He. 2021b. DISCOS: Bridging the gap between discourse knowledge and commonsense knowledge. In WWW '21: The Web Conference 2021, pages 2648–2659. ACM / IW3C2.

Leo Gao, John Schulman, and Jacob Hilton. 2022. Scaling laws for reward model overoptimization. CoRR, abs/2210.10760.

Yu Gong, Kaiqi Zhao, and Kenny Qili Zhu. 2016. Representing verbs as argument concepts. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, pages 2615–2621. AAAI Press.

Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC@CIKM 13, San Francisco, California, USA, pages 25–30. ACM.

Masato Hagiwara, Yasuhiro Ogawa, and Katsuhiko Toyama. 2006. Selection of effective contextual information for automatic synonym acquisition. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia. The Association for Computer Linguistics.

Mutian He, Tianqing Fang, Weiqi Wang, and Yangqiu Song. 2022. Acquiring and modelling abstract commonsense knowledge via conceptualization. CoRR, abs/2206.01532.

Mutian He, Yangqiu Song, Kun Xu, and Dong Yu. 2020. On the role of conceptualization in commonsense knowledge graph construction. CoRR, abs/2003.03239.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria. OpenReview.net.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. 2020. Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, pages 10948–10957. Computer Vision Foundation / IEEE.

Zijian Hu, Zhengyu Yang, Xuefeng Hu, and Ram Nevatia. 2021. Simple: Similar pseudo label exploitation for semi-supervised classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pages 15099–15108. Computer Vision Foundation / IEEE.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, pages 3507–3512. Association for Computational Linguistics.

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. (COMET-) ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, pages 6384–6392. AAAI Press.

Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label propagation for deep semi-supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pages 5070–5079. Computer Vision Foundation / IEEE.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, WMT@ACL 2007, pages 228–231. Association for Computational Linguistics.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference, KR 2012, Rome, Italy. AAAI Press.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of ACL 2020, Online, pages 7871–7880. Association for Computational Linguistics.

Bangzheng Li, Wenpeng Yin, and Muhao Chen. 2022. Ultra-fine entity typing with indirect supervision from natural language inference. Trans. Assoc. Comput. Linguistics, 10:607–622.

Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don't say that! Making inconsistent dialogue unlikely with unlikelihood training. In Proceedings of ACL 2020, Online, pages 4715–4728. Association for Computational Linguistics.

Zheng Li, Danqing Zhang, Tianyu Cao, Ying Wei, Yiwei Song, and Bing Yin. 2021. MetaTS: Meta teacher-student network for multilingual sequence labeling with minimal supervision. In Proceedings of EMNLP 2021, pages 3183–3196. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Fengbei Liu, Yu Tian, Yuanhong Chen, Yuyuan Liu, Vasileios Belagiannis, and Gustavo Carneiro. 2022a. ACPL: Anti-curriculum pseudo-labelling for semi-supervised medical image classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, pages 20665–20674. IEEE.

Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022b. Generated knowledge prompting for commonsense reasoning. In Proceedings of ACL 2022 (Volume 1: Long Papers), Dublin, Ireland, pages 3154–3169. Association for Computational Linguistics.

Jingping Liu, Tao Chen, Chao Wang, Jiaqing Liang, Lihan Chen, Yanghua Xiao, Yunwen Chen, and Ke Jin. 2022c. VocSK: Verb-oriented commonsense knowledge mining with taxonomy-guided induction. Artif. Intell., 310:103744.

Kun Liu, Yao Fu, Chuanqi Tan, Mosha Chen, Ningyu Zhang, Songfang Huang, and Sheng Gao. 2021. Noisy-labeled NER with confidence estimation. In Proceedings of NAACL-HLT 2021, Online, pages 3437–3445. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA. OpenReview.net.

Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. 2020. Commonsense knowledge base completion with structural and semantic context. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 2925–2933. AAAI Press.

Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-supervised hierarchical text classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, pages 6826–6833. AAAI Press.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.

Nasrin Mostafazadeh, Aditya Kalyanpur, Lori Moon, David W. Buchanan, Lauren Berkowitz, Or Biran, and Jennifer Chu-Carroll. 2020. GLUCOSE: Generalized and contextualized story explanations. In Proceedings of EMNLP 2020, Online, pages 4569–4586. Association for Computational Linguistics.

Gregory Murphy. 2004. The Big Book of Concepts. MIT Press.

Kazumasa Omura, Daisuke Kawahara, and Sadao Kurohashi. 2020. A method for building a commonsense inference dataset based on basic events. In Proceedings of EMNLP 2020, Online, pages 2450–2460. Association for Computational Linguistics.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. OpenAI.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pages 311–318. ACL.

Hao Peng, Xiaozhi Wang, Shengding Hu, Hailong Jin, Lei Hou, Juanzi Li, Zhiyuan Liu, and Qun Liu. 2022. COPEN: Probing conceptual knowledge in pre-trained language models. In Proceedings of EMNLP 2022, pages 5015–5035. Association for Computational Linguistics.

Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V. Le. 2021. Meta pseudo labels. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pages 11557–11568. Computer Vision Foundation / IEEE.

Steven Pinker and B. MacWhinney. 1987. The bootstrapping problem in language acquisition. Mechanisms of Language Acquisition, pages 399–441.

Edoardo Maria Ponti, Goran Glavas, Olga Majewska, Qianchu Liu, Ivan Vulic, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of EMNLP 2020, Online, pages 2362–2376. Association for Computational Linguistics.

Ian Porada, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung. 2021. Modeling event plausibility with consistent conceptual abstraction. In Proceedings of NAACL-HLT 2021, Online, pages 1732–1743. Association for Computational Linguistics.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? CoRR, abs/2302.06476.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, pages 4932–4942. Association for Computational Linguistics.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 8732–8740. AAAI Press.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-
Language Processing, EMNLP 2022, Abu Dhabi, dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,
United Arab Emirates, December 7-11, 2022, pages Brendan Roof, Noah A. Smith, and Yejin Choi.
2019a. ATOMIC: an atlas of machine commonsense Abu Dhabi, United Arab Emirates. Association for
for if-then reasoning. In The Thirty-Third AAAI Con- Computational Linguistics.
ference on Artificial Intelligence, AAAI 2019, The
Thirty-First Innovative Applications of Artificial In- Alon Talmor, Jonathan Herzig, Nicholas Lourie, and
telligence Conference, IAAI 2019, The Ninth AAAI Jonathan Berant. 2019. Commonsenseqa: A ques-
Symposium on Educational Advances in Artificial tion answering challenge targeting commonsense
Intelligence, EAAI 2019, Honolulu, Hawaii, USA, knowledge. In Proceedings of the 2019 Conference
January 27 - February 1, 2019, pages 3027–3035. of the North American Chapter of the Association
AAAI Press. for Computational Linguistics: Human Language
Technologies, NAACL-HLT 2019, Minneapolis, MN,
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le USA, June 2-7, 2019, Volume 1 (Long and Short Pa-
Bras, and Yejin Choi. 2019b. Social iqa: Com- pers), pages 4149–4158. Association for Computa-
monsense reasoning about social interactions. In tional Linguistics.
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bha-
9th International Joint Conference on Natural Lan- gavatula, Yoav Goldberg, Yejin Choi, and Jonathan
guage Processing, EMNLP-IJCNLP 2019, Hong Berant. 2021. Commonsenseqa 2.0: Exposing the
Kong, China, November 3-7, 2019, pages 4462– limits of AI through gamification. In Proceedings of
4472. Association for Computational Linguistics. the Neural Information Processing Systems Track on
Datasets and Benchmarks 1, NeurIPS Datasets and
Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Benchmarks 2021, December 2021, virtual.
Choi, and Dan Roth. 2020. Commonsense reason-
ing for natural language processing. In Proceedings Joshua B Tenenbaum, Charles Kemp, Thomas L Grif-
of the 58th Annual Meeting of the Association for fiths, and Noah D Goodman. 2011. How to grow a
Computational Linguistics: Tutorial Abstracts, ACL mind: Statistics, structure, and abstraction. science,
2020, Online, July 5, 2020, pages 27–33. Associa- 331(6022):1279–1285.
tion for Computational Linguistics.
Jesper E. van Engelen and Holger H. Hoos. 2020. A
Vered Shwartz, Peter West, Ronan Le Bras, Chandra survey on semi-supervised learning. Mach. Learn.,
Bhagavatula, and Yejin Choi. 2020. Unsupervised 109(2):373–440.
commonsense question answering with self-talk. In
Proceedings of the 2020 Conference on Empirical Ramakrishna Vedantam, C. Lawrence Zitnick, and
Methods in Natural Language Processing, EMNLP Devi Parikh. 2015. Cider: Consensus-based image
2020, Online, November 16-20, 2020, pages 4615– description evaluation. In IEEE Conference on Com-
4629. Association for Computational Linguistics. puter Vision and Pattern Recognition, CVPR 2015,
Boston, MA, USA, June 7-12, 2015, pages 4566–
Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hong- 4575. IEEE Computer Society.
song Li, and Weizhu Chen. 2011. Short text
conceptualization using a probabilistic knowledge- Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna
base. In IJCAI 2011, Proceedings of the 22nd Gurevych. 2022. GPL: generative pseudo label-
International Joint Conference on Artificial Intel- ing for unsupervised domain adaptation of dense re-
ligence, Barcelona, Catalonia, Spain, July 16-22, trieval. In Proceedings of the 2022 Conference of
2011, pages 2330–2336. IJCAI/AAAI. the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
Yangqiu Song, Shusen Wang, and Haixun Wang. 2015. nologies, NAACL 2022, Seattle, WA, United States,
Open domain short text conceptualization: A gen- July 10-15, 2022, pages 2345–2360. Association for
erative + descriptive modeling approach. In Pro- Computational Linguistics.
ceedings of the Twenty-Fourth International Joint
Conference on Artificial Intelligence, IJCAI 2015, Peifeng Wang, Filip Ilievski, Muhao Chen, and Xi-
Buenos Aires, Argentina, July 25-31, 2015, pages ang Ren. 2021. Do language models perform gen-
3820–3826. AAAI Press. eralizable commonsense inference? In Findings
of the Association for Computational Linguistics:
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ACL/IJCNLP 2021, Online Event, August 1-6, 2021,
Conceptnet 5.5: An open multilingual graph of gen- volume ACL/IJCNLP 2021 of Findings of ACL,
eral knowledge. In Proceedings of the Thirty-First pages 3681–3688. Association for Computational
AAAI Conference on Artificial Intelligence, Febru- Linguistics.
ary 4-9, 2017, San Francisco, California, USA,
pages 4444–4451. AAAI Press. Peter West, Chandra Bhagavatula, Jack Hessel, Jena D.
Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu,
Ying Su, Zihao Wang, Tianqing Fang, Hongming Sean Welleck, and Yejin Choi. 2022. Symbolic
Zhang, Yangqiu Song, and Tong Zhang. 2022. knowledge distillation: from general language mod-
MICO: A multi-alternative contrastive learning els to commonsense models. In Proceedings of the
framework for commonsense knowledge representa- 2022 Conference of the North American Chapter of
tion. In Findings of the Association for Computa- the Association for Computational Linguistics: Hu-
tional Linguistics: EMNLP 2022, pages 1339–1351, man Language Technologies, NAACL 2022, Seattle,
WA, United States, July 10-15, 2022, pages 4602– Changlong Yu, Jialong Han, Peifeng Wang, Yangqiu
4625. Association for Computational Linguistics. Song, Hongming Zhang, Wilfred Ng, and Shuming
Shi. 2020. When hearst is not enough: Improv-
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien ing hypernymy detection from corpus with distri-
Chaumond, Clement Delangue, Anthony Moi, Pier- butional models. In Proceedings of the 2020 Con-
ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- ference on Empirical Methods in Natural Language
icz, Joe Davison, Sam Shleifer, Patrick von Platen, Processing, EMNLP 2020, Online, November 16-20,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, 2020, pages 6208–6217. Association for Computa-
Teven Le Scao, Sylvain Gugger, Mariama Drame, tional Linguistics.
Quentin Lhoest, and Alexander M. Rush. 2020.
Transformers: State-of-the-art natural language pro- Changlong Yu, Weiqi Wang, Xin Liu, Jiaxin Bai,
cessing. In Proceedings of the 2020 Conference on Yangqiu Song, Zheng Li, Yifan Gao, Tianyu Cao,
Empirical Methods in Natural Language Processing: and Bing Yin. 2022a. Folkscope: Intention knowl-
System Demonstrations, EMNLP 2020 - Demos, On- edge graph construction for discovering e-commerce
line, November 16-20, 2020, pages 38–45. Associa- commonsense. CoRR, abs/2211.08316.
tion for Computational Linguistics.
Changlong Yu, Hongming Zhang, Yangqiu Song, and
Wentao Wu, Hongsong Li, Haixun Wang, and Wilfred Ng. 2022b. Cocolm: Complex common-
Kenny Qili Zhu. 2012. Probase: a probabilistic tax- sense enhanced language model with discourse re-
onomy for text understanding. In Proceedings of the lations. In Findings of the Association for Com-
ACM SIGMOD International Conference on Man- putational Linguistics: ACL 2022, Dublin, Ireland,
agement of Data, SIGMOD 2012, Scottsdale, AZ, May 22-27, 2022, pages 1175–1187. Association for
USA, May 20-24, 2012, pages 481–492. ACM. Computational Linguistics.

Huiru Xiao, Xin Liu, and Yangqiu Song. 2019. Ef- Hongming Zhang, Xin Liu, Haojie Pan, Haowen Ke,
ficient path prediction for semi-supervised and Jiefu Ou, Tianqing Fang, and Yangqiu Song. 2022.
weakly supervised hierarchical text classification. In ASER: towards large-scale commonsense knowl-
The World Wide Web Conference, WWW 2019, San edge acquisition via higher-order selectional prefer-
Francisco, CA, USA, May 13-17, 2019, pages 3370– ence over eventualities. Artif. Intell., 309:103740.
3376. ACM.
Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song,
Qizhe Xie, Zihang Dai, Eduard H. Hovy, Thang Luong, and Cane Wing-Ki Leung. 2020a. ASER: A large-
and Quoc Le. 2020a. Unsupervised data augmenta- scale eventuality knowledge graph. In WWW ’20:
tion for consistency training. In Advances in Neural The Web Conference 2020, Taipei, Taiwan, April 20-
Information Processing Systems 33: Annual Con- 24, 2020, pages 201–211. ACM / IW3C2.
ference on Neural Information Processing Systems
2020, NeurIPS 2020, December 6-12, 2020, virtual. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.
Weinberger, and Yoav Artzi. 2020b. Bertscore:
Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Evaluating text generation with BERT. In 8th Inter-
Quoc V. Le. 2020b. Self-training with noisy stu- national Conference on Learning Representations,
dent improves imagenet classification. In 2020 ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
IEEE/CVF Conference on Computer Vision and Pat- 2020. OpenReview.net.
tern Recognition, CVPR 2020, Seattle, WA, USA,
June 13-19, 2020, pages 10684–10695. Computer
Vision Foundation / IEEE.

Sang Michael Xie, Aditi Raghunathan, Percy Liang,


and Tengyu Ma. 2022. An explanation of in-context
learning as implicit bayesian inference. In The Tenth
International Conference on Learning Representa-
tions, ICLR 2022, Virtual Event, April 25-29, 2022.
OpenReview.net.

Yi Xu, Lei Shang, Jinxing Ye, Qi Qian, Yu-Feng Li,


Baigui Sun, Hao Li, and Rong Jin. 2021. Dash:
Semi-supervised learning with dynamic threshold-
ing. In Proceedings of the 38th International Con-
ference on Machine Learning, ICML 2021, 18-24
July 2021, Virtual Event, volume 139 of Proceedings
of Machine Learning Research, pages 11525–11536.
PMLR.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019.


KG-BERT: BERT for knowledge graph completion.
CoRR, abs/1909.03193.
Appendices

A Dataset Description

In this section, we introduce more about AbstractATOMIC (He et al., 2022), the primary dataset we experimented with. AbstractATOMIC is a conceptualized commonsense knowledge benchmark built upon ATOMIC (Sap et al., 2019a), a popular CSKB in the format of (h, r, t) triples. The dataset is entirely in English. It contains two parts: (1) event conceptualization data and (2) abstract commonsense knowledge triples.

The event conceptualization data contain conceptualizations for head event instances, where the events are filtered from the original ATOMIC head events. Unlike traditional entity-concept taxonomies, where instances are nouns or verb phrases, AbstractATOMIC includes instance candidates that can be either an entire head event or a certain component of an event. Detailed examples can be found in Appendix E.

The instances within each head event are identified through syntactic parsing, using a parser from the spaCy library (https://spacy.io/) together with five human-defined matching rules. After identification, the candidate instances are heuristically matched against Probase (Wu et al., 2012) and WordNet (Miller, 1995) via GlossBERT (Huang et al., 2019) to acquire their candidate concepts. A neural generator based on GPT2, similar to the baseline in this paper, is also trained to generate concepts. A supervised conceptualization verifier, based on RoBERTa (Liu et al., 2019), is trained as the final gatekeeper to roughly verify the acquired concepts.

Human annotations on the Amazon Mechanical Turk platform are further conducted to judge the correctness of 131K conceptualizations of 7K ATOMIC events. All conceptualizations that are not annotated are regarded as unlabeled data in this paper. More detailed statistics of the head event conceptualization data can be found in Table 5.

                          D_h^l    D_h^u    Total
#Unq. event               7,196    15,165   15,388
#Unq. instance            7,935    20,843   21,493
#Unq. concept             20,036   20,367   31,227
Avg. #concept/event       18.21    24.57    32.73
Avg. #concept/instance    16.51    17.88    23.43

Table 5: Additional statistics of the event conceptualization data in AbstractATOMIC (AbsATM). D^l stands for annotated event conceptualizations and D^u for unverified conceptualizations. # denotes "number of", Unq stands for unique, and Avg is average.

After acquiring the event conceptualizations, which only concern head events, abstract commonsense knowledge, in the form of (h, r, t) triples, is collected by connecting each conceptualized head event with its non-abstract counterparts (commonsense relations and inference tails) from ATOMIC. Only the head events contain abstract concepts. Thus, these abstract triples encode more generalized if-then commonsense knowledge that is potentially useful for commonsense reasoning through instantiation.

Human annotations on Amazon Mechanical Turk further verify 81K uniformly sampled abstract triples. These triples only correspond to 689 unique ATOMIC head events, which makes annotations relatively scarce compared with the scale of the unlabeled data. A supervised RoBERTa-large verifier is trained on the annotated triples to roughly verify the abstract triples that are not annotated; triples with scores higher than 0.9 are pseudo-labeled as positive (He et al., 2022). However, this paper only leverages these pseudo-labeled examples as baselines in the commonsense inference generation task (COMET). Only annotated triples are considered hard-labeled for all other tasks, and triples that are not annotated are treated as unlabeled by default. The detailed relational distribution of abstract triples is presented in Table 6. Examples can be found in Appendix E.

Relation   ATOMIC    D_t^l    D_t^u       D_Abs.ATM.^u
xEffect    78,832    12,168   938,330     451,564
oEffect    28,351    3,526    333,845     160,207
xWant      101,249   15,312   1,170,835   543,964
oWant      43,079    5,408    484,570     227,493
xReact     62,969    8,923    510,476     288,019
oReact     26,570    3,030    224,706     126,386
xNeed      74,272    11,733   900,429     425,060
xAttr      110,791   14,249   838,191     465,511
xIntent    45,490    6,848    519,813     259,694
Total      572,053   81,197   5,921,195   2,947,898

Table 6: Abstract commonsense triple distribution by relation. D^l stands for annotated triples and D^u for unverified triples. D_Abs.ATM.^u stands for abstract triples verified by a supervised RoBERTa-large discriminator, as done by He et al. (2022).
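The triple-collection step described in this appendix can be sketched as follows. This is an illustrative sketch, not code from the AbstractATOMIC release; the EventConceptualization container and the abstract_triples helper are hypothetical names. It shows how replacing an identified instance in a head event with one of its candidate concepts, and reattaching the original ATOMIC relation and tail, yields an abstract (h, r, t) triple.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

@dataclass
class EventConceptualization:
    head: str             # original ATOMIC head event
    instance: str         # identified sub-span of the head event
    concepts: List[str]   # candidate concepts for that instance

def abstract_triples(
    ec: EventConceptualization,
    atomic_triples: List[Triple],
) -> List[Triple]:
    """Link each conceptualized head with the (r, t) pairs of the original head."""
    out = []
    for concept in ec.concepts:
        # Replace the instance span with its concept to form an abstract head.
        abstract_head = ec.head.replace(ec.instance, concept)
        for h, r, t in atomic_triples:
            if h == ec.head:
                out.append((abstract_head, r, t))
    return out

ec = EventConceptualization(
    head="PersonX watches football game",
    instance="football game",
    concepts=["entertainment event", "sport event"],
)
triples = abstract_triples(
    ec, [("PersonX watches football game", "xReact", "happy")]
)
# → [('PersonX watches entertainment event', 'xReact', 'happy'),
#    ('PersonX watches sport event', 'xReact', 'happy')]
```

Each resulting triple keeps the original relation and tail, which is why the abstract triples are more generalized but still grounded in annotated ATOMIC inferences.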
B Prompt Design

In this section, we introduce the textual prompts used for training the various models.

For event conceptualization, denote the original event as h_o, the instance as i, the target concept to be verified as c, and the retrieved alternative conceptualizations as c_{r,1}, c_{r,2}, c_{r,3}, ..., c_{r,m}. The prompt for training the teacher model is "[CLS] h_o [SEP] c", while the one for training the student model is "[CLS] h_o [SEP] c [SEP] c_{r,1}, c_{r,2}, c_{r,3}, ..., c_{r,m}". For the example in Figure 2, the filled prompt is "PersonX is on vacation [SEP] relaxing event [SEP] traveling, break, holiday." Specifically, the special tokens <c> and </c> are used to enclose i ⊂ h_o within the original event, highlighting the instance to be conceptualized. GPT2 generators use similar prompts, with the difference that the [SOS] and [EOS] special tokens are inserted to denote the start and end of the sentence, respectively.

For triple conceptualization, denote the head, relation, and tail of an abstract commonsense triple as (h, r, t), the abstract concept in the conceptualized head as c ⊂ h, and the retrieved instantiations as e_{r,1}, e_{r,2}, e_{r,3}, ..., e_{r,n}. The prompt generally follows the one used by He et al. (2022). For the teacher model, "[CLS], h_1, ..., h_|h|, [SEP], [r], [SEP], t_1, ..., t_|t|" is used as the prompt. Similarly, student models are trained with the prompt "[CLS], h_1, ..., h_|h| [SEP] [r] [SEP] t_1, ..., t_|t| [SEP] e_{r,1}, e_{r,2}, e_{r,3}, ..., e_{r,n}". A filled example using the case in Figure 2 is "relaxing event [SEP] because PersonX wanted [SEP] have fun [SEP] PersonX joins party, go on a holiday, Take a break." The commonsense relation within each triple is translated into human-readable text, as shown in Table 7.

Relation   Human Readable Text
xEffect    as a result, PersonX will
oEffect    as a result, PersonY or others will
xWant      as a result, PersonX want
oWant      as a result, PersonY or others want
xReact     as a result, PersonX feel
oReact     as a result, PersonY or others feel
xIntent    because PersonX wanted
xNeed      before that, PersonX needed
xAttr      PersonX is described as

Table 7: Textual prompts for commonsense relations (Fang et al., 2021b). A commonsense triple (h, r, t) is translated into the human-readable form "if h, [prompt] t".

The generative event conceptualization by GPT2 generators uses "[SOS] h_o [SEP] i [GEN]" as the input template, where [GEN] indicates the special token for generation. Commonsense inference modeling uses the same prompt as Hwang et al. (2021) and Fang et al. (2021b).

In addition, we observe that adding special tokens such as <c> and </c> can effectively boost performance, whereas adding textual guidelines such as "is an instance of" or "is a concept of" has no positive effect. The same trend is observed for the bootstrapping prompt, where adding extra text such as "is also instances of" or "can be instantiated to" harms the model significantly.

C Additional Experiments

In this section, we present additional details and experimental results for the CSKB conceptualization tasks (Appendix C.1) and for the applications and evaluations of CAT (Appendix C.2) that are not covered in the paper due to limited space.

C.1 CSKB Conceptualization

C.1.1 Baselines

For the supervised learning baselines of both discriminative conceptualization tasks, KG-BERT (Yao et al., 2019) is adapted as the skeleton of our baseline models. For BART, we use the embedding of the end-of-sentence token in the decoder as the representation of the input sequence. For the other models, the embedding of the [CLS] token is used as the representation vector. Linear layers are appended after the encoder to perform text classification.

For the semi-supervised baselines, we provide additional explanations of the different methods:

UDA. In the original UDA paper (Xie et al., 2020a), two data augmentation methods, back-translation and TF-IDF replacement, are implemented for unsupervised data augmentation. We leverage both methods in our conceptualization tasks as two different baselines. For the triple conceptualization task, we follow the same setting as proposed in PseudoReasoner (Fang et al., 2022). The back-translation method translates the original corpus from English to French and then back to English. Special replacements are applied to avoid corrupting special tokens. Meanwhile, the TF-IDF method replaces tokens in the original corpus with a probability of 0.1 according to their TF-IDF scores. For the event conceptualization task, we concatenate the head event and its annotated concept into one new sentence and feed it into the model. For the unlabeled conceptualizations, we enclose
the instance and concept with the special tokens <c> and </c>, the same as in our framework, and then use back-translation or TF-IDF to generate the augmented data. The input for triple conceptualization follows the same format as for the supervised baselines. We observe that these special tokens do not affect the translation significantly, as they are preserved in the translation output. Last but not least, the model θ is trained on a mixture of annotated data x_1 and augmented data x_2 using the consistency training loss, as shown in Equation 2.

J(θ) = E_{x_1 ∼ P_L(x)} [− log p_θ(y_1 | x_1)]
     + λ E_{x_2 ∼ P_U(x)} E_{x̂ ∼ q(x̂ | x_2)} [CE(p_θ̃(y | x_2) || p_θ(y | x̂))]   (2)

NoisyStudent. Noisy Student (Xie et al., 2020b) is an iterative training method that leverages a teacher-student paradigm. The teacher model is first trained on annotated data and is then asked to make predictions on the unlabeled data as pseudo-labels. Another student model with an equal or larger number of parameters is then trained on a mixture of annotated and pseudo-labeled data. Note that the pseudo labels, as numerical values, are directly used as the target labels. The trained student model then serves as a new teacher and re-labels the unlabeled data to yield better predictions. In our implementation, dropout or dynamic model depth is introduced as noise to the model. All models θ are trained with the standard cross-entropy loss, as shown in Equation 1. We set the dropout probability to 0.5, as it leads to the fastest convergence on our data. Only one iteration is completed in our experiment, as that is when the student model reaches its best result.

PseudoReasoner. PseudoReasoner (Fang et al., 2022) is another iterative semi-supervised learning framework, proposed to tackle the Commonsense Knowledge Base Population (CKBP) task (Fang et al., 2021a, 2023). It leverages a similar teacher-student paradigm and a novel filtering mechanism assisted by the student model. We replace the generative teacher model with a DeBERTa-v3-large model due to the disastrous performance GPT2 achieved on both verification tasks. Similar to CAT, two thresholds, T^+ = 0.9 and T^- = 0.1, are used to assign pseudo-labels to unlabeled data based on the predictions of the teacher model. The remaining steps are the same as described in the original paper. Similar to NoisyStudent, only one iteration is carried out for PseudoReasoner, as the student model already converges to its best.

C.1.2 Settings

We use pretrained language models from the Huggingface Transformers library (https://huggingface.co/docs/transformers; Wolf et al., 2020) to build our framework. The learning rate for all models is set to 5e-6, and the batch size is 64. We use an AdamW (Loshchilov and Hutter, 2019) optimizer and evaluate the model every 25 steps. The maximum sequence length for the tokenizer is set to 25 and 35 for the two discriminative tasks, respectively. Due to the imbalanced dataset, we evaluate the discriminative models with the Area Under Curve (AUC) score (Bradley, 1997). Early stopping is used, where the best checkpoint is selected when the largest validation AUC is achieved. All experiments are repeated three times with different random seeds, and the average performances and standard deviations are reported. In addition, we set the probability thresholds for both tasks to T^+ = 0.9 and T^- = 0.1 to determine the pseudo labels. The thresholds are roughly derived by observing the overall distribution and quality of the data satisfying the respective threshold. For the bootstrapping method, we bootstrap m = 9 additional concepts for event conceptualization verification and n = 2 additional instances for abstract triple verification. Detailed ablation studies are provided in Section 5.3. As for the computational infrastructure, the models are trained and evaluated on four NVIDIA RTX3090 (24G) and four NVIDIA 1080Ti (12G) graphics cards. The number of parameters for every model is reported in Table 11.

C.1.3 Additional Experiment Results

The full experimental results for the discriminative CSKB conceptualization tasks are reported in Table 11. All supervised learning baselines achieve results comparable to those reported by He et al. (2022). Supervised CAT will be discussed later. The results of semi-supervised CAT are generally consistent with our findings discussed in Section 5.1. To study the effect of the different components and the training regime of CAT, we conduct more detailed ablation studies in Appendix C.1.4.

C.1.4 Ablation Study

In this section, we study the effects of different components in CAT and the training strategy of
CAT. These studies indicate that our framework design and the proposed bootstrapping method play an important role in CSKB conceptualization and are more effective than simply leveraging unlabeled data with pseudo labels.

Framework Components. Our CAT framework consists of three critical components that distinguish CAT from traditional semi-supervised baselines. They are denoted as:

• Bootstrapping: assist the training of the student models by retrieving alternative conceptualizations and instantiations and bootstrapping them via natural language prompts. Dropping this component trains the student models with the original textual prompts that are also used by the teacher models.

• CAT Cycle: unite the event and triple conceptualization tasks by assigning negative pseudo labels to abstract triples whose conceptualized head is predicted to be a wrong conceptualization. Dropping this component separates the framework into two lines of training, i.e., training the event conceptualization and triple conceptualization models separately.

• Pseudo-label Refinement: refine the pseudo labels with the latest student models and re-train the student models. Dropping this component neither updates any pseudo label nor re-trains the student models.

Models                          Event.   Triple.
CAT (BERT-base)                 87.4     76.3
  w/o Bootstrapping             83.1     73.0
  w/o CAT Cycle                 86.5     75.1
  w/o Pseudo-label Refinement   87.4     76.2
CAT (DeBERTa-v3-large)          89.2     80.0
  w/o Bootstrapping             84.0     77.7
  w/o CAT Cycle                 88.1     79.0
  w/o Pseudo-label Refinement   89.1     79.7

Table 8: Ablation study on the three components of CAT. The three components refer to the explanations above. The column Event. indicates the test set AUC on the event conceptualization task, and the column Triple. indicates the test set AUC on the triple conceptualization task.

We then conduct ablation studies on these three components with semi-supervised CAT to demonstrate the effectiveness of our framework design and the proposed bootstrapping method. Each component is removed separately, and the test set performances of the student models are reported in Table 8. From the results, bootstrapping alternative conceptualizations and instantiations leads to the largest performance gain. Bridging event conceptualization discrimination with triple conceptualization also brings slight improvements. However, refining the pseudo labels and re-training the student models has barely any effect. Thus, our bootstrapping method is the most important component of the entire CAT framework and can effectively assist in learning conceptual knowledge.

Supervised CAT. We further study training CAT in a supervised learning setting to examine the role of unlabeled data. In supervised CAT, no teacher models are trained to provide pseudo labels. The alternative conceptualizations and instantiations are retrieved directly from the annotated event conceptualization data and bootstrapped afterward. The two student models are trained on the bootstrapped data only and evaluated on the same testing set, with the results reported in Table 11. Compared with the supervised learning baselines, supervised CAT achieves a comparable result on the event conceptualization task. This may be because the diversity of concepts drops without considering unlabeled conceptualizations. Improvements on the triple conceptualization task are more significant, and the results are comparable with semi-supervised CAT. This indicates that our framework design and bootstrapping method are successful in discriminating high-quality abstract commonsense knowledge, and that leveraging a semi-supervised learning paradigm benefits event conceptualization discrimination more.

C.2 Application and Evaluation of CAT

C.2.1 Settings

Pretrained GPT2 models from the Huggingface Transformers library and the training code by Hwang et al. (2021) (https://github.com/allenai/comet-atomic-2020) are used as our code base. The learning rate for all experiments is set to 1e-5, and the batch size is fixed to 64. We use an Adam (Kingma and Ba, 2015) optimizer and evaluate the model every 20 steps. The input and output lengths for the GPT2 models are fixed at 45 and 55 for the two application and evaluation tasks, respectively. These length settings cover all annotated conceptualizations and triples. For both generative experiments, we evaluate the generations with BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), ROUGE-L (Lin, 2004), and
CIDEr (Vedantam et al., 2015) scores. However, since an abstract concept usually contains only one or two tokens, we report only BLEU-1 and BLEU-2 scores for the generative event conceptualization task. Early stopping is also applied, where the best checkpoint is selected when the minimum autoregressive LM loss is achieved. In addition, we notice that the number of triples from the ATOMIC subset is much smaller than the number of abstract triples for the commonsense inference modeling task. Thus, we upsample the ATOMIC subset at a ratio of 1:2 across all experiments to guarantee a consistent and balanced amount of training data. For generative event conceptualization, the training data is simply a mixture of annotated and pseudo-labeled event conceptualizations without any balancing measure. All the models are trained and evaluated on four NVIDIA RTX A6000 graphics cards with 48G memory. The number of parameters is close to that of GPT2-XL, as reported in Table 11.

C.2.2 Annotation Settings

When evaluating the event conceptualization generator, expert annotations are conducted to evaluate concepts that are not present in the training set. Crowdsourcing platforms such as Amazon Mechanical Turk are not used, since experts understand conceptualization better and are more reliable for evaluation. The authors of this paper are therefore invited to serve as expert annotators; they are experienced in NLP research and clearly understand the scope of the paper. The annotation guideline is carefully designed. Each question presents the original head event with the instance highlighted and the corresponding conceptualization candidate to be annotated, and several positive and negative conceptualizations are attached as examples. The authors are well informed about the instructions and the intended use of their annotations in this paper, and they all agreed to annotate as part of their contributions. Moreover, to ensure that an expert cannot deliberately raise the plausibility rate of a certain set of annotation candidates, we randomly shuffle all the data and invite one more expert to cross-validate the annotations. These measures can ensure that the annotation process is free

[Figure 5 plot: BLEU and ROUGE-L curves against the pseudo-label threshold.]

Figure 5: Performance (%) curve by COMET (GPT2-XL) on the commonsense inference generation task with different thresholds for determining positive pseudo labels. Performance with the best threshold of 0.95 is marked as the red dotted line.

...menting with the effect of threshold tuning when filtering abstract commonsense knowledge. Multiple thresholds ranging from 0.5 to 0.995 are experimented with to derive abstract commonsense knowledge of different qualities. COMET (GPT2-XL) generators are fine-tuned on the ATOMIC subset, augmented with a mixture of annotated and pseudo-labeled abstract triples. The performance curve against the threshold is plotted in Figure 5, and full results with all metrics are reported in Table 19. It can be observed that gradually increasing the threshold from 0.75 leads to better performance, which may be due to the improvement in data quality. However, increasing the threshold beyond 0.95 causes a performance drop. One possible reason is that the amount of pseudo-labeled triples drops significantly at a relatively high threshold, and COMET fails to learn well from the annotated triples alone. Using the CAT framework to pseudo-label unlabeled abstract triples leads to better performance than leveraging a RoBERTa-large supervised discriminator to assign pseudo labels, which also validates the reliability of the triple conceptualization discriminator in CAT. Also, it is noticeable that training COMET with triples based on our constructed ATOMIC subset is much worse than training with the full ATOMIC dataset.
of ethical concerns and justifiable. This indicates that exposing the model with sub-
stantial factual commonsense knowledge is still
C.2.3 Additional Experiment Results important, and only equipping the model with ab-
We conduct a more comprehensive study on the stract commonsense knowledge is not enough for
commonsense inference generation task by experi- commonsense inference modeling.
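The threshold filtering step described above can be sketched as follows. This is a minimal illustration, not the released CAT code: the function name, dictionary fields, and example scores are assumptions, while the 0.95 cutoff follows the best threshold reported in this section.

```python
# Sketch of threshold filtering for pseudo-labeled abstract triples.
# Each candidate is assumed to carry a plausibility score assigned by the
# triple conceptualization discriminator; names and fields are illustrative.

def filter_pseudo_labels(scored_triples, threshold=0.95):
    """Keep only candidates whose discriminator score passes the threshold."""
    return [t for t in scored_triples if t["score"] >= threshold]

candidates = [
    {"triple": ("PersonX gets medical check", "xWant", "go to rest"), "score": 0.98},
    {"triple": ("PersonX watches movie", "xReact", "scared"), "score": 0.62},
]

kept = filter_pseudo_labels(candidates)
# Only the first candidate survives the 0.95 cutoff.
```

Raising the threshold trades pseudo-label quantity for quality, which matches the performance curve in Figure 5.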
D Computational Cost Analysis

In this section, we count the number of training examples used for both CSKB conceptualization tasks to compare the computational cost across different frameworks and methodologies empirically. Both annotated and pseudo-labeled data are counted, and the comparison is presented in Table 9. All semi-supervised learning methods leverage a significant amount of unlabeled data due to the great scarcity of annotations. With threshold filtering, PseudoReasoner (Fang et al., 2022) and our CAT framework abandon more than half of the poor-quality pseudo examples. Even so, our CAT framework still outperforms PseudoReasoner and achieves the best performance among all methods. Additionally, there is no notable increase in the number of model parameters, as CAT applies a teacher-student paradigm similar to Noisy-Student and PseudoReasoner; even compared with the supervised baselines, CAT only doubles the parameters used. In conclusion, with training data and parameter budgets comparable to the other baselines, CAT achieves much better results and state-of-the-art performances.

Method                 Event.      Triple.      Total
Supervised Baselines   107,384     65,386       172,770
UDA                    412,367     4,916,658    5,329,025
Noisy-Student          412,367     4,916,658    5,329,025
PseudoReasoner         316,601     1,727,865    2,044,466
CAT                    317,507     1,595,411    1,912,918

Table 9: Comparison between the number of training data for the discriminative event conceptualization (Event.) and triple conceptualization (Triple.) tasks.

E Case Studies

This section contains case studies of the four tasks we studied in this paper, including the CSKB conceptualization tasks and the applications of CAT. Through these cases, we would like to offer a clearer view of the data, discuss the challenges of the conceptualization task, and provide brief error analyses.

E.1 CSKB Conceptualization

Event Conceptualization. For discriminative event conceptualization, the case study is shown in Table 15. From these cases, it can be observed that several instances i can be identified within one head event ho, and each of them can be conceptualized in multiple ways. Formally, assume we are conceptualizing m events, each with n instances, and each instance i concerned can be conceptualized as p concepts. Each concept takes the majority vote of q annotators to verify. Subsequently, the number of annotations needed is O(mnpq), which grows significantly if we conceptualize a commonsense knowledge base at scale. Thus, it is extremely infeasible for practitioners to annotate all of the conceptualizations for verification, which also highlights the importance of a reliable discriminative conceptualization model like the one acquired by CAT. Semi-supervised learning is also an ideal training strategy, as there is a considerable amount of unlabeled data.

Analyzing the errors made by our discriminator, we observe that models frequently make errors when the instance contains the word “PersonX,” which could be caused by reporting bias (Gordon and Durme, 2013), as “PersonX” is seldom used in normal natural language texts. Replacing the subjects with commonly used names such as “Alex” or “Bob” may alleviate this problem. Additionally, models make errors on some rarely seen concepts, such as “organ,” “cognitive ability,” and “side effect.” Their absence from the training data can partially cause this, as CSKBs like ATOMIC may not cover many instances under those rarely used concepts.

Triple Conceptualization. For triple conceptualization discrimination, case studies are shown in Table 17. Similar to the analysis above, consider m events with n instances, each instance with p concepts. Assume that every ATOMIC head event has t relation-tail tuples as its counterpart, and q votes are required from annotators. The total number of annotations is then O(mnptq) for verifying all abstract commonsense triples, which is also huge compared with the total number of original commonsense triples.

The errors are mainly due to the loss of contextualization within the original head events, as conceptualized head events with too high abstractness are likely to omit salient properties. For example, conceptualizing “watching a scary movie” as “watching movie” loses the property “scary,” which further leads to wrong abstract commonsense knowledge if the tail is “feel scared.” This also highlights the importance of verifying the plausibility of abstract commonsense knowledge that
heavily relies on both the contextualization brought by r, t and the conceptualization of the head event. Meanwhile, we observe that the models tend to make a neutral decision (plausibility score close to 0.5) when conceptualizing an entire event as a concept with a high level of abstractness. Indeed, such cases are more difficult pieces of abstract commonsense knowledge for machines to learn, as a higher level of abstractness leads to more possible instantiations and commonsense inferences.

E.2 Application of CAT

Generative Event Conceptualization. The examples are shown in Table 16. Generated conceptualizations are generally plausible, given the head event as the context. Specifically, we observe that neural generators are more sensitive to the instance and its context, as heuristic matching may conceptualize both “sleeping at night” and “having trouble sleeping at night” as “sleeping”. In contrast, neural generators can distinguish these two instances clearly by conceptualizing them as “sleep” and “sleep disorder”. One potential weakness of neural generators is that the generated conceptualizations lack diversity and novelty (Du et al., 2019; Wang et al., 2021), as they tend to be semantically close to the target conceptualizations in the training samples. Nevertheless, they still offer a reliable and simplified approach to performing contextualized conceptualization without tedious matching and human annotations. Such results also validate the reliability of our discriminative event conceptualization model, as the pseudo-labeled conceptualizations tend to be of high quality.

Commonsense Inference Modeling (COMET). Generations from COMET models trained only on the ATOMIC subset, possibly augmented by abstract commonsense triples, are compared in Table 18. From these generations, we can observe that the COMET generator aided by abstract commonsense knowledge generates tail events that are more plausible and generalizable than those from the model trained on ATOMIC only. This generally supports our hypothesis that abstract commonsense knowledge may implicitly help model situational commonsense inference, even without the instantiation step. In addition, this also validates that our automatically derived abstract knowledge is reliable and helpful, which further proves the reliability of our triple conceptualization discriminator.

E.3 Conceptualization by Large Language Models

With the recent advances of Large Language Models (LLMs) such as GPT3.5 (Brown et al., 2020; Ouyang et al., 2022) and ChatGPT (OpenAI, 2022) on various NLP tasks (Qin et al., 2023; Bian et al., 2023; Chan et al., 2023; Amin et al., 2023), we also aim to explore ChatGPT’s conceptualization ability through case studies. To do so, we investigate ChatGPT’s performance on three conceptualization tasks: discriminative event conceptualization, discriminative triple conceptualization, and generative event conceptualization, all of which are defined in Section 3. We randomly sample data entries from AbstractATOMIC and prompt ChatGPT with natural language commands to perform the tasks. The prompts used for these tasks are listed in Table 10. Specifically, we use OpenAI’s API⁶ to prompt ChatGPT and retrieve its generations.

The case studies for the three tasks are presented in Table 12, Table 13, and Table 14, respectively. They demonstrate ChatGPT’s strong conceptualization abilities in both discriminative and generative manners. While ChatGPT can accurately determine most event conceptualizations and abstract commonsense knowledge, it still makes some mistakes. This highlights the value of training a performant discriminator through CAT, as it can effectively detect incorrect conceptualizations and implausible abstract commonsense knowledge. Additionally, ChatGPT tends to conceptualize instances using synonyms (Hagiwara et al., 2006), hypernyms (Yu et al., 2020), and paraphrased or explained terms rather than higher-level concepts. This underscores the importance of our event conceptualization generator, which can generate precise and concise event conceptualizations. In conclusion, our work holds significant value in the realm of commonsense reasoning through conceptualization, particularly in light of the rise of large language models.

6 The code for the model is gpt-3.5-turbo, and the date of access is May 2023.
Task                                        Prompt

Discriminative Event Conceptualization      Given the event <event>, can the <instance> be conceptualized as
                                            <concept>? Only answer yes or no without any other words. You are
                                            forced to make a decision.

Discriminative Triple Conceptualization     Given a commonsense knowledge triple, <head, relation, tail>, is this
                                            knowledge plausible or not? Only answer yes or no without any
                                            other word. You are forced to make a decision.

Generative Event Conceptualization          Given the event <event>, what are possible conceptualizations of
                                            <instance>? Only list out five short conceptualizations, and do
                                            not provide explanations.
Table 10: Natural language prompts used to instruct ChatGPT to perform specific tasks. Words in italics and
enclosed by brackets indicate inputs replaced by sampled data entries. Restrictive commands are appended at the
end to ensure ChatGPT executes the task as intended.
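A minimal sketch of how a Table 10 template could be instantiated for a sampled AbstractATOMIC entry. The template string mirrors the first row of Table 10; the helper function and the example entry are assumptions, not the released pipeline:

```python
# Fill the discriminative event conceptualization prompt from Table 10.
# Helper name and the sample entry are illustrative.

TEMPLATE = (
    "Given the event {event}, can the {instance} be conceptualized as "
    "{concept}? Only answer yes or no without any other words. "
    "You are forced to make a decision."
)

def build_prompt(event, instance, concept):
    return TEMPLATE.format(event=event, instance=instance, concept=concept)

prompt = build_prompt(
    event="PersonX makes oatmeal for breakfast",
    instance="oatmeal",
    concept="cereal",
)
```

The filled prompt would then be sent to the gpt-3.5-turbo chat endpoint and the yes/no answer mapped to a binary prediction.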
Framework            Backbone PTLM / Method             #Param.   Event Conceptualization    Triple Conceptualization
                                                                  Validation    Testing      Validation    Testing

Supervised           BERT-base                          110M      82.4±0.05     82.5±0.31    71.2±0.58     72.6±0.71
Learning             BERT-large                         340M      82.8±0.48     83.1±0.80    72.4±0.01     73.7±0.00
                     BART-base                          139M      83.8±0.28     84.4±0.32    72.0±0.09     72.6±0.15
                     BART-large                         406M      85.0±0.13     85.2±0.22    74.5±0.13     76.2±0.19
                     RoBERTa-base                       110M      84.1±0.04     84.5±0.19    72.2±0.00     74.1±0.00
                     RoBERTa-large                      340M      85.2±0.24     85.5±0.02    75.3±0.00     76.9±0.01
                     DeBERTa-v3-base                    214M      85.1±0.08     85.8±0.07    73.9±0.10     75.9±0.04
                     DeBERTa-v3-large                   435M      85.8±0.05     86.2±0.15    76.9±0.03     78.0±0.02
                     ELECTRA-base                       110M      85.4±0.05     85.8±0.02    74.3±0.27     76.2±0.12
                     ELECTRA-large                      340M      84.7±0.47     85.3±0.38    75.6±0.01     77.9±0.06
                     GPT2-base                          117M      60.0±0.06     59.1±0.14    52.8±0.14     55.9±0.11
                     GPT2-medium                        345M      61.2±0.11     60.3±0.08    54.6±0.17     57.4±0.09
                     GPT2-large                         774M      64.1±0.05     62.7±0.08    60.5±0.11     59.8±0.06
                     GPT2-XL                            1558M     64.2±0.19     63.6±0.22    62.2±0.08     61.5±0.10

Semi-Supervised      UDA (TF-IDF)                                 83.6±0.29     83.6±0.24    75.8±1.26     76.8±1.34
Learning             UDA (back-trans.)                            83.4±0.27     83.6±0.24    75.8±1.25     76.8±1.34
                     Noisy-Student                                86.4±0.05     86.5±0.09    75.4±0.64     76.7±0.59
                     PseudoReasoner (BERT-base)                   83.3±0.11     84.0±0.24    73.0±0.14     74.1±0.33
                     PseudoReasoner (RoBERTa-large)               86.6±0.25     86.7±0.33    76.3±0.12     77.2±0.21

CAT                  BERT-base                          110M      83.9±0.42     84.5±0.43    73.4±0.32     73.3±0.23
(Supervised)         BERT-large                         340M      82.8±0.48     83.1±0.80    72.4±0.01     73.7±0.00
                     BART-base                          139M      84.9±0.05     85.4±0.08    75.2±0.06     76.9±0.21
                     BART-large                         406M      86.2±0.05     86.0±0.06    76.8±0.21     78.7±0.31
                     RoBERTa-base                       110M      85.5±0.06     86.0±0.06    76.6±0.12     77.2±0.18
                     RoBERTa-large                      340M      86.2±0.31     86.2±0.31    77.7±0.19     78.5±0.28
                     DeBERTa-v3-base                    214M      85.8±0.15     86.2±0.07    76.8±0.28     79.0±0.20
                     DeBERTa-v3-large                   435M      86.3±0.11     86.7±0.08    78.4±0.20     79.5±0.18
                     ELECTRA-base                       110M      85.5±0.12     85.7±0.08    76.7±0.05     77.3±0.16
                     ELECTRA-large                      340M      86.2±0.66     86.0±0.62    77.8±0.11     78.5±0.09

CAT                  BERT-base                          110M      87.1±0.06     87.4±0.11    74.3±0.26     76.3±0.38
(Semi-Supervised)    BERT-large                         340M      87.7±0.16     88.0±0.19    75.8±0.23     77.8±0.36
                     BART-base                          139M      88.2±0.09     88.2±0.09    75.7±0.09     78.0±0.14
                     BART-large                         406M      88.6±0.07     88.7±0.10    77.2±0.12     79.0±0.14
                     RoBERTa-base                       110M      88.4±0.12     88.3±0.08    76.9±0.16     78.0±0.19
                     RoBERTa-large                      340M      89.0±0.15     88.8±0.20    78.2±0.08     79.4±0.14
                     DeBERTa-v3-base                    214M      88.8±0.12     88.9±0.08    77.5±0.10     79.9±0.07
                     DeBERTa-v3-large                   435M      89.1±0.05     89.2±0.14    78.7±0.16     80.0±0.33
                     ELECTRA-base                       110M      88.7±0.10     88.9±0.10    74.9±0.15     75.5±0.40
                     ELECTRA-large                      340M      88.6±0.77     88.5±0.70    74.9±0.15     75.5±0.40
Table 11: Full experiment results (%) by our CAT framework on the discriminative event conceptualization and
triple conceptualization tasks. We report the average AUC score and standard deviation across experiments with
three random seeds. The best performances within each framework are underlined, and the best among all models
are bold-faced. All supervised baselines are comparable with experiment results by He et al. (2022).
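Each cell of Table 11 is the mean AUC with its standard deviation over three random seeds; the aggregation can be sketched as below. The three seed scores are made up, and the table's exact rounding may differ:

```python
# Turn per-seed AUC scores into a "mean±std" cell as in Table 11.
# Example scores are hypothetical.
from statistics import mean, stdev

def summarize(scores):
    return f"{mean(scores):.1f}±{stdev(scores):.2f}"

cell = summarize([85.2, 85.5, 85.8])  # "85.5±0.30"
```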
Head Event Instance Concept Label Pred.
the invitation personal communication X X
the invitation party idea × X
the invitation friendly approach X X
PersonX accepts the invitation item × X
the invitation PersonX accepts the invitation acceptance X X
PersonX accepts the invitation approach × ×
PersonX accepts the invitation psychological treatment × ×
PersonX accepts the invitation personal communication X X
oatmeal ingredient × X
oatmeal cereal X X
oatmeal grain food X X
PersonX makes oatmeal breakfast service × ×
for breakfast breakfast meal X X
PersonX makes oatmeal for breakfast hands-on activity X X
PersonX makes oatmeal for breakfast extended school activity × X
PersonX makes oatmeal for breakfast cooking X X
Table 12: Case study of ChatGPT’s discriminative event conceptualizations. Label refers to annotation result and
Pred. stands for prediction by ChatGPT.
Conceptualized Head Event Relation Tail Event Label Pred.
xEffect to be brave X ×
xWant take medicine X X
xWant leave the hotel × X
xWant to drive home × X
medical check
xAttr cautious X X
xWant go to rest X X
xAttr diseased X ×
xNeed get injured X ×
xEffect laugh X ×
xWant to be entertained X X
xNeed to go to video store X ×
xIntent entertain themselves X X
watching movie
xWant to put movie in DVD player X X
xAttr satisfied X ×
xReact scared X X
xNeed rent it on Netflix X X
Table 13: Case study of ChatGPT’s discriminative triple conceptualizations. Underlined words in the head event
refer to conceptualizations. Label refers to annotation result and Pred. stands for prediction by ChatGPT.
Event                        Target Conceptualizations      CAT’s Generations              ChatGPT’s Generations
PersonX is having trouble time, night, nonwork night, evening, time, Dark sleep time, nocturnal period, rest-
sleeping at night time, shift late, darknight ing hours, nighttime sleeplessness, bed-
time wakefulness
PersonX is having trouble relaxation, sleeping, rest, sleep, resting, daily Nightly slumber attempt, dark rest strug-
sleeping at night resting routine, sleeping gle, sleeplessness after dark, nocturnal
insomnia, bedtime wakefulness
PersonX is having trouble sleeping, disorder, sleep disorder, problem, Nighttime sleep difficulty, dark restless-
sleeping at night sleep problem, trou- sleep disturbance, diffi- ness problem, nocturnal insomnia strug-
ble, insomnia, sleep culty, trouble sleeping gle, bedtime wakefulness issue, sleep-
disorder lessness after dark challenge
Table 14: Case study of ChatGPT’s generative event conceptualizations. The instance candidate in each event is
underlined. Target conceptualizations are positive conceptualizations extracted from AbstractATOMIC, including
the annotated conceptualizations and ones that are positively pseudo-labeled by our framework.
Head Event Instance Concept Label Pred.
night nonwork time X X
night night X X
sleeping at night lifestyle factor X ×
PersonX is having trouble sleeping at night basic need X X
sleeping at night trouble sleeping at night board game × ×
trouble sleeping at night problem X X
PersonX is having trouble sleeping at night variable × ×
PersonX is having trouble sleeping at night personal characteristic X X
friends person X X
friends support person X X
making friends relationship X X
PersonX is nervous about making friends social activity X X
making friends nervous about making friends organ × X
nervous about making friends side effect × X
PersonX is nervous about making friends emotion X X
PersonX is nervous about making friends nervous disorder X X
the piano instrument X X
the piano western instrument X X
how to play the piano musical activity X X
PersonX wants to learn how how to play the piano play X ×
to play the piano to learn how to play the piano button × ×
to learn how to play the piano learning activity X X
PersonX wants to learn how to play the piano cultural event × ×
PersonX wants to learn how to play the piano cognitive ability X ×
PersonX’s pants pant X ×
PersonX’s pants clothing X X
PersonX’s leg leg X ×
PersonX puts PersonX’s pants PersonX’s leg limb X ×
on PersonX’s leg at a time a time resource × ×
a time time X X
PersonX puts PersonX’s pants on PersonX’s leg dressing X X
PersonX puts PersonX’s pants on PersonX’s leg action × ×
Table 15: Case study of CAT’s discriminative event conceptualizations. A head event can be conceptualized in
multiple ways, as shown in the table. Label refers to annotation result and Pred. stands for prediction by our
framework.
Event                        Target Conceptualizations                  Generated Conceptualizations
PersonX is having trouble time, night, nonwork time, shift night, evening, time, late, darknight
sleeping at night
PersonX is having trouble relaxation, sleeping, resting rest, sleep, resting, daily routine, sleeping
sleeping at night
PersonX is having trouble sleeping, disorder, sleep problem, trouble, sleep disorder, problem, sleep disturbance,
sleeping at night insomnia, sleep disorder difficulty, trouble sleeping
PersonX gets great grades in accomplishment, result, grades, good achievement, grades, good grade, aca-
school performance, achievement demic excellence, grade
PersonX asks what was wrong problems, concern, seeking information, query, question, asking, communication,
questioning, query, communication inquiry
PersonX needs new shoes necessity, product, personal item, item, requirement, item, need, necessity, needs
clothing, shoes
PersonX is failing math negative experience, negative issue, difficulty, poor performance, problem, aca-
problem, poor performance demic failure, math problem
Table 16: Case study of CAT’s generative event conceptualizations. The instance candidate in each event is
underlined. Target conceptualizations are positive conceptualizations extracted from AbstractATOMIC, including
the annotated conceptualizations and ones that are positively pseudo-labeled by our framework.
Conceptualized Head Event Relation Tail Event Label Pred.
xAttr rich X X
xAttr skillful × X
xIntent look pretty X X
xNeed book an appointment X X
PersonX gets nailcare service
xEffect show off X X
xReact excited X X
oWant to tell her they like them X X
xWant to go home X X
xEffect laugh X X
xWant to be entertained X X
xNeed to go to video store X ×
xIntent entertain themselves X X
watching movie
xWant to put movie in DVD player X X
xAttr satisfied X X
xReact scared X ×
xNeed rent it on Netflix X X
xEffect to be brave X X
xWant take medicine X X
xWant leave the hotel × X
xWant to drive home × X
medical check
xAttr cautious X X
xWant go to rest X X
xAttr diseased X X
xNeed get injured X X
Table 17: Case study of CAT’s discriminative triple conceptualizations. The abstract concept within each con-
ceptualized head event is underlined. Label refers to annotation result and Pred. stands for prediction by our
framework.
Head Relation Source Tail
ATOMIC to tip PersonX
PersonX washes PersonY’s car oWant COMETATOMIC to wash their car
COMETCAT to thank PersonX
ATOMIC to practice
PersonX meets PersonX’s standards xNeed COMETATOMIC to study
COMETCAT to practice hard
ATOMIC to give PersonY something
PersonX stretches out PersonX’s hand xWant COMETATOMIC to touch
COMETCAT to grab something for PersonY
ATOMIC interested
PersonX learns how to bake a cake xAttr COMETATOMIC curious
COMETCAT skilled
ATOMIC to retake the class
PersonX fails PersonX’s class xWant COMETATOMIC to study hard
COMETCAT to try again in the class
ATOMIC X gets receipt
PersonX buys dog food xEffect COMETATOMIC loses weight
COMETCAT gets a receipt
ATOMIC has hair burned
PersonX hits by lightning xEffect COMETATOMIC gets electrocuted
COMETCAT screams in pain
ATOMIC is chastised
PersonX forgets my wallet xEffect COMETATOMIC gets robbed
COMETCAT thinks about it
ATOMIC make a plan
PersonX realizes something xWant COMETATOMIC to solve the problem
COMETCAT to do something about it
Table 18: Case study of commonsense inference generation (COMET). Examples are selected from the original
ATOMIC testing set. ATOMIC refers to the target tail in the original ATOMIC. COMETATOMIC and COMETCAT
stand for generations by COMET trained on an ATOMIC subset or aided with abstract knowledge derived by CAT.
Training Data          BLEU-1       BLEU-2       BLEU-3       BLEU-4       METEOR       ROUGE-L      CIDEr
                       Dev   Test   Dev   Test   Dev   Test   Dev   Test   Dev   Test   Dev   Test   Dev   Test

Zero-Shot              5.42  4.89   1.84  1.51   0.65  0.52   0.26  0.21   6.50  5.70   6.40  5.90   1.60  1.20
ATOMIC (subset)        38.1  38.1   25.4  25.7   18.7  18.8   15.5  15.7   14.9  14.9   33.0  33.2   27.6  27.8
+Dtl                   38.1  38.5   24.8  25.5   17.8  18.4   14.7  15.2   15.3  15.6   33.1  33.7   26.8  27.3
  +Finetune            38.6  39.0   25.8  26.6   18.9  19.7   15.7  16.4   15.1  15.4   33.6  34.4   28.8  30.0
+Du (Abs.ATM.)         40.0  40.3   27.1  27.8   20.0  20.8   16.5  17.5   16.1  16.3   35.3  35.7   31.6  31.7
  +Finetune            40.1  40.5   27.1  27.8   20.1  20.8   16.7  17.4   16.2  16.4   35.4  35.9   31.8  31.7
+Dtl + Du (Abs.ATM.)   40.2  40.6   26.2  27.4   19.0  20.4   15.1  16.8   16.3  16.5   35.0  35.4   31.0  31.3
  +Finetune            40.0  40.4   26.0  26.9   18.7  19.7   15.0  16.1   16.3  16.4   35.0  35.4   30.3  30.7
+Du (0.995)            39.7  39.8   26.5  26.8   19.5  19.8   15.6  16.1   15.8  15.8   35.0  34.9   30.8  30.7
  +Finetune            41.0  41.0   27.1  27.5   20.0  20.2   16.1  16.3   16.7  16.6   36.0  35.9   31.9  31.7
+Du (0.99)             39.5  39.9   26.1  27.0   19.3  20.0   15.9  16.6   15.7  15.9   34.7  34.8   30.6  30.8
  +Finetune            40.8  41.0   27.0  27.6   20.0  20.5   16.2  16.9   16.7  16.6   35.8  35.7   31.9  31.6
+Du (0.95)             41.2  41.9   28.1  29.0   20.7  21.5   16.5  17.8   16.6  16.9   35.9  36.5   33.4  33.7
  +Finetune            41.1  42.0   28.0  29.0   20.4  21.5   16.4  17.6   16.6  17.0   36.0  36.8   33.2  33.8
+Du (0.90)             41.6  41.6   28.1  28.5   20.9  21.5   17.1  17.7   16.9  16.8   36.7  36.4   33.4  33.1
  +Finetune            41.8  41.7   28.3  28.5   21.0  21.4   17.0  17.5   17.0  17.0   36.7  36.6   33.4  33.1
+Du (0.85)             41.3  41.4   27.8  28.1   20.7  21.1   16.8  17.6   16.7  16.8   36.3  36.6   32.6  32.9
  +Finetune            41.5  41.5   27.9  28.2   20.6  21.1   16.8  17.5   16.8  16.9   36.3  36.7   32.6  33.0
+Du (0.80)             41.6  41.6   27.3  28.0   20.1  20.7   16.3  17.0   17.0  16.9   36.6  36.4   33.0  32.6
  +Finetune            41.6  41.5   27.5  27.9   20.2  20.6   16.3  16.8   17.0  16.9   36.6  36.3   33.0  32.3
+Du (0.75)             40.6  40.8   27.1  28.0   19.9  20.9   16.2  17.2   16.4  16.6   35.5  35.7   31.6  32.1
  +Finetune            40.9  41.2   27.2  28.1   19.9  21.0   16.2  17.0   16.6  16.9   35.7  36.1   31.8  32.7
+Du (0.70)             40.6  40.9   27.1  27.8   19.9  20.7   16.6  17.2   16.4  16.6   35.6  36.1   31.6  32.4
  +Finetune            41.4  41.4   27.5  28.1   20.1  21.0   16.4  17.4   16.9  16.9   36.2  36.4   32.5  33.0
+Du (0.50)             41.1  41.5   27.3  28.2   20.4  21.2   16.7  17.6   16.7  16.7   35.8  36.1   32.4  32.8
  +Finetune            41.5  41.7   27.7  28.5   20.7  21.4   17.0  17.8   16.9  16.9   36.3  36.5   32.7  33.1
+Dtl + Du (0.995)      39.4  39.3   26.1  26.4   19.2  19.5   15.5  15.8   15.7  15.5   33.9  33.8   29.8  29.2
  +Finetune            39.7  40.0   26.7  27.5   19.5  20.3   15.8  16.6   15.7  15.7   34.7  34.9   30.6  30.9
+Dtl + Du (0.99)       39.4  39.7   25.7  26.5   18.6  19.5   15.2  16.5   15.8  15.9   34.6  35.0   29.7  30.2
  +Finetune            39.7  40.4   26.6  27.6   19.6  20.5   16.0  16.8   15.7  16.1   34.2  35.0   30.5  31.1
+Dtl + Du (0.95)       39.9  40.5   26.2  27.4   19.3  20.6   16.0  17.4   16.0  16.2   35.0  35.4   30.8  31.3
  +Finetune            40.4  41.0   26.6  27.6   19.5  20.7   16.1  17.1   16.2  16.5   35.4  35.8   31.3  31.5
+Dtl + Du (0.90)       39.4  39.7   26.1  27.0   18.9  19.9   15.3  16.4   15.6  15.8   34.5  35.0   29.6  30.2
  +Finetune            40.4  40.4   26.2  26.9   19.1  19.6   15.2  15.8   16.3  16.4   35.5  35.7   30.5  30.7
+Dtl + Du (0.85)       39.8  40.0   26.3  26.9   19.3  19.8   15.8  16.1   16.0  16.2   34.8  35.2   30.5  30.6
  +Finetune            39.9  40.0   26.2  26.7   19.3  19.5   15.8  15.8   16.1  16.3   34.9  35.5   30.4  30.7
+Dtl + Du (0.80)       39.9  40.4   26.4  27.6   19.2  20.5   15.4  16.8   16.2  16.3   34.9  35.3   30.3  31.3
  +Finetune            39.9  40.4   26.2  27.5   18.9  20.3   15.2  16.7   16.2  16.5   35.0  35.6   30.2  31.3
+Dtl + Du (0.75)       39.7  39.8   25.9  26.6   18.9  19.4   15.3  15.8   15.6  15.7   34.6  34.9   29.7  30.1
  +Finetune            39.8  39.9   25.9  26.7   18.8  19.5   15.3  15.9   15.7  15.9   34.7  35.1   29.6  30.3
+Dtl + Du (0.70)       40.2  40.5   26.4  27.2   19.4  20.1   15.8  16.4   16.4  16.5   35.2  35.5   30.8  31.0
  +Finetune            40.3  40.6   26.4  27.1   19.4  19.9   15.9  16.0   16.5  16.6   35.2  35.7   30.5  30.9
+Dtl + Du (0.50)       39.3  39.8   26.2  27.5   18.9  20.3   15.2  16.7   15.7  16.0   33.9  34.4   29.4  30.6
  +Finetune            39.5  40.1   26.3  27.6   19.0  20.5   15.4  17.1   15.8  16.2   34.2  34.9   29.3  30.8

ATOMIC (full)          42.7  42.9   29.6  30.0   22.0  22.5   18.6  18.7   29.1  29.7   51.1  52.7   74.5  75.4

Table 19: Full experiment results (%) by GPT2 (XL) on the commonsense inference generation (COMET) task. We evaluate the models on the original ATOMIC dev and test sets. Dtl stands for annotated abstract commonsense triples, and Du stands for unlabeled triples pseudo-labeled by our CAT framework; the value in parentheses is the threshold for selecting plausible pseudo labels. +Finetune refers to fine-tuning back on the training set of our constructed ATOMIC subset. Rows with the best performance, which are reported in the paper, are colored in gray. We also report performances by COMET trained on the complete ATOMIC training set in the bottom row.
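For reference, the modified n-gram precision at the core of the BLEU-n columns can be sketched in a simplified, self-contained form (single reference, whitespace tokenization, no brevity penalty); this is an illustration, not the evaluation script used for Table 19:

```python
# Simplified modified n-gram precision, the core quantity behind BLEU-n.
# Single reference, whitespace tokenization, no brevity penalty.
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    if not hyp_ngrams:
        return 0.0
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clip each hypothesis n-gram count by its count in the reference.
    clipped = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    return clipped / sum(hyp_ngrams.values())

p1 = ngram_precision("sleep disorder", "sleep disorder", 1)  # 1.0
p2 = ngram_precision("a b c", "a b d", 2)                    # 0.5
```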