
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Semantics-Aware BERT for Language Understanding



Zhuosheng Zhang,1,2,3,* Yuwei Wu,1,2,3,4,* Hai Zhao,1,2,3,† Zuchao Li,1,2,3 Shuailiang Zhang,1,2,3 Xi Zhou,5 Xiang Zhou5

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
4 College of Zhiyuan, Shanghai Jiao Tong University, China
5 CloudWalk Technology, Shanghai, China

{zhangzs, will8821}@sjtu.edu.cn, [email protected]

*These authors contributed equally. †Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011). Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The latest work on language representations carefully integrates contextualized features into language model training, which has enabled a series of successes, especially on various machine reading comprehension and natural language inference tasks. However, the existing language representation models, including ELMo, GPT and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling, and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor through light fine-tuning, without substantial task-specific modifications. Compared with BERT, SemBERT is as simple in concept but more powerful. It obtains new state-of-the-art results or substantially improves over prior results on ten reading comprehension and language inference tasks.

1 Introduction

Recently, deep contextual language models (LMs) have been shown effective for learning universal language representations, achieving state-of-the-art results in a series of flagship natural language understanding (NLU) tasks. Some prominent examples are Embeddings from Language Models (ELMo) (Peters et al. 2018), the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al. 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) and Generalized Autoregressive Pretraining (XLNet) (Yang et al. 2019). Providing fine-grained contextual embeddings, these pre-trained models can be easily applied to downstream models as the encoder or used directly for fine-tuning.

Despite the success of these well pre-trained language models, we argue that current techniques, which focus only on language modeling, restrict the power of the pre-trained representations. The major limitation of existing language models lies in taking only plain contextual features for both representation and training objective, rarely considering explicit contextual semantic clues. Even though well pre-trained language models can implicitly represent contextual semantics to some extent (Clark et al. 2019), they can be further enhanced by incorporating external knowledge. To this end, there is a recent trend of incorporating extra knowledge into pre-trained language models (Zhang et al. 2020b).

A number of studies have found that deep learning models might not really understand natural language texts (Mudrakarta et al. 2018) and are vulnerable to adversarial attacks (Jia and Liang 2017). Their observations show that deep learning models pay great attention to non-significant words while ignoring important ones. In the popular question answering challenge (Rajpurkar et al. 2016), we observe that a number of answers produced by previous models are semantically incomplete (as shown in Section 6.3), which suggests that current NLU models suffer from insufficient contextual semantic representation and learning.

Actually, NLU tasks share a similar purpose with sentence contextual semantic analysis. Briefly, semantic role labeling (SRL) over a sentence aims to discover who did what to whom, when and why with respect to the central meaning of the sentence, which naturally matches the task target of NLU. For example, in question answering tasks, questions are usually formed with who, what, how, when and why, which can be conveniently formalized as predicate-argument relationships in terms of contextual semantics.

In human language, a sentence usually involves various predicate-argument structures, while neural models encode a sentence into an embedding representation with little consideration of modeling these multiple semantic structures. Thus we are motivated to enrich sentence contextual semantics with multiple predicate-specific argument sequences by presenting SemBERT: Semantics-aware BERT, a fine-tuned BERT with explicit contextual semantic clues.
The proposed SemBERT learns the representation in a fine-grained manner and combines the strength of BERT on plain context representation with that of explicit semantics, for deeper meaning representation.

Our model consists of three components: 1) an off-the-shelf semantic role labeler to annotate the input sentences with a variety of semantic role labels; 2) a sequence encoder in which a pre-trained language model is used to build representations for the input raw texts, while the semantic role labels are mapped to embeddings in parallel; 3) a semantic integration component to integrate the text representation with the contextual explicit semantic embedding, yielding the joint representation for downstream tasks.

The proposed SemBERT can be directly applied to typical NLU tasks. Our model is evaluated on 11 benchmark datasets involving natural language inference, question answering, semantic similarity and text classification. SemBERT obtains new state-of-the-art results on SNLI and also obtains significant gains on the GLUE benchmark and SQuAD 2.0. Ablation studies and analysis verify that the introduced explicit semantics is essential to the further performance improvement, and that SemBERT essentially and effectively works as a unified semantics-enriched language representation model.¹

¹The code is publicly available at https://github.com/cooelf/SemBERT.

2 Background and Related Work

2.1 Language Modeling for NLU

Natural language understanding tasks require a comprehensive understanding of natural languages and the ability to do further inference and reasoning. A common trend among NLU studies is that models are becoming more and more sophisticated, with stacked attention mechanisms or large amounts of corpus data (Zhang et al. 2018; 2020a; Zhou, Zhang, and Zhao 2019), resulting in an explosive growth of computational cost. Notably, well pre-trained contextual language models such as ELMo (Peters et al. 2018), GPT (Radford et al. 2018) and BERT (Devlin et al. 2018) have proven powerful in boosting NLU tasks to new levels of performance.

Distributed representations have been widely used as a standard part of NLP models due to their ability to capture the local co-occurrence of words from large-scale unlabeled text (Mikolov et al. 2013). However, these approaches for learning word vectors only involve a single, context-independent representation for each word, with little consideration of contextual encoding at the sentence level. Thus the recently introduced contextual language models, including ELMo, GPT, BERT and XLNet, fill the gap by strengthening contextual sentence modeling for better representation. Among them, BERT uses a different pre-training objective, the masked language model, which allows capturing both sides of context, left and right. Besides, BERT also introduces a next sentence prediction task that jointly pre-trains text-pair representations. The latest evaluations show that BERT is both powerful and convenient for downstream NLU tasks.

The major technical improvement of these newly proposed language models over traditional embeddings is their focus on extracting context-sensitive features from language models. When such contextual word embeddings are integrated with existing task-specific architectures, ELMo helps boost several major NLP benchmarks (Peters et al. 2018), including question answering on SQuAD, sentiment analysis (Socher et al. 2013) and named entity recognition (Sang and De Meulder 2003), while BERT is especially effective on language understanding tasks such as GLUE, MultiNLI and SQuAD (Devlin et al. 2018). In this work, we follow this line of extracting context-sensitive features and take pre-trained BERT as our backbone encoder for jointly learning explicit context semantics.

2.2 Explicit Contextual Semantics

Although distributed representations, including the latest advanced pre-trained contextual language models, have already been strengthened with semantics to some extent in the linguistic sense (Clark et al. 2019), we argue that such implicit semantics may not be enough to support a powerful contextual representation for NLU, according to our observation of the semantically incomplete answer spans generated by BERT on SQuAD, which motivates us to directly introduce explicit semantics.

There are a few formal semantic frames, including FrameNet (Baker, Fillmore, and Lowe 1998) and PropBank (Palmer, Gildea, and Kingsbury 2005), of which the latter is more popularly implemented in computational linguistics. Formal semantics generally presents the semantic relationship as a predicate-argument structure. For example, given the following sentence with target verb (predicate) sold, all the arguments are labeled as follows:

[ARG0 Charlie] [V sold] [ARG1 a book] [ARG2 to Sherry] [AM-TMP last week].

where ARG0 represents the seller (agent), ARG1 represents the thing sold (theme), ARG2 represents the buyer (recipient), AM-TMP is an adjunct indicating the timing of the action, and V represents the predicate.

Parsing the predicate-argument structure is itself an NLP task, semantic role labeling (SRL) (Zhao, Chen, and Kit 2009; Zhao, Zhang, and Kit 2013). Recently, end-to-end neural SRL models have been introduced (He et al. 2017; Li et al. 2019); these tackle argument identification and argument classification in one shot. He et al. (2017) presented a deep highway BiLSTM architecture with constrained decoding, which is simple and effective, leading us to select it as our basic semantic role labeler. Inspired by these recent advances, we can easily integrate SRL into NLU.

3 Semantics-aware BERT

Figure 1 overviews our semantics-aware BERT framework. We omit the rather extensive formulations of BERT and recommend readers to get the details from Devlin et al. (2018). SemBERT is designed to be capable of handling multiple sequence inputs. In SemBERT, the words in the input sequence are passed to the semantic role labeler to fetch multiple predicate-derived structures of explicit semantics, and the corresponding embeddings are aggregated after a linear layer to form the final semantic embedding. In parallel, the input sequence is segmented into subwords (if any) by the BERT wordpiece tokenizer; the subword representations are then transformed back to the word level via a convolutional layer to obtain the contextual word representations. At last, the word representations and the semantic embedding are concatenated to form the joint representation for downstream tasks.
[Figure 1 sketches the SemBERT architecture for the input sentence "reconstructing dormitories will not be approved by cavanaugh": the BERT branch encodes the subword sequence with transformer (Trm) layers and recovers word-level representations via convolution and pooling, while the semantic branch maps each per-predicate role label sequence through a lookup table to label embeddings; the two branches are joined in the semantics integration component.]

For the text {reconstructing dormitories will not be approved by cavanaugh}, the tokenizer produces a subword-level sequence: {rec, ##ons, ##tructing, dorm, ##itor, ##ies, will, not, be, approved, by, ca, ##vana, ##ugh}. Meanwhile, there are two word-level semantic structures:

[ARG1: reconstructing dormitories] [ARGM-MOD: will] [ARGM-NEG: not] be [V: approved] [ARG0: by cavanaugh]
[V: reconstructing] [ARG1: dormitories] will not be approved by cavanaugh

Figure 1: Semantics-aware BERT. * denotes the pre-trained labeler, which is not fine-tuned in our framework.

3.1 Semantic Role Labeling

During data pre-processing, each sentence is annotated into several semantic sequences using our pre-trained semantic labeler. We adopt PropBank-style (Palmer, Gildea, and Kingsbury 2005) semantic roles to annotate every token of the input sequence with semantic labels. A given sentence may have various predicate-argument structures. As shown in Figure 1, for the text [reconstructing dormitories will not be approved by cavanaugh], there are two semantic structures, one for each predicate in the sentence.

To disclose these multiple dimensions of semantics, we group the semantic labels and integrate them with the text embeddings in the next encoding component. The input data flow is depicted in Figure 2.

3.2 Encoding

The raw text sequences and semantic role label sequences are first represented as embedding vectors to feed a pre-trained BERT. The input sentence $X = \{x_1, \ldots, x_n\}$ is a sequence of words of length $n$, which is first tokenized into word pieces (subword tokens). The transformer encoder then captures the contextual information for each token via self-attention and produces a sequence of contextual embeddings.

For the $m$ label sequences related to the predicates, we have $T = \{t_1, \ldots, t_m\}$, where $t_i$ contains $n$ labels denoted as $\{label_1^i, label_2^i, \ldots, label_n^i\}$. Since our labels are word-level, this length equals the original sentence length $n$ of $X$. We regard the semantic signals as embeddings: a lookup table maps these labels to vectors $\{v_1^i, v_2^i, \ldots, v_n^i\}$, which are fed to a BiGRU layer to obtain the label representation for each of the $m$ label sequences in latent space, $e(t_i) = BiGRU(v_1^i, v_2^i, \ldots, v_n^i)$ with $0 < i \leq m$. Letting $L_i$ denote the label sequences for token $x_i$, we have $e(L_i) = \{e(t_1), \ldots, e(t_m)\}$. We concatenate the $m$ sequences of label representations and feed them to a fully connected layer to obtain the refined joint representation $e^t$ in dimension $d$:

$e'(L_i) = W_2 [e(t_1), e(t_2), \ldots, e(t_m)] + b_2, \qquad e^t = \{e'(L_1), \ldots, e'(L_n)\}, \qquad (1)$

where $W_2$ and $b_2$ are trainable parameters.
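As a concrete illustration of this encoding step, the following is a minimal PyTorch sketch of the label encoder of Eq. (1). It is our own illustrative code, not the released SemBERT implementation; the class and argument names are invented, and the label embedding size (10) and the number of structures m (padded to 3) mirror the setup reported in Section 5.1.

```python
import torch
import torch.nn as nn

class SemanticLabelEncoder(nn.Module):
    """Embeds m per-predicate SRL label sequences, runs a BiGRU over each,
    then fuses the m representations per token with a linear layer (Eq. 1)."""

    def __init__(self, num_labels=104, label_dim=10, max_m=3):
        super().__init__()
        self.embedding = nn.Embedding(num_labels, label_dim)      # label lookup table
        self.bigru = nn.GRU(label_dim, label_dim, batch_first=True,
                            bidirectional=True)                   # e(t_i) = BiGRU(v_1^i, ..., v_n^i)
        self.fc = nn.Linear(2 * label_dim * max_m, 2 * label_dim)  # W_2, b_2 of Eq. (1)

    def forward(self, label_ids):
        # label_ids: (batch, m, n) word-level label indices, padded to m = max_m
        b, m, n = label_ids.shape
        v = self.embedding(label_ids.reshape(b * m, n))           # (b*m, n, label_dim)
        e, _ = self.bigru(v)                                      # (b*m, n, 2*label_dim)
        e = e.reshape(b, m, n, -1).transpose(1, 2)                # (b, n, m, 2*label_dim)
        e = e.reshape(b, n, -1)                                   # concatenate the m structures
        return self.fc(e)                                         # e^t: (b, n, d)
```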
[Figure 2 is an input-flow diagram: the input sentence is tokenized by BERT into subwords (rec, ##ons, ##tructing, dorm, ##itor, ##ies, ...), which are recovered into word-level embeddings, while the explicit semantic embeddings come from the word-level role label sequences (e.g., Verb, ARG1, O; and ARG1, MODAL, NEG, O, Verb, ARG0).]

Figure 2: The input representation flow.

3.3 Integration

The integration module fuses the lexical text embedding with the label representations. As the original pre-trained BERT is based on a sequence of subwords while our introduced semantic labels are on words, we need to align these differently sized sequences. We therefore group the subwords of each word and use a convolutional neural network (CNN) with max pooling to obtain a word-level representation. We select a CNN for its speed; our preliminary experiments also show that it gives better results than RNNs on the tasks we consider, where we believe the local features captured by the CNN are beneficial for subword-derived LM modeling.

Take one word as an example. Suppose word $x_i$ is made up of a sequence of subwords $[s_1, s_2, \ldots, s_l]$, where $l$ is the number of subwords for word $x_i$. Denoting the representation of subword $s_j$ from BERT as $e(s_j)$, we first apply a Conv1D layer, $e'_i = W_1 [e(s_i), e(s_{i+1}), \ldots, e(s_{i+k-1})] + b_1$, where $W_1$ and $b_1$ are trainable parameters and $k$ is the kernel size. We then apply ReLU and max pooling to the output embedding sequence for $x_i$:

$e^*_i = ReLU(e'_i), \qquad e(x_i) = MaxPooling(e^*_1, \ldots, e^*_{l-k+1}). \qquad (2)$

Therefore, the whole representation of the word sequence $X$ is $e^w = \{e(x_1), \ldots, e(x_n)\} \in \mathbb{R}^{n \times d_w}$, where $d_w$ denotes the dimension of the word embedding.

The aligned context and distilled semantic embeddings are then merged by a fusion function $h = e^w \diamond e^t$, where $\diamond$ represents the concatenation operation.²

²We also tried summation, multiplication and attention mechanisms, but our experiments show that concatenation is the best.

4 Model Implementation

We now introduce the specific implementation of SemBERT. SemBERT can serve as a front-end encoder for a wide range of tasks, and can also become an end-to-end model with only a linear layer for prediction. For simplicity, we only show the straightforward SemBERT that directly gives the predictions after fine-tuning.³

³We use only a single model for each task, without joint training or parameter sharing.

4.1 Semantic Role Labeler

To obtain the semantic labels, we use a pre-trained SRL module to predict all predicates and their corresponding arguments in one shot. We implement the semantic role labeler following Peters et al. (2018), achieving an F1 of 84.6%⁴ on the English OntoNotes v5.0 benchmark dataset (Pradhan et al. 2013) for the CoNLL-2012 shared task. At test time, we perform Viterbi decoding to enforce valid spans using BIO constraints. Our implementation uses 104 labels in total, with O for non-argument words and Verb for predicates.

⁴This result nearly reaches the SOTA in (He et al. 2018).

4.2 Task-specific Fine-tuning

Section 3 described how to obtain the semantics-aware BERT representations. Here, we show how to adapt SemBERT to classification, regression and span-based MRC tasks. We transform the fused contextual semantic and LM representation h to a lower dimension and obtain the prediction distributions. Note that this part is kept the same as the implementation in BERT, without any modification, to avoid extra influence and to focus on the intrinsic performance of SemBERT; we outline it here for completeness.

For classification and regression tasks, h is directly passed to a fully connected layer to obtain the class logits or score, respectively. The training objective is cross-entropy for classification tasks and mean squared error for regression tasks.

For span-based reading comprehension, h is passed to a fully connected layer to obtain the start logits s and end logits e of all tokens. The score of a candidate span from position i to position j is defined as $s_i + e_j$, and the maximum-scoring span with $j \geq i$ is used as the prediction.⁵ For prediction, we compare the score of the pooled first-token span, $s_{null} = s_0 + e_0$, to the score of the best non-null span, $\hat{s}_{i,j} = \max_{j \geq i}(s_i + e_j)$. We predict a non-null answer when $\hat{s}_{i,j} > s_{null} + \tau$, where the threshold $\tau$ is selected on the dev set to maximize F1.

⁵All the candidate scores are normalized by softmax.
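Likewise, the subword-to-word alignment and fusion of Section 3.3 can be sketched as below. This is again our own illustrative code under stated assumptions: one word's subword span is processed at a time, short words are padded so the valid convolution yields at least one position, and fusion is plain concatenation $h = e^w \diamond e^t$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubwordToWordPooling(nn.Module):
    """Conv1D + ReLU + max pooling over one word's subword vectors (Eq. 2)."""

    def __init__(self, hidden_dim=1024, kernel_size=3):
        super().__init__()
        # W_1, b_1: a 1-D convolution sliding over the subword positions
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size)

    def forward(self, subword_vecs):
        # subword_vecs: (l, hidden_dim), the BERT outputs e(s_1), ..., e(s_l)
        k = self.conv.kernel_size[0]
        if subword_vecs.size(0) < k:                      # pad words shorter than the kernel
            subword_vecs = F.pad(subword_vecs, (0, 0, 0, k - subword_vecs.size(0)))
        x = subword_vecs.t().unsqueeze(0)                 # (1, hidden_dim, l)
        x = F.relu(self.conv(x))                          # e*_i = ReLU(e'_i)
        return x.max(dim=2).values.squeeze(0)             # MaxPooling -> e(x_i)

def fuse(word_reprs, label_reprs):
    """h = e^w (concat) e^t: join along the feature dimension."""
    return torch.cat([word_reprs, label_reprs], dim=-1)
```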
Method             | CoLA (mc) | SST-2 (acc) | MNLI m/mm (acc) | QNLI (acc) | RTE (acc) | MRPC (F1) | QQP (F1) | STS-B (pc) | Score
Leaderboard (September, 2019)
ALBERT             | 69.1 | 97.1 | 91.3/91.0 | 99.2 | 89.2 | 93.4 | 74.2 | 92.5 | 89.4
RoBERTa            | 67.8 | 96.7 | 90.8/90.2 | 98.9 | 88.2 | 92.1 | 90.2 | 92.2 | 88.5
XLNet              | 67.8 | 96.8 | 90.2/89.8 | 98.6 | 86.3 | 93.0 | 90.3 | 91.6 | 88.4
In literature (April, 2019)
BiLSTM+ELMo+Attn   | 36.0 | 90.4 | 76.4/76.1 | 79.9 | 56.8 | 84.9 | 64.8 | 75.1 | 70.5
GPT                | 45.4 | 91.3 | 82.1/81.4 | 88.1 | 56.0 | 82.3 | 70.3 | 82.0 | 72.8
GPT on STILTs      | 47.2 | 93.1 | 80.8/80.6 | 87.2 | 69.1 | 87.7 | 70.1 | 85.3 | 76.9
MT-DNN             | 61.5 | 95.6 | 86.7/86.0 | -    | 75.5 | 90.0 | 72.4 | 88.3 | 82.2
BERT-Base          | 52.1 | 93.5 | 84.6/83.4 | -    | 66.4 | 88.9 | 71.2 | 87.1 | 78.3
BERT-Large         | 60.5 | 94.9 | 86.7/85.9 | 92.7 | 70.1 | 89.3 | 72.1 | 87.6 | 80.5
Our implementation
SemBERT-Base       | 57.8 | 93.5 | 84.4/84.0 | 90.9 | 69.3 | 88.2 | 71.8 | 87.3 | 80.9
SemBERT-Large      | 62.3 | 94.6 | 87.6/86.3 | 94.6 | 84.5 | 91.2 | 72.8 | 87.8 | 82.9

Table 1: Results on the GLUE benchmark (Classification: CoLA, SST-2; Natural Language Inference: MNLI, QNLI, RTE; Semantic Similarity: MRPC, QQP, STS-B; Score: average). The block In literature (April, 2019) shows the comparable results from (Liu et al. 2019; Radford et al. 2018) at the time of submitting SemBERT to GLUE (April, 2019).

5 Experiments

5.1 Setup

Our implementation is based on the PyTorch implementation of BERT.⁶ We use the pre-trained weights of BERT and follow the same fine-tuning procedure as BERT without any modification; all layers are tuned, with only a moderate increase in model size, as the extra SRL embedding volume is less than 15% of the original encoder size. We set the initial learning rate in {8e-6, 1e-5, 2e-5, 3e-5} with a warm-up rate of 0.1 and L2 weight decay of 0.01. The batch size is selected in {16, 24, 32}. The maximum number of epochs is set in [2, 5] depending on the task. Texts are tokenized into wordpieces, with a maximum length of 384 for SQuAD and 200 for the other tasks. The dimension of the SRL embedding is set to 10, and the default maximum number of predicate-argument structures m is set to 3. Hyper-parameters were selected using the dev set.

⁶https://github.com/huggingface/pytorch-pretrained-BERT

5.2 Tasks and Datasets

Our evaluation is performed on ten NLU benchmark datasets involving natural language inference, machine reading comprehension, semantic similarity and text classification. Some of these tasks come from the recently released GLUE benchmark (Wang et al. 2018), a collection of nine NLU tasks. We also extend our experiments to two widely used tasks, SNLI (Bowman et al. 2015) and SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), to show the superiority of our model.

Reading Comprehension As a widely used MRC benchmark dataset, SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018) combines the 100,000 questions of SQuAD 1.1 (Rajpurkar et al. 2016) with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. For SQuAD 2.0, systems must not only answer questions when possible, but also abstain from answering when no answer is supported by the paragraph.

Natural Language Inference Natural language inference involves reading a pair of sentences and judging the relationship between their meanings, such as entailment, neutral or contradiction. We evaluate on four diverse datasets: Stanford Natural Language Inference (SNLI) (Bowman et al. 2015), Multi-Genre Natural Language Inference (MNLI) (Nangia et al. 2017), Question Natural Language Inference (QNLI) (Rajpurkar et al. 2016) and Recognizing Textual Entailment (RTE) (Bentivogli et al. 2009).

Semantic Similarity Semantic similarity tasks aim to predict whether two sentences are semantically equivalent. The challenge lies in recognizing rephrased concepts, understanding negation, and handling syntactic ambiguity. Three datasets are used: the Microsoft Paraphrase Corpus (MRPC) (Dolan and Brockett 2005), the Quora Question Pairs (QQP) dataset (Chen et al. 2018) and the Semantic Textual Similarity benchmark (STS-B) (Cer et al. 2017).

Classification The Corpus of Linguistic Acceptability (CoLA) (Warstadt, Singh, and Bowman 2018) is used to predict whether an English sentence is linguistically acceptable. The Stanford Sentiment Treebank (SST-2) (Socher et al. 2013) provides a dataset for sentiment classification: determining whether the sentiment of a sentence extracted from movie reviews is positive or negative.
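For reference, the fine-tuning setup of Section 5.1 can be collected into a single configuration. The field names below are our own illustration and simply mirror the values stated there.

```python
# Hyper-parameter search space mirroring Section 5.1 (illustrative field names).
FINETUNE_CONFIG = {
    "learning_rate": [8e-6, 1e-5, 2e-5, 3e-5],    # searched per task on the dev set
    "warmup_rate": 0.1,
    "l2_weight_decay": 0.01,
    "batch_size": [16, 24, 32],
    "num_epochs_range": (2, 5),                    # task-dependent
    "max_seq_length": {"squad": 384, "default": 200},
    "srl_embedding_dim": 10,
    "max_predicate_structures_m": 3,
}
```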

Model                                  | EM   | F1
#1 BERT + DAE + AoA†                   | 85.9 | 88.6
#2 SG-Net†                             | 85.2 | 87.9
#3 BERT + NGM + SST†                   | 85.2 | 87.7
U-Net (Sun et al. 2018)                | 69.2 | 72.6
RMR + ELMo + Verifier (Hu et al. 2018) | 71.7 | 74.2
Our implementation
BERT-Large                             | 80.5 | 83.6
SemBERT-Large                          | 82.4 | 85.2
SemBERT*-Large                         | 84.8 | 87.9

Table 2: Exact Match (EM) and F1 scores on the SQuAD 2.0 test set for single models. † denotes the top 3 single submissions on the leaderboard at the time of submitting SemBERT (11 April, 2019). Most of the top results on the SQuAD leaderboard have no public model description, and any public data may be used for system training. We therefore further adopt synthetic self training (https://nlp.stanford.edu/seminar/details/jdevlin.pdf) for data augmentation, denoted as SemBERT*-Large.

Model                     | Dev  | Test
In literature
DRCN (Kim et al. 2018)    | -    | 90.1
SJRC (Zhang et al. 2019)  | -    | 91.3
MT-DNN (Liu et al. 2019)† | 92.2 | 91.6
Our implementation
BERT-Base                 | 90.8 | 90.7
BERT-Large                | 91.3 | 91.1
SemBERT-Base              | 91.2 | 91.0
SemBERT-Large             | 92.3 | 91.6

Table 3: Accuracy on the SNLI dataset. The previous state-of-the-art result is marked by †. Both our SemBERT and BERT are single models, fine-tuned from the pre-trained models.

Model          | Params (M) | Shared (M) | Rate
MT-DNN         | 3,060      | 340        | 9.1
BERT on STILTs | 335        | -          | 1.0
BERT           | 335        | -          | 1.0
SemBERT        | 340        | -          | 1.0

Table 4: Parameter comparison on LARGE models. The numbers are from the GLUE leaderboard (https://gluebenchmark.com/leaderboard).

5.3 Results
Table 1 shows results on the GLUE benchmark datasets: SemBERT gives substantial gains over BERT and outperforms all the previous state-of-the-art models in the literature.⁷ Since SemBERT takes BERT as its backbone with the same evaluation procedure, the gain is entirely owing to the newly introduced explicit contextual semantics. Although recent dominant models take advantage of multi-tasking, knowledge distillation, transfer learning or ensembling, our single model is lightweight and competitive, and even yields better results with a simpler design and fewer parameters. A model parameter comparison is shown in Table 4. We observe that even without multi-task learning like MT-DNN,⁸ our model still achieves remarkable results.

In particular, we observe substantial improvements on small datasets such as RTE, MRPC and CoLA, which demonstrates that involving explicit semantics helps the model work better with small training data. This matters for most NLP tasks, where large-scale annotated data is often unavailable.

Table 2 shows the results for reading comprehension on the SQuAD 2.0 test set.⁹ SemBERT substantially boosts the strong BERT baseline on both EM and F1. It also outperforms all the published works and achieves comparable performance with a few unpublished models from the leaderboard.

Table 3 shows that SemBERT also achieves a new state-of-the-art on the SNLI benchmark, and even outperforms all the ensemble models¹⁰ by a large margin.

⁷We find that the MNLI model can be effectively transferred to the RTE and MRPC datasets, thus the models for RTE and MRPC are fine-tuned based on our MNLI model.
⁸Since MT-DNN is a multi-task learning framework with shared parameters on 9 task-specific layers, we count the 340M shared parameters nine times for a fair comparison.
⁹Because of the restriction on submission frequency for online SQuAD 2.0 evaluation, we do not submit our base models.
¹⁰https://nlp.stanford.edu/projects/snli/. As ensemble models are commonly composed of multiple heterogeneous models and resources, we exclude them from our table to save space.

6 Analysis

6.1 Ablation Study

To evaluate the contributions of the key factors in our method, we perform an ablation study on the SNLI and SQuAD 2.0 dev sets, as shown in Table 6. Since SemBERT absorbs contextual semantics in a deep processing way, we wonder whether a simpler, more direct way of integrating such semantic information would still work. We therefore concatenate the SRL embedding with the BERT subword embeddings for a direct comparison, where each semantic role label is copied to the number of subwords of its original word, without the CNN and pooling for word-level alignment. From the results, we observe that this simple concatenation does yield an improvement, verifying that integrating contextual semantics is quite useful for language understanding. However, SemBERT still outperforms the simple BERT+SRL model, just as the latter outperforms the original BERT by a large margin, which shows that SemBERT integrates plain contextual representation and contextual semantics more effectively.

6.2 The Influence of the Number m

We investigate the influence of the maximum number of predicate-argument structures m by setting it from 1 to 5. Table 7 shows the result: a modest value of m works best.

Question                                                        | Baseline     | SemBERT
What is a very seldom used unit of mass in the metric system?   | The ki       | metric slug
What is the lone MLS team that belongs to southern California?  | Galaxy       | LA Galaxy
How many people does the Greater Los Angeles Area have?         | 17.5 million | over 17.5 million

Table 5: Comparison of answers from the baseline and our model. In these examples, the answers from SemBERT are the same as the ground truth.

Model            | SNLI Dev | SQuAD 2.0 EM | SQuAD 2.0 F1
BERT-Large       | 91.3     | 79.6         | 82.4
BERT-Large + SRL | 91.5     | 80.3         | 83.1
SemBERT-Large    | 92.3     | 80.9         | 83.6

Table 6: Analysis on the SNLI and SQuAD 2.0 datasets.

Number m | 1     | 2     | 3     | 4     | 5
Accuracy | 91.49 | 91.36 | 91.57 | 91.29 | 91.42

Table 7: The influence of the max number of predicate-argument structures m.
6.3 Model Prediction

To gain an intuitive view of SemBERT's predictions, Table 5 shows a list of prediction examples on SQuAD 2.0 from the baseline BERT and SemBERT.¹¹ The comparison indicates that our model extracts more semantically accurate answers, yielding more exact-match answers, while those from the baseline BERT model are often semantically incomplete. This shows that utilizing explicit semantics has the potential to guide the model to produce meaningful predictions. Intuitively, the improvement can be attributed to a better awareness of semantic role spans, which guides the model to explicitly learn patterns like who did what to whom.

Through this comparison, we observe that SemBERT might benefit from better span segmentation via span-based SRL labeling. We conduct a case study on our best SQuAD 2.0 model by transforming SRL into segmentation tags that indicate whether a token is inside or outside a segmented span. The result is 83.69 (EM) / 87.02 (F1), which shows that the segmentation indeed works, but is only marginally beneficial compared with our complete architecture.

It is worth noting that we are motivated to use the SRL signals to help the model capture span relationships inside a sentence, which yields both semantic label hints and segmentation benefits across semantic role spans to some extent. The segmentation can also be regarded as a form of semantic awareness, even beyond better semantic span segmentation. Intuitively, this indicates that our model evolves from BERT's subword-level representation to an intermediate word-level and finally a semantic representation.

¹¹Henceforth, we use the SemBERT* model from Table 2 as the strong and challenging baseline for ablation.

6.4 Influence of the Accuracy of SRL

Our model relies on a semantic role labeler, which influences the overall model performance. To investigate the influence of the labeler's accuracy, we degrade our labeler by randomly turning a specific proportion [0, 20%, 40%] of labels into random erroneous ones, simulating cascading errors. The resulting F1 scores on SQuAD are [87.93, 87.31, 87.24], respectively. This robustness can be attributed to the concatenation of the BERT hidden states and the SRL representation, in which the lower-dimensional SRL representation (even when noisy) does not intensely affect the former. This result indicates that the LM can not only benefit from a high-accuracy labeler but also remain robust against noisy labels.
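A minimal sketch of this degradation procedure (our own illustrative code, with a hypothetical label vocabulary passed in by the caller): each word-level label is independently replaced by a random wrong label with the given probability.

```python
import random

def corrupt_labels(labels, noise_rate, label_vocab):
    """Randomly replace a proportion of SRL labels with wrong ones (cascading errors)."""
    corrupted = []
    for lab in labels:
        if random.random() < noise_rate:   # e.g. 0.2 or 0.4
            corrupted.append(random.choice([l for l in label_vocab if l != lab]))
        else:
            corrupted.append(lab)
    return corrupted
```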
Besides the wide range of tasks verified in this work, SemBERT can also be easily adapted to other languages. As SRL is a fundamental NLP task, it is convenient to train a labeler for the major languages, as CoNLL 2009 provides SRL treebanks for 7 languages. For languages without available treebanks, unsupervised SRL methods can be applied. As for the out-of-domain issue, the datasets we work on (GLUE and SQuAD) cover quite diverse domains, and our experiments show that the method still works.

7 Conclusion

This paper proposes a novel semantics-aware BERT network architecture for fine-grained language representation. Experiments on a wide range of NLU tasks, including natural language inference, question answering, machine reading comprehension, semantic similarity and text classification, show its superiority over the strong baseline BERT. Our model surpasses all published works on all of the concerned NLU tasks. This work discloses the effectiveness of semantics-aware BERT for natural language understanding, demonstrating that explicit contextual semantics can be effectively integrated with state-of-the-art pre-trained language representations for even better performance. While most recent works focus on heuristically stacking complex mechanisms for performance improvement, we hope instead to shed some light on fusing accurate semantic signals for deeper comprehension and inference through a simple but effective method.

References

Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The Berkeley FrameNet project. In COLING.

Bentivogli, L.; Clark, P.; Dagan, I.; and Giampiccolo, D. 2009. The fifth PASCAL recognizing textual entailment challenge. In ACL-PASCAL.

9634
Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Chen, Z.; Zhang, H.; Zhang, X.; and Zhao, L. 2018. Quora question pairs.

Clark, K.; Khandelwal, U.; Levy, O.; and Manning, C. D. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dolan, W. B., and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP2005.

He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what's next. In ACL.

He, L.; Lee, K.; Levy, O.; and Zettlemoyer, L. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In ACL.

Hu, M.; Peng, Y.; Huang, Z.; Yang, N.; Zhou, M.; et al. 2018. Read + Verify: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1808.05759.

Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Kim, S.; Hong, J.-H.; Kang, I.; and Kwak, N. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Li, Z.; He, S.; Zhao, H.; Zhang, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2019. Dependency or span, end-to-end uniform semantic role labeling. In AAAI. arXiv preprint arXiv:1901.05280.

Liu, X.; He, P.; Chen, W.; and Gao, J. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Mudrakarta, P. K.; Taly, A.; Sundararajan, M.; and Dhamdhere, K. 2018. Did the model understand the question? In ACL.

Nangia, N.; Williams, A.; Lazaridou, A.; and Bowman, S. R. 2017. The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations. In RepEval.

Palmer, M.; Gildea, D.; and Kingsbury, P. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31(1):71-106.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT.

Pradhan, S.; Moschitti, A.; Xue, N.; Ng, H. T.; Björkelund, A.; Uryupina, O.; Zhang, Y.; and Zhong, Z. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. Technical report.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Sun, F.; Li, L.; Qiu, X.; and Liu, Y. 2018. U-Net: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 2018 EMNLP Workshop BlackboxNLP.

Warstadt, A.; Singh, A.; and Bowman, S. R. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2020a. Dual co-matching network for multi-choice reading comprehension. In AAAI. arXiv preprint arXiv:1901.09381.

Zhang, Z.; Li, J.; Zhu, P.; and Zhao, H. 2018. Modeling multi-turn conversation with deep utterance aggregation. In COLING. arXiv preprint arXiv:1806.09102.

Zhang, Z.; Wu, Y.; Li, Z.; and Zhao, H. 2019. Explicit contextual semantics for text comprehension. In PACLIC 33. arXiv preprint arXiv:1809.02794.

Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; and Zhao, H. 2020b. SG-Net: Syntax-guided machine reading comprehension. In AAAI. arXiv preprint arXiv:1908.05147.

Zhao, H.; Chen, W.; and Kit, C. 2009. Semantic dependency parsing of NomBank and PropBank: An efficient integrated approach via a large-scale feature selection. In EMNLP.

Zhao, H.; Zhang, X.; and Kit, C. 2013. Integrative semantic dependency parsing via efficient large-scale feature selection. Journal of Artificial Intelligence Research 46:203-233.

Zhou, J.; Zhang, Z.; and Zhao, H. 2019. LIMIT-BERT: Linguistic informed multi-task BERT. arXiv preprint arXiv:1910.14296.
