Semantics-Aware BERT for Language Understanding: Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li
fine-tuned BERT with explicit contextual semantic clues. The proposed SemBERT learns the representation in a fine-grained manner, taking advantage of both BERT's strength in plain context representation and explicit semantics for deeper meaning representation.

Our model consists of three components: 1) an off-the-shelf semantic role labeler to annotate the input sentences with a variety of semantic role labels; 2) a sequence encoder in which a pre-trained language model builds representations for the raw input texts while the semantic role labels are mapped to embeddings in parallel; 3) a semantic integration component that integrates the text representation with the contextual explicit semantic embedding to obtain the joint representation for downstream tasks.

The proposed SemBERT can be directly applied to typical NLU tasks. Our model is evaluated on 11 benchmark datasets involving natural language inference, question answering, semantic similarity and text classification. SemBERT obtains a new state-of-the-art on SNLI and also obtains significant gains on the GLUE benchmark and SQuAD 2.0. Ablation studies and analysis verify that the introduced explicit semantics is essential to the further performance improvement and that SemBERT effectively works as a unified semantics-enriched language representation model. (The code is publicly available at https://github.com/cooelf/SemBERT.)

2 Background and Related Work

2.1 Language Modeling for NLU

Natural language understanding tasks require a comprehensive understanding of natural languages and the ability to do further inference and reasoning. A common trend among NLU studies is that models are becoming more and more sophisticated, with stacked attention mechanisms or large amounts of training corpus (Zhang et al. 2018; 2020a; Zhou, Zhang, and Zhao 2019), resulting in explosive growth of computational cost. Notably, well pre-trained contextual language models such as ELMo (Peters et al. 2018), GPT (Radford et al. 2018) and BERT (Devlin et al. 2018) have proven powerful enough to boost NLU tasks to new levels of performance.

Distributed representations have been widely used as a standard part of NLP models due to their ability to capture the local co-occurrence of words from large-scale unlabeled text (Mikolov et al. 2013). However, these approaches for learning word vectors only involve a single, context-independent representation for each word, with little consideration of contextual encoding at the sentence level. Thus the recently introduced contextual language models, including ELMo, GPT, BERT and XLNet (Yang et al. 2019), fill the gap by strengthening contextual sentence modeling for better representation. Among them, BERT uses a different pre-training objective, the masked language model, which allows capturing both left and right context. In addition, BERT introduces a next sentence prediction task that jointly pre-trains text-pair representations. The latest evaluations show that BERT is powerful and convenient for downstream NLU tasks.

The major technical improvement of these newly proposed language models over traditional embeddings is that they focus on extracting context-sensitive features. When such contextual word embeddings are integrated with existing task-specific architectures, ELMo helps boost several major NLP benchmarks (Peters et al. 2018), including question answering on SQuAD, sentiment analysis (Socher et al. 2013) and named entity recognition (Sang and De Meulder 2003), while BERT is especially effective on language understanding tasks such as GLUE, MultiNLI and SQuAD (Devlin et al. 2018). In this work, we follow this line of extracting context-sensitive features and take pre-trained BERT as our backbone encoder for jointly learning explicit context semantics.

2.2 Explicit Contextual Semantics

Although distributed representations, including the latest advanced pre-trained contextual language models, have already been strengthened by semantics to some extent in a linguistic sense (Clark et al. 2019), we argue that such implicit semantics may not be enough to support a powerful contextual representation for NLU, according to our observation of the semantically incomplete answer spans generated by BERT on SQuAD. This motivates us to directly introduce explicit semantics.

There are a few formal semantic frames, including FrameNet (Baker, Fillmore, and Lowe 1998) and PropBank (Palmer, Gildea, and Kingsbury 2005), of which the latter is more popularly implemented in computational linguistics. Formal semantics generally presents the semantic relationship as a predicate-argument structure. For example, given the following sentence with the target verb (predicate) sold, all the arguments are labeled as follows:

[ARG0 Charlie] [V sold] [ARG1 a book] [ARG2 to Sherry] [AM-TMP last week].

Here, ARG0 represents the seller (agent), ARG1 represents the thing sold (theme), ARG2 represents the buyer (recipient), AM-TMP is an adjunct indicating the timing of the action, and V represents the predicate.

To parse the predicate-argument structure, we have an NLP task, semantic role labeling (SRL) (Zhao, Chen, and Kit 2009; Zhao, Zhang, and Kit 2013). Recently, end-to-end neural SRL models have been introduced (He et al. 2017; Li et al. 2019). These studies tackle argument identification and argument classification in one shot. He et al. (2017) presented a deep highway BiLSTM architecture with constrained decoding, which is simple and effective, leading us to select it as our basic semantic role labeler. Inspired by these recent advances, we can easily integrate SRL into NLU.
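To make the predicate-argument notation concrete, here is a minimal sketch (ours, not the paper's code; the BIO tag sequence is hand-written for illustration) that converts per-word BIO role tags from a typical SRL tagger into the bracketed notation used above.

```python
# Sketch: merge BIO-style SRL tags for one predicate into [LABEL span] segments.
def bio_to_brackets(words, tags):
    segments, current_label, current_span = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):          # a new argument span begins
            if current_label:
                segments.append(f"[{current_label} {' '.join(current_span)}]")
            current_label, current_span = tag[2:], [word]
        elif tag.startswith("I-") and current_label:   # the span continues
            current_span.append(word)
        else:                             # O tag: outside any argument
            if current_label:
                segments.append(f"[{current_label} {' '.join(current_span)}]")
                current_label, current_span = None, []
            segments.append(word)
    if current_label:
        segments.append(f"[{current_label} {' '.join(current_span)}]")
    return " ".join(segments)

words = ["Charlie", "sold", "a", "book", "to", "Sherry", "last", "week"]
tags = ["B-ARG0", "B-V", "B-ARG1", "I-ARG1",
        "B-ARG2", "I-ARG2", "B-AM-TMP", "I-AM-TMP"]
print(bio_to_brackets(words, tags))
# [ARG0 Charlie] [V sold] [ARG1 a book] [ARG2 to Sherry] [AM-TMP last week]
```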
3 Semantics-aware BERT

Figure 1 overviews our semantics-aware BERT framework. We omit the rather extensive formulations of BERT and refer readers to Devlin et al. (2018) for details. SemBERT is designed to be capable of handling multiple sequence inputs. In SemBERT, words in the input sequence are passed to a semantic role labeler to fetch multiple predicate-argument structures of explicit semantics.
[Figure 1: Semantics-aware BERT. * denotes the pre-trained labeler, which is not fine-tuned in our framework. The figure illustrates the pipeline on the input text {reconstructing dormitories will not be approved by cavanaugh}: the text is tokenized to the subword-level sequence {rec, ##ons, ##tructing, dorm, ##itor, ##ies, will, not, be, approved, by, ca, ##vana, ##ugh}, while two word-level semantic structures are produced in parallel: [ARG1: reconstructing dormitories] [ARGM-MOD: will] [ARGM-NEG: not] be [V: approved] [ARG0: by cavanaugh], and [V: reconstructing] [ARG1: dormitories] will not be approved by cavanaugh. Subword embeddings are pooled to the word level, the semantic role labels (various aspects) are mapped to embeddings via a lookup table, and both are merged in the semantics integration component.]
[Figure: input representation flow. The raw input "reconstructing dormitories will not be approved by cavanaugh" is tokenized by BERT into the subword sequence "rec ##ons ##tructing dorm ##itor ##ies will not be approved by ca ##vana ##ugh", which is then re-assembled into word-level embeddings.]
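To see the length mismatch this creates, the following sketch tokenizes the example sentence with a standard BERT wordpiece tokenizer; we use the Hugging Face transformers package as a stand-in for the pytorch-pretrained-BERT implementation the paper builds on.

```python
# Sketch: BERT wordpiece tokenization splits words into subwords, while the
# SRL labels above are assigned per word, so the two sequences differ in length.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "reconstructing dormitories will not be approved by cavanaugh"
print(tokenizer.tokenize(text))
# Per the paper's Figure 1:
# ['rec', '##ons', '##tructing', 'dorm', '##itor', '##ies', 'will',
#  'not', 'be', 'approved', 'by', 'ca', '##vana', '##ugh']
```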
The m label embeddings associated with each token are concatenated and fed to a fully connected layer to obtain the refined joint label representation $e^t$ in dimension $d$:

$$e'(L_i) = W_2\,[e(t_1), e(t_2), \dots, e(t_m)] + b_2, \qquad e^t = \{e'(L_1), \dots, e'(L_n)\}, \quad (1)$$

where $W_2$ and $b_2$ are trainable parameters.
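A minimal PyTorch sketch of Eq. (1), assuming an embedding lookup table for the role labels; the output dimension d and the batch shapes are illustrative, not taken from the released code.

```python
# Sketch of Eq. (1): m role label embeddings per token are concatenated and
# projected to a joint label representation of dimension d.
import torch
import torch.nn as nn

num_labels, label_dim, m = 104, 10, 3   # values reported in the paper
d = 30                                  # illustrative; d is not fixed in the text

label_embed = nn.Embedding(num_labels, label_dim)
project = nn.Linear(m * label_dim, d)   # W_2 and b_2 in Eq. (1)

label_ids = torch.randint(num_labels, (2, 8, m))   # (batch, n words, m labels)
e_t = project(label_embed(label_ids).flatten(2))   # (batch, n, d)
print(e_t.shape)                                   # torch.Size([2, 8, 30])
```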
3.3 Integration

This integration module fuses the lexical text embedding and the label representations. Since the original pre-trained BERT operates on a sequence of subwords while our introduced semantic labels are assigned to words, we need to align these differently sized sequences. We therefore group the subwords of each word and apply a convolutional neural network (CNN) with max pooling to obtain word-level representations. We select a CNN because of its speed, and because our preliminary experiments show that it also gives better results than RNNs on the tasks concerned here, where we expect the local features captured by the CNN to be beneficial for subword-derived LM modeling.

Take one word as an example. Suppose that word $x_i$ is made up of a sequence of subwords $[s_1, s_2, \dots, s_l]$, where $l$ is the number of subwords for word $x_i$. Denoting the representation of subword $s_j$ from BERT as $e(s_j)$, we first apply a Conv1D layer, $e'_i = W_1[e(s_i), e(s_{i+1}), \dots, e(s_{i+k-1})] + b_1$, where $W_1$ and $b_1$ are trainable parameters and $k$ is the kernel size. We then apply ReLU and max pooling to the output embedding sequence for $x_i$:

$$e^*_i = \mathrm{ReLU}(e'_i), \qquad e(x_i) = \mathrm{MaxPooling}(e^*_1, \dots, e^*_{l-k+1}). \quad (2)$$

Therefore, the whole representation for the word sequence $X$ is $e^w = \{e(x_1), \dots, e(x_n)\} \in \mathbb{R}^{n \times d_w}$, where $d_w$ denotes the dimension of the word embedding.

The aligned context and distilled semantic embeddings are then merged by a fusion function $h = e^w \diamond e^t$, where $\diamond$ represents the concatenation operation. (We also tried summation, multiplication and attention mechanisms, but our experiments show that concatenation performs best.)
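A sketch of this integration step under our reading of Eq. (2): a Conv1D over the subword axis, ReLU, max pooling to one vector per word, then concatenation with the label representation. The hidden sizes and the padding choice are illustrative assumptions.

```python
# Sketch (ours) of the Section 3.3 integration: Conv1D + ReLU + max pooling
# over a word's subwords, then fusion h = e^w <> e^t by concatenation.
import torch
import torch.nn as nn

class SubwordToWordPooling(nn.Module):
    def __init__(self, hidden=768, kernel_size=3):
        super().__init__()
        # Conv1D over the subword axis: W_1, b_1 in the text. Padding keeps
        # words shorter than the kernel valid (Eq. 2 assumes l >= k).
        self.conv = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size - 1)

    def forward(self, subword_embs):
        # subword_embs: (l, hidden) -- the l subwords of one word
        x = self.conv(subword_embs.t().unsqueeze(0))   # (1, hidden, l')
        x = torch.relu(x)                              # e*_i = ReLU(e'_i)
        return x.max(dim=-1).values.squeeze(0)         # max pooling -> (hidden,)

pool = SubwordToWordPooling()
word_pieces = torch.randn(3, 768)   # e.g. rec / ##ons / ##tructing
e_x = pool(word_pieces)             # word-level embedding e(x_i)
e_t = torch.randn(30)               # joint label representation (illustrative d)
h = torch.cat([e_x, e_t], dim=-1)   # fusion by concatenation
print(h.shape)                      # torch.Size([798])
```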
4 Model Implementation

We now introduce the specific implementation of SemBERT. SemBERT can serve as a front-end encoder for a wide range of tasks, and can also become an end-to-end model with only a linear layer added for prediction. For simplicity, we only report the straightforward SemBERT that directly gives predictions after fine-tuning. (We use a single model for each task, without joint training or parameter sharing.)

4.1 Semantic Role Labeler

To obtain the semantic labels, we use a pre-trained SRL module to predict all predicates and corresponding arguments in one shot. We implement the semantic role labeler from Peters et al. (2018), achieving an F1 of 84.6% on the English OntoNotes v5.0 benchmark dataset (Pradhan et al. 2013) for the CoNLL-2012 shared task; this result nearly reaches the state of the art in He et al. (2018). At test time, we perform Viterbi decoding to enforce valid spans using BIO constraints. In our implementation, there are 104 labels in total; we use O for non-argument words and Verb for predicates.
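In practice such a labeler can be obtained off the shelf. The sketch below queries an AllenNLP-style pretrained SRL predictor; the model archive path is a public AllenNLP model used here as a stand-in, not necessarily the exact labeler trained for the paper.

```python
# Sketch: fetching predicate-argument structures with an AllenNLP SRL
# predictor. The archive URL is illustrative (a public AllenNLP model),
# not the paper's own checkpoint.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "structured-prediction-srl-bert.2020.12.15.tar.gz"
)
out = predictor.predict(sentence="Charlie sold a book to Sherry last week.")
for verb in out["verbs"]:    # one entry per detected predicate
    print(verb["description"])
# e.g. [ARG0: Charlie] [V: sold] [ARG1: a book] [ARG2: to Sherry] [ARGM-TMP: last week] .
```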
4.2 Task-specific Fine-tuning

In Section 3, we described how to obtain the semantics-aware BERT representations. Here, we show how to adapt SemBERT to classification, regression and span-based MRC tasks. We transform the fused contextual semantic and LM representations h to a lower dimension and obtain the prediction distributions. Note that this part is basically the same as the implementation in BERT, without any modification, to avoid extra influence and to focus on the intrinsic performance of SemBERT; we outline it here for completeness.

For classification and regression tasks, h is directly passed to a fully connected layer to get the class logits or score, respectively. The training objectives are cross-entropy loss for classification tasks and mean squared error loss for regression tasks.
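A minimal sketch of these two heads, with illustrative dimensions; the fused representation h is assumed to come from the integration module above.

```python
# Sketch (ours): a single fully connected layer per task head, cross-entropy
# for classification and MSE for regression. Dimensions are illustrative.
import torch
import torch.nn as nn

h = torch.randn(16, 798)          # fused [e^w ; e^t] for a batch of 16
classifier = nn.Linear(798, 3)    # e.g. entailment / neutral / contradiction
logits = classifier(h)
loss_cls = nn.CrossEntropyLoss()(logits, torch.randint(3, (16,)))

regressor = nn.Linear(798, 1)     # e.g. STS-B similarity score
score = regressor(h).squeeze(-1)
loss_reg = nn.MSELoss()(score, torch.rand(16))
```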
For span-based reading comprehension, h is passed to a fully connected layer to get the start logits $s$ and end logits $e$ of all tokens. The score of a candidate span from position $i$ to position $j$ is defined as $s_i + e_j$, and the maximum-scoring span with $j \geq i$ is used as a prediction. (All the candidate scores are normalized by softmax.) For prediction, we compare the score of the pooled first-token span, $s_{null} = s_0 + e_0$, to the score of the best non-null span, $\hat{s}_{i,j} = \max_{j \geq i}(s_i + e_j)$. We predict a non-null answer when $\hat{s}_{i,j} > s_{null} + \tau$, where the threshold $\tau$ is selected on the dev set to maximize F1.
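The decoding rule can be written compactly. The following sketch (ours) implements the span selection and null-answer threshold described above; tensor shapes are illustrative.

```python
# Sketch: best non-null span vs. the pooled first-token (null) span,
# with a tuned threshold tau, as in SQuAD 2.0-style prediction.
import torch

def predict_span(start_logits, end_logits, tau=0.0):
    """Return (i, j) for the best span, or None for 'no answer'."""
    n = start_logits.size(0)
    # scores[i, j] = s_i + e_j, restricted to j >= i
    scores = start_logits.unsqueeze(1) + end_logits.unsqueeze(0)
    scores = scores.masked_fill(
        torch.tril(torch.ones(n, n, dtype=torch.bool), diagonal=-1),
        float("-inf"))
    s_null = start_logits[0] + end_logits[0]   # pooled first-token span
    best = scores[1:, 1:].max()                # best non-null span score
    if best <= s_null + tau:
        return None                            # abstain: no supported answer
    flat = scores[1:, 1:].argmax()
    i, j = divmod(flat.item(), n - 1)
    return i + 1, j + 1                        # shift back past position 0

start, end = torch.randn(12), torch.randn(12)
print(predict_span(start, end, tau=0.5))
```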
Method               CoLA   SST-2  MNLI       QNLI   RTE    MRPC   QQP    STS-B  Score
                     (mc)   (acc)  m/mm (acc) (acc)  (acc)  (F1)   (F1)   (pc)
Leaderboard (September, 2019)
ALBERT               69.1   97.1   91.3/91.0  99.2   89.2   93.4   74.2   92.5   89.4
RoBERTa              67.8   96.7   90.8/90.2  98.9   88.2   92.1   90.2   92.2   88.5
XLNet                67.8   96.8   90.2/89.8  98.6   86.3   93.0   90.3   91.6   88.4
In literature (April, 2019)
BiLSTM+ELMo+Attn     36.0   90.4   76.4/76.1  79.9   56.8   84.9   64.8   75.1   70.5
GPT                  45.4   91.3   82.1/81.4  88.1   56.0   82.3   70.3   82.0   72.8
GPT on STILTs        47.2   93.1   80.8/80.6  87.2   69.1   87.7   70.1   85.3   76.9
MT-DNN               61.5   95.6   86.7/86.0  -      75.5   90.0   72.4   88.3   82.2
BERT-Base            52.1   93.5   84.6/83.4  -      66.4   88.9   71.2   87.1   78.3
BERT-Large           60.5   94.9   86.7/85.9  92.7   70.1   89.3   72.1   87.6   80.5
Our implementation
SemBERT-Base         57.8   93.5   84.4/84.0  90.9   69.3   88.2   71.8   87.3   80.9
SemBERT-Large        62.3   94.6   87.6/86.3  94.6   84.5   91.2   72.8   87.8   82.9

Table 1: Results on the GLUE benchmark. CoLA and SST-2 are classification tasks; MNLI, QNLI and RTE are natural language inference; MRPC, QQP and STS-B are semantic similarity (mc = Matthews correlation, acc = accuracy, pc = Pearson correlation). The block "In literature" shows the comparable results from (Liu et al. 2019; Radford et al. 2018) at the time of submitting SemBERT to GLUE (April, 2019).
5 Experiments

5.1 Setup

Our implementation is based on the PyTorch implementation of BERT (https://github.com/huggingface/pytorch-pretrained-BERT). We use the pre-trained weights of BERT and follow the same fine-tuning procedure as BERT without any modification, and all layers are tuned with only a moderate increase in model size, as the extra SRL embedding volume is less than 15% of the original encoder size. We select the initial learning rate from {8e-6, 1e-5, 2e-5, 3e-5} with a warm-up rate of 0.1 and L2 weight decay of 0.01. The batch size is selected from {16, 24, 32}. The maximum number of epochs is set in [2, 5] depending on the task. Texts are tokenized into wordpieces, with a maximum length of 384 for SQuAD and 200 for the other tasks. The dimension of the SRL embedding is set to 10, and the default maximum number of predicate-argument structures m is set to 3. Hyper-parameters were selected using the dev set.
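For reference, the search space above can be collected into a single configuration sketch (field names are ours, not from the released code):

```python
# The Section 5.1 hyper-parameter search space as a config sketch.
search_space = {
    "learning_rate": [8e-6, 1e-5, 2e-5, 3e-5],
    "warmup_rate": 0.1,
    "weight_decay": 0.01,                         # L2
    "batch_size": [16, 24, 32],
    "num_epochs": range(2, 6),                    # 2-5, task-dependent
    "max_seq_length": {"squad": 384, "default": 200},
    "srl_embedding_dim": 10,
    "max_structures_m": 3,
}
```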
5.2 Tasks and Datasets

Our evaluation is performed on ten NLU benchmark datasets involving natural language inference, machine reading comprehension, semantic similarity and text classification. Some of these tasks come from the recently released GLUE benchmark (Wang et al. 2018), a collection of nine NLU tasks. We also extend our experiments to two widely used tasks, SNLI (Bowman et al. 2015) and SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), to show the superiority of our approach.

Reading Comprehension. As a widely used MRC benchmark dataset, SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018) combines the 100,000 questions of SQuAD 1.1 (Rajpurkar et al. 2016) with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. For SQuAD 2.0, systems must not only answer questions when possible, but also abstain from answering when no answer is supported by the paragraph.

Natural Language Inference. Natural language inference involves reading a pair of sentences and judging the relationship between their meanings, such as entailment, neutral or contradiction. We evaluate on four diverse datasets: Stanford Natural Language Inference (SNLI) (Bowman et al. 2015), Multi-Genre Natural Language Inference (MNLI) (Nangia et al. 2017), Question Natural Language Inference (QNLI) (Rajpurkar et al. 2016) and Recognizing Textual Entailment (RTE) (Bentivogli et al. 2009).

Semantic Similarity. Semantic similarity tasks aim to predict whether two sentences are semantically equivalent. The challenge lies in recognizing rephrased concepts, understanding negation, and handling syntactic ambiguity. Three datasets are used: the Microsoft Paraphrase Corpus (MRPC) (Dolan and Brockett 2005), the Quora Question Pairs (QQP) dataset (Chen et al. 2018) and the Semantic Textual Similarity benchmark (STS-B) (Cer et al. 2017).

Classification. The Corpus of Linguistic Acceptability (CoLA) (Warstadt, Singh, and Bowman 2018) is used to predict whether an English sentence is linguistically acceptable. The Stanford Sentiment Treebank (SST-2) (Socher et al. 2013) provides a dataset for sentiment classification, where the task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.
Model                                     EM     F1
#1 BERT + DAE + AoA†                      85.9   88.6
#2 SG-Net†                                85.2   87.9
#3 BERT + NGM + SST†                      85.2   87.7
U-Net (Sun et al. 2018)                   69.2   72.6
RMR + ELMo + Verifier (Hu et al. 2018)    71.7   74.2
Our implementation
BERT-Large                                80.5   83.6
SemBERT-Large                             82.4   85.2
SemBERT*-Large                            84.8   87.9

Table 2: Exact Match (EM) and F1 scores on the SQuAD 2.0 test set for single models. † denotes the top 3 single submissions from the leaderboard at the time of submitting SemBERT (11 April, 2019). Most of the top results from the SQuAD leaderboard do not have public model descriptions available, and any public data may be used for system training. We therefore further adopt synthetic self-training (https://nlp.stanford.edu/seminar/details/jdevlin.pdf) for data augmentation, denoted as SemBERT*-Large.

Model                       Dev    Test
In literature
DRCN (Kim et al. 2018)      -      90.1
SJRC (Zhang et al. 2019)    -      91.3
MT-DNN (Liu et al. 2019)†   92.2   91.6
Our implementation
BERT-Base                   90.8   90.7
BERT-Large                  91.3   91.1
SemBERT-Base                91.2   91.0
SemBERT-Large               92.3   91.6

Table 3: Accuracy on the SNLI dataset. The previous state-of-the-art result is marked by †. Both our SemBERT and BERT are single models, fine-tuned from the pre-trained models.

Model             Params (M)   Shared (M)   Rate
MT-DNN            3,060        340          9.1
BERT on STILTs    335          -            1.0
BERT              335          -            1.0
SemBERT           340          -            1.0

Table 4: Parameter comparison of LARGE models. The numbers are from the GLUE leaderboard (https://gluebenchmark.com/leaderboard).
5.3 Results

Table 1 shows results on the GLUE benchmark datasets: SemBERT gives substantial gains over BERT and outperforms all previous state-of-the-art models in the literature. (We find that the MNLI model can be effectively transferred to the RTE and MRPC datasets, so the models for RTE and MRPC are fine-tuned from our MNLI model.) Since SemBERT takes BERT as its backbone with the same evaluation procedure, the gain is entirely owing to the newly introduced explicit contextual semantics. Though recent dominant models take advantage of multi-tasking, knowledge distillation, transfer learning or ensembling, our single model is lightweight and competitive, and even yields better results with a simple design and fewer parameters. A model parameter comparison is shown in Table 4. We observe that even without multi-task learning like MT-DNN, our model still achieves remarkable results. (Since MT-DNN is a multi-task learning framework with shared parameters on 9 task-specific layers, we count the 340M shared parameters nine times for a fair comparison.)

In particular, we observe substantial improvements on small datasets such as RTE, MRPC and CoLA, which demonstrates that involving explicit semantics helps the model work better with small training data; this is important for most NLP tasks, where large-scale annotated data is unavailable.

Table 2 shows the results for reading comprehension on the SQuAD 2.0 test set. (Because of the restriction on submission frequency for online SQuAD 2.0 evaluation, we do not submit our base models.) SemBERT substantially boosts the strong BERT baseline on both EM and F1. It also outperforms all published works and achieves performance comparable to a few unpublished models from the leaderboard.

Table 3 shows that SemBERT also achieves a new state-of-the-art on the SNLI benchmark and even outperforms all the ensemble models listed at https://nlp.stanford.edu/projects/snli/ by a large margin. (As ensemble models are commonly composed of multiple heterogeneous models and resources, we exclude them from our table to save space.)

6 Analysis

6.1 Ablation Study

To evaluate the contributions of the key factors in our method, we perform an ablation study on the SNLI and SQuAD 2.0 dev sets, as shown in Table 6. Since SemBERT absorbs contextual semantics in a deep processing way, we wonder whether a simpler, more straightforward way of integrating such semantic information might still work; thus we concatenate the SRL embedding with the BERT subword embeddings for a direct comparison, where each semantic role label is copied to every subword of the original word, without CNN and pooling for word-level alignment. From the results, we observe that this concatenation yields an improvement, verifying that integrating contextual semantics is quite useful for language understanding. However, SemBERT still outperforms the simple BERT+SRL model, just as the latter outperforms the original BERT by a large margin, which shows that SemBERT integrates the plain contextual representation and the contextual semantics more effectively.
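The ablated BERT+SRL baseline reduces the alignment step to a simple broadcast, roughly as in this sketch (ours):

```python
# Sketch of the BERT+SRL ablation baseline: each word-level role label is
# copied to all of that word's subwords, skipping the CNN/pooling alignment.
def broadcast_labels_to_subwords(word_labels, subword_counts):
    """['B-ARG1', 'I-ARG1'], [3, 3] -> one label per subword."""
    return [label
            for label, count in zip(word_labels, subword_counts)
            for _ in range(count)]

print(broadcast_labels_to_subwords(["B-ARG1", "I-ARG1"], [3, 3]))
# ['B-ARG1', 'B-ARG1', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1']
```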
6.2 The influence of the number m

We investigate the influence of the maximum number of predicate-argument structures m by setting it from 1 to 5. Table 7 shows the result: a modest value of m works best.
Question                                                         Baseline       SemBERT
What is a very seldom used unit of mass in the metric system?    The ki         metric slug
What is the lone MLS team that belongs to southern California?   Galaxy         LA Galaxy
How many people does the Greater Los Angeles Area have?          17.5 million   over 17.5 million

Table 5: Comparison of answers from the baseline and our model. In these examples, the answers from SemBERT are the same as the ground truth.
References

Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The Berkeley FrameNet project. In COLING-ACL.

Bentivogli, L.; Clark, P.; Dagan, I.; and Giampiccolo, D. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Chen, Z.; Zhang, H.; Zhang, X.; and Zhao, L. 2018. Quora question pairs.

Clark, K.; Khandelwal, U.; Levy, O.; and Manning, C. D. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dolan, W. B., and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP2005.

He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what's next. In ACL.

He, L.; Lee, K.; Levy, O.; and Zettlemoyer, L. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In ACL.

Hu, M.; Peng, Y.; Huang, Z.; Yang, N.; Zhou, M.; et al. 2018. Read + Verify: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1808.05759.

Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Kim, S.; Hong, J.-H.; Kang, I.; and Kwak, N. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Li, Z.; He, S.; Zhao, H.; Zhang, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2019. Dependency or span, end-to-end uniform semantic role labeling. In AAAI. arXiv preprint arXiv:1901.05280.

Liu, X.; He, P.; Chen, W.; and Gao, J. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Mudrakarta, P. K.; Taly, A.; Sundararajan, M.; and Dhamdhere, K. 2018. Did the model understand the question? In ACL.

Nangia, N.; Williams, A.; Lazaridou, A.; and Bowman, S. R. 2017. The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations. In RepEval.

Palmer, M.; Gildea, D.; and Kingsbury, P. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31(1):71-106.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT.

Pradhan, S.; Moschitti, A.; Xue, N.; Ng, H. T.; Björkelund, A.; Uryupina, O.; Zhang, Y.; and Zhong, Z. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. Technical report.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Sun, F.; Li, L.; Qiu, X.; and Liu, Y. 2018. U-Net: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 2018 EMNLP Workshop BlackboxNLP.

Warstadt, A.; Singh, A.; and Bowman, S. R. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Zhang, Z.; Li, J.; Zhu, P.; and Zhao, H. 2018. Modeling multi-turn conversation with deep utterance aggregation. In COLING. arXiv preprint arXiv:1806.09102.

Zhang, Z.; Wu, Y.; Li, Z.; and Zhao, H. 2019. Explicit contextual semantics for text comprehension. In PACLIC 33. arXiv preprint arXiv:1809.02794.

Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2020a. Dual co-matching network for multi-choice reading comprehension. In AAAI. arXiv preprint arXiv:1901.09381.

Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; and Zhao, H. 2020b. SG-Net: Syntax-guided machine reading comprehension. In AAAI. arXiv preprint arXiv:1908.05147.

Zhao, H.; Chen, W.; and Kit, C. 2009. Semantic dependency parsing of NomBank and PropBank: An efficient integrated approach via a large-scale feature selection. In EMNLP.

Zhao, H.; Zhang, X.; and Kit, C. 2013. Integrative semantic dependency parsing via efficient large-scale feature selection. Journal of Artificial Intelligence Research 46:203-233.

Zhou, J.; Zhang, Z.; and Zhao, H. 2019. LIMIT-BERT: Linguistic informed multi-task BERT. arXiv preprint arXiv:1910.14296.