
Automated Concatenation of Embeddings for Structured Prediction

Xinyu Wang‡, Yong Jiang†∗, Nguyen Bach†, Tao Wang†, Zhongqiang Huang†, Fei Huang†, Kewei Tu∗

School of Information Science and Technology, ShanghaiTech University
Shanghai Engineering Research Center of Intelligent Vision and Imaging
Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
DAMO Academy, Alibaba Group

{wangxy1,tukw}@shanghaitech.edu.cn, [email protected]
{nguyen.bach,leeo.wangt,z.huang,f.huang}@alibaba-inc.com

arXiv:2010.05006v4 [cs.CL] 1 Jun 2021

Abstract

Pretrained contextualized embeddings are powerful word representations for structured prediction tasks. Recent work found that better word representations can be obtained by concatenating different types of embeddings. However, the selection of embeddings to form the best concatenated representation usually varies depending on the task and the collection of candidate embeddings, and the ever-increasing number of embedding types makes it a more difficult problem. In this paper, we propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks, based on a formulation inspired by recent progress on neural architecture search. Specifically, a controller alternately samples a concatenation of embeddings, according to its current belief of the effectiveness of individual embedding types in consideration for a task, and updates the belief based on a reward. We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model, which is fed with the sampled concatenation as input and trained on a task dataset. Empirical results on 6 tasks and 21 datasets show that our approach outperforms strong baselines and achieves state-of-the-art performance with fine-tuned embeddings in all the evaluations.¹

1 Introduction

Recent developments on pretrained contextualized embeddings have significantly improved the performance of structured prediction tasks in natural language processing. Approaches based on contextualized embeddings, such as ELMo (Peters et al., 2018), Flair (Akbik et al., 2018), BERT (Devlin et al., 2019), and XLM-R (Conneau et al., 2020), have been consistently raising the state of the art for various structured prediction tasks. Concurrently, research has also shown that word representations based on the concatenation of multiple pretrained contextualized embeddings and traditional non-contextualized embeddings (such as word2vec (Mikolov et al., 2013) and character embeddings (Santos and Zadrozny, 2014)) can further improve performance (Peters et al., 2018; Akbik et al., 2018; Straková et al., 2019; Wang et al., 2020b). Given the ever-increasing number of embedding learning methods that operate on different granularities (e.g., word, subword, or character level) and with different model architectures, choosing the best embeddings to concatenate for a specific task becomes non-trivial, and exploring all possible concatenations can be prohibitively demanding in computing resources.

Neural architecture search (NAS) is an active area of research in deep learning that automatically searches for better model architectures, and has achieved state-of-the-art performance on various tasks in computer vision, such as image classification (Real et al., 2019), semantic segmentation (Liu et al., 2019a), and object detection (Ghiasi et al., 2019). In natural language processing, NAS has been successfully applied to find better RNN structures (Zoph and Le, 2017; Pham et al., 2018b) and, recently, better transformer structures (So et al., 2019; Zhu et al., 2020). In this paper, we propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks. ACE is formulated as an NAS problem.

∗ Yong Jiang and Kewei Tu are the corresponding authors.
‡ This work was conducted when Xinyu Wang was interning at Alibaba DAMO Academy.
¹ Our code is publicly available at https://github.com/Alibaba-NLP/ACE.
In this approach, an iterative search process is guided by a controller based on its belief that models the effectiveness of individual embedding candidates in consideration for a specific task. At each step, the controller samples a concatenation of embeddings according to the belief model and then feeds the concatenated word representations as inputs to a task model, which in turn is trained on the task dataset and returns the model accuracy as a reward signal to update the belief model. We use the policy gradient algorithm (Williams, 1992) in reinforcement learning (Sutton and Barto, 1992) to solve the optimization problem. In order to improve the efficiency of the search process, we also design a special reward function by accumulating all the rewards based on the transformation between the current concatenation and all previously sampled concatenations.

Our approach is different from previous work on NAS in the following aspects:

1. Unlike most previous work, we focus on searching for better word representations rather than better model architectures.

2. We design a novel search space for the embedding concatenation search. Instead of using an RNN as in the previous work of Zoph and Le (2017), we design a more straightforward controller to generate the embedding concatenation. We design a novel reward function in the optimization objective to better evaluate the effectiveness of each concatenated embedding.

3. ACE achieves high accuracy without the need for retraining the task model, which is typically required in other NAS approaches.

4. Our approach is efficient and practical. Although ACE is formulated in a NAS framework, ACE can find a strong word representation on a single GPU with only a few GPU-hours for structured prediction tasks. In comparison, many NAS approaches require dozens or even thousands of GPU-hours to search for good neural architectures for their corresponding tasks.

Empirical results show that ACE outperforms strong baselines. Furthermore, when ACE is applied to concatenate pretrained contextualized embeddings fine-tuned on specific tasks, we achieve state-of-the-art accuracy on 6 structured prediction tasks, including named entity recognition (Sundheim, 1995), part-of-speech tagging (DeRose, 1988), chunking (Tjong Kim Sang and Buchholz, 2000), aspect extraction (Hu and Liu, 2004), syntactic dependency parsing (Tesnière, 1959) and semantic dependency parsing (Oepen et al., 2014), over 21 datasets. Besides, we also analyze the advantage of ACE and of our reward function design over the baselines, and show the advantage of ACE over ensemble models.

2 Related Work

2.1 Embeddings

Non-contextualized embeddings, such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017), help many NLP tasks. Character embeddings (Santos and Zadrozny, 2014) are trained together with the task and applied in many structured prediction tasks (Ma and Hovy, 2016; Lample et al., 2016; Dozat and Manning, 2018). For pretrained contextualized embeddings, ELMo (Peters et al., 2018), a pretrained contextualized word embedding generated with multiple bidirectional LSTM layers, significantly outperforms previous state-of-the-art approaches on several NLP tasks. Following this idea, Akbik et al. (2018) proposed Flair embeddings, a kind of contextualized character embedding that achieved strong performance in sequence labeling tasks. Recently, Devlin et al. (2019) proposed BERT, which encodes contextualized sub-word information with Transformers (Vaswani et al., 2017) and significantly improves performance on many NLP tasks. Much research, such as RoBERTa (Liu et al., 2019c), has focused on improving BERT's performance through stronger masking strategies. Moreover, multilingual contextualized embeddings have become popular. Pires et al. (2019) and Wu and Dredze (2019) showed that Multilingual BERT (M-BERT) can learn a good multilingual representation with strong cross-lingual zero-shot transfer performance in various tasks. Conneau et al. (2020) proposed XLM-R, which is trained on a larger multilingual corpus and significantly outperforms M-BERT on various multilingual tasks.

2.2 Neural Architecture Search

Recent progress on deep learning has shown that network architecture design is crucial to model performance. However, designing a strong neural architecture for each task requires enormous effort, a high level of expertise, and experience in the task domain. Therefore, the automatic design of neural architectures is desired.
A crucial part of NAS is search space design, which defines the discoverable NAS space. Previous work (Baker et al., 2017; Zoph and Le, 2017; Xie and Yuille, 2017) designs a global search space (Elsken et al., 2019) which incorporates structures from hand-crafted architectures. For example, Zoph and Le (2017) designed a chain-structured search space with skip connections. The global search space usually has a considerable degree of freedom. For example, the approach of Zoph and Le (2017) takes 22,400 GPU-hours to search on the CIFAR-10 dataset. Based on the observation that existing hand-crafted architectures contain repeated structures (Szegedy et al., 2016; He et al., 2016; Huang et al., 2017), Zoph et al. (2018) explored a cell-based search space which can reduce the search time to 2,000 GPU-hours.

In recent NAS research, reinforcement learning and evolutionary algorithms are the most common approaches. In reinforcement learning, the agent's actions are the generation of neural architectures and the action space is identical to the search space. Previous work usually applies an RNN layer (Zoph and Le, 2017; Zhong et al., 2018; Zoph et al., 2018) or a Markov decision process (Baker et al., 2017) to decide the hyper-parameters and the input order of each structure. Evolutionary algorithms have been applied to architecture search for many decades (Miller et al., 1989; Angeline et al., 1994; Stanley and Miikkulainen, 2002; Floreano et al., 2008; Jozefowicz et al., 2015). The algorithm repeatedly generates new populations through recombination and mutation operations and selects survivors through competition among the population. Recent work with evolutionary algorithms differs in the methods for parent/survivor selection and population generation. For example, Real et al. (2017), Liu et al. (2018a), Wistuba (2018) and Real et al. (2019) applied tournament selection (Goldberg and Deb, 1991) for parent selection, while Xie and Yuille (2017) keep all parents. Suganuma et al. (2017) and Elsken et al. (2018) chose the best model as the survivor, while Real et al. (2019) chose several latest models as survivors.

3 Automated Concatenation of Embeddings

In ACE, a task model and a controller interact with each other repeatedly. The task model predicts the task output, while the controller searches for a better embedding concatenation as the word representation for the task model to achieve higher accuracy. Given an embedding concatenation generated by the controller, the task model is trained over the task data and returns a reward to the controller. The controller receives the reward to update its parameters and samples a new embedding concatenation for the task model. Figure 1 shows the general architecture of our approach.

3.1 Task Model

For the task model, we focus on sequence-structured and graph-structured outputs. Given a structured prediction task with input sentence x and structured output y, we can calculate the probability distribution P(y|x) by:

$$P(y \mid x) = \frac{\exp(\mathrm{Score}(x, y))}{\sum_{y' \in \mathcal{Y}(x)} \exp(\mathrm{Score}(x, y'))}$$

where Y(x) represents all possible output structures given the input sentence x. Depending on the structured prediction task, the output structure y can be a label sequence, a tree, a graph, or another structure. In this paper, we use sequence-structured and graph-structured outputs as two exemplar structured prediction tasks. We use a BiLSTM-CRF model (Ma and Hovy, 2016; Lample et al., 2016) for sequence-structured outputs and a BiLSTM-Biaffine model (Dozat and Manning, 2017) for graph-structured outputs:

$$P^{\mathrm{seq}}(y \mid x) = \text{BiLSTM-CRF}(V, y)$$
$$P^{\mathrm{graph}}(y \mid x) = \text{BiLSTM-Biaffine}(V, y)$$

where V = [v_1; · · · ; v_n], V ∈ R^{d×n}, is a matrix of the word representations for the input sentence x with n words, and d is the hidden size of the concatenation of all embeddings. The word representation v_i of the i-th word is a concatenation of L types of word embeddings:

$$v_i^l = \mathrm{embed}_i^l(x); \quad v_i = [v_i^1; v_i^2; \dots; v_i^L]$$

where embed^l is the model of the l-th embedding type, v_i ∈ R^d, v_i^l ∈ R^{d^l}, and d^l is the hidden size of embed^l.
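To make the setup concrete, the following is a minimal sketch of how such a concatenated word representation can be assembled. The embedder functions and their dimensions here are illustrative stand-ins, not the paper's actual models or sizes:

```python
import numpy as np

# Hypothetical embedders: each maps a sentence to an (n_words, d_l) array.
# In ACE these would be models such as ELMo, Flair, BERT, or character embeddings.
def embed_glove(sentence): return np.random.rand(len(sentence), 100)
def embed_char(sentence):  return np.random.rand(len(sentence), 25)
def embed_bert(sentence):  return np.random.rand(len(sentence), 768)

embedders = [embed_glove, embed_char, embed_bert]  # L = 3 candidate types

def word_representations(sentence):
    # v_i = [v_i^1; v_i^2; ...; v_i^L]: concatenate along the hidden dimension,
    # giving a matrix V of shape (n_words, d) with d = sum of the d^l.
    return np.concatenate([embed(sentence) for embed in embedders], axis=-1)

V = word_representations(["The", "cat", "sat"])
print(V.shape)  # (3, 893)
```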

3.2 Search Space Design

The neural architecture search space can be represented as a set of neural networks (Elsken et al., 2019). A neural network can be represented as a directed acyclic graph with a set of nodes and directed edges. Each node represents an operation, while each edge represents the inputs and outputs between these nodes.
In ACE, we represent each embedding candidate as a node. The input to the nodes is the input sentence x, and the outputs are the embeddings v^l. Since we concatenate the embeddings as the word representation of the task model, there is no connection between nodes in our search space. Therefore, the search space can be significantly reduced. For each node, there are many options to extract word features. Taking BERT embeddings as an example, Devlin et al. (2019) concatenated the last four layers as word features, while Kondratyuk and Straka (2019) applied a weighted sum of all twelve layers. However, the empirical results (Devlin et al., 2019) do not show a significant difference in accuracy. We follow the typical usage of each embedding to further reduce the search space. As a result, each embedding only has a fixed operation and the resulting search space contains 2^L − 1 possible combinations of nodes.

In NAS, weight sharing (Pham et al., 2018a) shares the weights of structures when training different neural architectures to reduce the training cost. In comparison, we fix the weights of the pretrained embedding candidates in ACE, except for the character embeddings. Instead of sharing the parameters of the embeddings, we share the parameters of the task model at each step of the search. However, the hidden size of the word representation varies over concatenations, making weight sharing of structured prediction models difficult. Instead of deciding whether each node exists in the graph, we keep all nodes in the search space and add an additional operation for each node to indicate whether the embedding is masked out. To represent the selected concatenation, we use a binary vector a = [a_1, · · · , a_l, · · · , a_L] as a mask over the embeddings which are not selected:

$$v_i = [v_i^1 a_1; \dots; v_i^l a_l; \dots; v_i^L a_L] \quad (1)$$

where a_l is a binary variable. Since the input V is fed to a linear layer in the BiLSTM layer, multiplying the mask with the embeddings is equivalent to directly concatenating the selected embeddings:

$$W^\top v_i = \sum_{l=1}^{L} W_l^\top v_i^l a_l \quad (2)$$

where W = [W_1; W_2; . . . ; W_L], W ∈ R^{d×h} and W_l ∈ R^{d^l×h}. Therefore, the model weights can be shared after applying the embedding mask to the concatenation of all embedding candidates. Another benefit of our search space design is that we can remove the unused embedding candidates and the corresponding weights in W for a lighter task model after the best concatenation is found by ACE.
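The following sketch (with illustrative dimensions, not the actual model configuration) checks the equivalence behind Eqs. 1–2: multiplying each candidate's block by its mask bit before the shared linear layer gives the same result as concatenating only the selected embeddings and using the matching rows of W, since unselected blocks contribute zero:

```python
import numpy as np

dims = [100, 25, 768]            # hidden sizes d^l of the L candidates (illustrative)
h = 256                          # output size of the shared linear layer
rng = np.random.default_rng(0)
W_blocks = [rng.standard_normal((d_l, h)) for d_l in dims]  # W = [W_1; ...; W_L]

v_blocks = [rng.standard_normal(d_l) for d_l in dims]       # v_i^1, ..., v_i^L
a = np.array([1, 0, 1])          # binary mask: the 2nd embedding is dropped

# Eq. 1-2: mask the full concatenation, then apply the full W.
v_masked = np.concatenate([v * a_l for v, a_l in zip(v_blocks, a)])
out_masked = np.concatenate(W_blocks, axis=0).T @ v_masked

# Equivalent: use only the selected blocks and the matching rows of W.
out_selected = sum(W_blocks[l].T @ v_blocks[l] for l in range(3) if a[l])

print(np.allclose(out_masked, out_selected))  # True
```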
3.3 Searching in the Space

During the search, the controller iteratively generates the embedding mask for the task model. We use parameters θ = [θ_1; θ_2; . . . ; θ_L] for the controller instead of the RNN structure applied in previous approaches (Zoph and Le, 2017; Zoph et al., 2018). The probability of selecting a concatenation a is $P^{\mathrm{ctrl}}(a; \theta) = \prod_{l=1}^{L} P_l^{\mathrm{ctrl}}(a_l; \theta_l)$. Each element a_l of a is sampled independently from a Bernoulli distribution, which is defined as:

$$P_l^{\mathrm{ctrl}}(a_l; \theta_l) = \begin{cases} \sigma(\theta_l) & a_l = 1 \\ 1 - P_l^{\mathrm{ctrl}}(a_l{=}1; \theta_l) & a_l = 0 \end{cases} \quad (3)$$

where σ is the sigmoid function. Given the mask, the task model is trained until convergence and returns an accuracy R on the development set. As the accuracy cannot be back-propagated to the controller, we use reinforcement learning for optimization: the accuracy R is used as the reward signal to train the controller. The controller's target is to maximize the expected reward J(θ) = E_{P^ctrl(a;θ)}[R] through the policy gradient method (Williams, 1992). In our approach, since calculating the exact expectation is intractable, the gradient of J(θ) is approximated by sampling only one selection following the distribution P^ctrl(a; θ) at each step, for training efficiency:

$$\nabla_\theta J(\theta) \approx \sum_{l=1}^{L} \nabla_\theta \log P_l^{\mathrm{ctrl}}(a_l; \theta_l)(R - b) \quad (4)$$

where b is a baseline function used to reduce the high variance of the update.
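As a sketch of Eqs. 3–4 in plain numpy (the task model is abstracted away, and the learning rate and scores are made-up values): each mask bit is an independent Bernoulli variable with success probability σ(θ_l), and for this parameterization the gradient of the log-probability has the closed form a_l − σ(θ_l), so the policy gradient update needs no automatic differentiation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_mask(theta, rng):
    # Eq. 3: a_l ~ Bernoulli(sigmoid(theta_l)), sampled independently.
    return (rng.random(theta.shape) < sigmoid(theta)).astype(float)

def reinforce_grad(theta, a, reward, baseline):
    # Eq. 4: grad_theta log P_l(a_l; theta_l) = a_l - sigmoid(theta_l)
    # for a sigmoid-parameterized Bernoulli, scaled by the advantage (R - b).
    return (a - sigmoid(theta)) * (reward - baseline)

rng = np.random.default_rng(0)
theta = np.zeros(11)              # L = 11 candidates, each selected with p = 0.5 initially
a = sample_mask(theta, rng)
grad = reinforce_grad(theta, a, reward=0.93, baseline=0.92)
theta += 0.1 * grad               # gradient ascent on J(theta)
```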
The baseline can usually be the highest accuracy seen during the search process. Instead of merely using the highest development accuracy over the search process as the baseline, we design a reward function based on how each embedding candidate contributes to accuracy changes, utilizing the development scores of all searched concatenations. We use a binary vector |a^t − a^i| to represent the change between the current embedding concatenation a^t at time step t and a^i at a previous time step i. We then define the reward function as:

$$r^t = \sum_{i=1}^{t-1} (R_t - R_i)\,|a^t - a^i| \quad (5)$$
Figure 1: The main paradigm of our approach is shown in the middle, with an example of the reward function on the left and an example of a concatenation action on the right.

where r^t is a vector of length L representing the reward of each embedding candidate, and R_t and R_i are the rewards at time steps t and i. When the Hamming distance Hamm(a^t, a^i) between two concatenations gets larger, the changed candidates' contribution to the accuracy change becomes less noticeable, and the controller may be misled into rewarding a candidate that is not actually helpful. We apply a discount factor to reduce the reward for two concatenations with a large Hamming distance to alleviate this issue. Our final reward function is:

$$r^t = \sum_{i=1}^{t-1} (R_t - R_i)\,\gamma^{\mathrm{Hamm}(a^t, a^i)-1}\,|a^t - a^i| \quad (6)$$

where γ ∈ (0, 1). Eq. 4 is then reformulated as:

$$\nabla_\theta J_t(\theta) \approx \sum_{l=1}^{L} \nabla_\theta \log P_l^{\mathrm{ctrl}}(a_l^t; \theta_l)\, r_l^t \quad (7)$$

3.4 Training

To train the controller, we use a dictionary D to store the concatenations and the corresponding validation scores. At t = 1, we train the task model with all embedding candidates concatenated. From t = 2, we repeat the following steps until a maximum iteration T:

1. Sample a concatenation a^t based on the probability distribution in Eq. 3.

2. Train the task model with a^t following Eq. 1 and evaluate the model on the development set to get the accuracy R_t.

3. Given the concatenation a^t, the accuracy R_t and D, compute the gradient of the controller following Eq. 7 and update the parameters of the controller.

4. Add a^t and R_t into D, and set t = t + 1.

When sampling a^t, we avoid selecting the previous concatenation a^{t−1} and the all-zero vector (i.e., selecting no embedding). If a^t is already in the dictionary D, we compare R_t with the value in the dictionary and keep the higher one.
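Putting Sections 3.3–3.4 together, the search procedure might be sketched as below. Here train_and_evaluate is a hypothetical stand-in for training the task model with a given mask and returning its development accuracy, the hyper-parameter values are illustrative, the dictionary keeps the best score per concatenation (the "keep the higher one" rule), and the sharing of task-model parameters across steps is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def search(train_and_evaluate, L, T=30, lr=0.1, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(L)
    history = {}                                   # dictionary D: mask -> best dev score

    a = np.ones(L)                                 # t = 1: concatenate all candidates
    history[tuple(a)] = train_and_evaluate(a)
    prev = a

    for t in range(2, T + 1):
        while True:                                # resample: skip a^{t-1} and all-zeros
            a = (rng.random(L) < sigmoid(theta)).astype(float)
            if a.any() and not np.array_equal(a, prev):
                break
        R_t = train_and_evaluate(a)

        # Eq. 6: discounted reward over all previously searched concatenations.
        r = np.zeros(L)
        for a_i, R_i in history.items():
            change = np.abs(a - np.array(a_i))     # |a^t - a^i|
            hamm = int(change.sum())
            if hamm > 0:
                r += (R_t - R_i) * gamma ** (hamm - 1) * change

        # Eq. 7: policy gradient update with the per-candidate reward vector r^t.
        theta += lr * (a - sigmoid(theta)) * r

        key = tuple(a)
        history[key] = max(R_t, history.get(key, R_t))
        prev = a

    return max(history, key=history.get)           # best concatenation found
```

A call like search(train_and_evaluate, L=11) then returns the best concatenation found, mirroring the 2^11 − 1 search space used in the experiments below.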

2. Train the task model with at following Eq. 1 • Chunking: We use CoNLL 2000 (Tjong
and evaluate the model on the development set Kim Sang and Buchholz, 2000) for chunking.
to get the accuracy Rt . Since there is no standard development set for
CoNLL 2000 dataset, we split 10% of the train-
3. Given the concatenation at , accuracy Rt and D, ing data as the development set.
compute the gradient of the controller following
Eq. 7 and update the parameters of controller. • Aspect Extraction: Aspect extraction is a sub-
task of aspect-based sentiment analysis (Pontiki
4. Add at and Rt into D, set t = t + 1. et al., 2014, 2015, 2016). The datasets are from
When sampling at , we avoid selecting the previous the laptop and restaurant domain of SemEval
concatenation at−1 and the all-zero vector (i.e., se- 2
https://en.wikipedia.org/wiki/List_
lecting no embedding). If at is in the dictionary D, of_ISO_639-1_codes
         NER                      POS                   AE
         de    en    es    nl     Ritter ARK   TB-v2   14Lap 14Res 15Res 16Res  es    nl    ru    tr
All      83.1  92.4  88.9  89.8   90.6   92.1  94.6    82.7  88.5  74.2  73.2   74.6  75.0  67.1  67.5
Random   84.0  92.6  88.8  91.9   91.3   92.6  94.6    83.6  88.1  73.5  74.7   75.0  73.6  68.0  70.0
ACE      84.2  93.0  88.9  92.1   91.7   92.8  94.8    83.9  88.6  74.9  75.6   75.7  75.3  70.6  71.1

         CHUNK        DP           SDP                                                    AVG
         CoNLL 2000   UAS   LAS    DM-ID  DM-OOD  PAS-ID  PAS-OOD  PSD-ID  PSD-OOD
All      96.7         96.7  95.1   94.3   90.8    94.6    92.9     82.4    81.7           85.3
Random   96.7         96.8  95.2   94.4   90.8    94.6    93.0     82.3    81.8           85.7
ACE      96.8         96.9  95.3   94.5   90.9    94.5    93.1     82.5    82.1           86.2

Table 1: Comparison with concatenating all embeddings and random search baselines on 6 tasks.

• Syntactic Dependency Parsing: We use Penn Treebank (PTB) 3.0 with the same dataset preprocessing as Ma et al. (2018).

• Semantic Dependency Parsing: We use the DM, PAS and PSD datasets for semantic dependency parsing (Oepen et al., 2014) from the SemEval 2015 shared task (Oepen et al., 2015). The three datasets have the same sentences but different formalisms. We use the standard split for SDP, which provides in-domain and out-of-domain test sets for each dataset.

Among these tasks, NER, POS tagging, chunking and aspect extraction have sequence-structured outputs, while dependency parsing and semantic dependency parsing have graph-structured outputs. POS tagging, chunking and DP are syntactic structured prediction tasks, while NER, AE and SDP are semantic structured prediction tasks.

We train the controller for 30 steps and save the task model with the highest accuracy on the development set as the final model for testing. Please refer to Appendix A for more details of other settings.

4.2 Embeddings

Basic Settings: As the candidate embeddings on English datasets, we use the language-specific models of ELMo, Flair, base BERT, GloVe word embeddings, fastText word embeddings, non-contextual character embeddings (Lample et al., 2016), multilingual Flair (M-Flair), M-BERT and XLM-R embeddings. The size of the search space in our experiments is 2^11 − 1 = 2047.³ For the language-specific models of other languages, please refer to Appendix A for more details. In AE, there are no Russian-specific BERT, Flair or ELMo embeddings available, and no Turkish-specific Flair or ELMo embeddings available. We use the corresponding English embeddings instead, so that the search spaces of these datasets are almost identical to those of the other datasets. All embeddings are fixed during training, except that the character embeddings are trained on the task. The empirical results are reported in Section 4.3.1.

Embedding Fine-tuning: A common approach to get better accuracy is fine-tuning transformer-based embeddings. In sequence labeling, most work follows the fine-tuning pipeline of BERT, which connects the BERT model with a linear layer for word-level classification. However, when multiple embeddings are concatenated, fine-tuning a specific group of embeddings becomes difficult because of complicated hyper-parameter settings and massive GPU memory consumption. To alleviate this problem, we first fine-tune the transformer-based embeddings on the task and then concatenate these embeddings together with the other embeddings in the basic setting to apply ACE. The empirical results are reported in Section 4.3.2.

4.3 Results

We use the following abbreviations in our experiments: UAS: Unlabeled Attachment Score; LAS: Labeled Attachment Score; ID: In-domain test set; OOD: Out-of-domain test set. We use language codes for languages in NER and AE.

³ Flair embeddings have two models (forward and backward) for each language.
4.3.1 Comparison With Baselines

To show the effectiveness of our approach, we compare it with two strong baselines. For the first one, we let the task model learn by itself the contribution of each embedding candidate that is helpful to the task. We set a to all-ones (i.e., the concatenation of all the embeddings) and train the task model (All). The linear layer weight W in Eq. 2 reflects the contribution of each candidate. For the second one, we use random search (Random), a strong baseline in NAS (Li and Talwalkar, 2020). For Random, we run the same maximum number of iterations as in ACE. For the experiments, we report the averaged accuracy of 3 runs. Table 1 shows that ACE outperforms both baselines on 6 tasks over 23 test sets with only two exceptions. Comparing Random with All, Random outperforms All by 0.4 on average and surpasses the accuracy of All on 14 out of 23 test sets, which shows that concatenating all embeddings may not be the best solution for most structured prediction tasks. In general, searching for the concatenation for the word representation is essential in most cases, and our search design usually leads to better results than both baselines.

4.3.2 Comparison With State-of-the-Art Approaches

As we have shown, ACE has an advantage in searching for better embedding concatenations. We further show that ACE is competitive with or even stronger than state-of-the-art approaches. We additionally use XLNet (Yang et al., 2019) and RoBERTa as candidates for ACE. In some tasks, we have several additional settings to better compare with previous work. In NER, we also conduct a comparison on the revised version of the German datasets in the CoNLL 2006 shared task (Buchholz and Marsi, 2006). Recent work such as Yu et al. (2020) and Yamada et al. (2020) utilizes document contexts in the datasets. We follow their work and extract document embeddings for the transformer-based embeddings. Specifically, we follow the fine-tuning process of Yamada et al. (2020) to fine-tune the transformer-based embeddings over the document, except for the BERT and M-BERT embeddings. For BERT and M-BERT, we follow the document extraction process of Yu et al. (2020), because we find that the model with such document embeddings is significantly stronger than the model trained with the fine-tuning process of Yamada et al. (2020). In SDP, the state-of-the-art approaches used POS tags and lemmas as additional word features to the network. We add these two features to the embedding candidates and train the embeddings together with the task. We use the fine-tuned transformer-based embeddings on each task instead of the pretrained versions of these embeddings as the candidates.⁴

We additionally compare with a fine-tuned XLM-R model for NER, POS tagging, chunking and AE, and with a fine-tuned XLNet model for DP and SDP, which are strong fine-tuned models in most of the experiments. Results are shown in Tables 2, 3 and 4. They show that ACE with fine-tuned embeddings achieves state-of-the-art performance on all test sets, which demonstrates that finding a good embedding concatenation helps structured prediction tasks. We also find that ACE is stronger than the fine-tuned models, which shows the effectiveness of concatenating the fine-tuned embeddings.⁵

5 Analysis

5.1 Efficiency of Search Methods

To show how efficient our approach is compared with the random search algorithm, we compare the two algorithms in two respects on the CoNLL English NER dataset. The first is the best development accuracy during training. The left part of Figure 2 shows that ACE is consistently stronger than the random search algorithm on this task. The second is the accuracy of the searched concatenation at each time step. The right part of Figure 2 shows that the accuracy of ACE gradually increases and becomes stable as more concatenations are sampled.

5.2 Ablation Study on Reward Function Design

To show the effectiveness of the designed reward function, we compare our reward function (Eq. 6) with the reward function without the discount factor (Eq. 5) and the traditional reward function (the reward term in Eq. 4). We sample 2000 training sentences from the CoNLL English NER dataset for faster training and train the controller for 50 steps. Table 5 shows that both the discount factor and the binary vector |a^t − a^i| are helpful on both the development and test sets.

⁴ Please refer to the Appendix for more details about the embeddings.
⁵ We compare ACE with other fine-tuned embeddings in the Appendix.
NER
                        de    de06  en    es    nl
Baevski et al. (2019)   -     -     93.5  -     -
Straková et al. (2019)  85.1  -     93.4  88.8  92.7
Yu et al. (2020)        86.4  90.3  93.5  90.3  93.7
Yamada et al. (2020)    -     -     94.3  -     -
XLM-R+Fine-tune         87.7  91.4  94.1  89.3  95.3
ACE+Fine-tune           88.3  91.7  94.6  95.9  95.7

POS
                        Ritter  ARK   TB-v2
Owoputi et al. (2013)   90.4    93.2  94.6
Gui et al. (2017)       90.9    -     92.8
Gui et al. (2018)       91.2    92.4  -
Nguyen et al. (2020)    90.1    94.1  95.2
XLM-R+Fine-tune         92.3    93.7  95.4
ACE+Fine-tune           93.4    94.4  95.8

Table 2: Comparison with state-of-the-art approaches in NER and POS tagging. †: Models are trained on both the train and development sets.

CHUNK
                        CoNLL 2000
Akbik et al. (2018)     96.7
Clark et al. (2018)     97.0
Liu et al. (2019b)      97.3
Chen et al. (2020)      95.5
XLM-R+Fine-tune         97.0
ACE+Fine-tune           97.3

AE
                        14Lap  14Res  15Res  16Res  es    nl    ru    tr
Xu et al. (2018)†       84.2   84.6   72.0   75.4   -     -     -     -
Xu et al. (2019)        84.3   -      -      78.0   -     -     -     -
Wang et al. (2020a)     -      -      -      72.8   74.3  72.9  71.8  59.3
Wei et al. (2020)       82.7   87.1   72.7   77.7   -     -     -     -
XLM-R+Fine-tune         85.9   90.5   76.4   78.9   77.0  77.6  77.7  74.1
ACE+Fine-tune           87.4   92.0   80.3   81.3   79.9  80.5  79.4  81.9

Table 3: Comparison with state-of-the-art approaches in chunking and aspect extraction. †: We report the results reproduced by Wei et al. (2020).

Figure 2: Comparing the efficiency of random search (Random) and ACE. The x-axis is the number of time steps. The left y-axis is the averaged best validation accuracy on the CoNLL English NER dataset. The right y-axis is the averaged validation accuracy of the current selection.

5.3 Comparison with Embedding Weighting & Ensemble Approaches

We compare ACE with two more approaches to further show its effectiveness. One is a variant of All, which uses a weighting parameter b = [b_1, · · · , b_l, · · · , b_L] passed through a sigmoid function to weight each embedding candidate. Such an approach can explicitly learn the weight of each embedding during training instead of a binary mask. We call this approach All+Weight. The other is model ensemble, which trains the task model with each embedding candidate individually and uses the trained models to make a joint prediction on the test set. We use voting for the ensemble as it is simple and fast. For sequence labeling tasks, the models vote for the predicted label at each position. For DP, the models vote for the tree of each sentence. For SDP, the models vote for each potential labeled arc. We use the confidence of the model predictions to break ties when more than one label has the same vote count. We call this approach Ensemble. One of the benefits of voting is that it combines the predictions of the task models efficiently without any training process. We can search all possible 2^L − 1 model ensembles in a short period of time by caching the outputs of the models. Therefore, we search for the best ensemble of models on the development set and then evaluate that ensemble on the test set (Ensemble_dev). Moreover, we additionally search for the best ensemble on the test set for reference (Ensemble_test), which is the upper bound of the approach. We use the same setting as in Section 4.3.1 and select one dataset from each task. For NER, POS tagging, AE, and SDP, we use the CoNLL 2003 English, Ritter, 16Res, and DM datasets, respectively. The results are shown in Table 6.
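Because the model outputs are cached, searching all 2^L − 1 ensembles amounts to scoring every subset's voted predictions; a minimal sketch for sequence labeling follows (hypothetical data layout; ties are broken arbitrarily here rather than by model confidence, and token accuracy stands in for the task metric):

```python
from itertools import combinations
from collections import Counter

def vote(predictions):
    # predictions: a list of label sequences, one per model, for one sentence.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

def best_ensemble(cached, gold):
    # cached[m] holds model m's predicted label sequences on the dev set.
    models = list(cached)
    best = (0.0, None)
    for k in range(1, len(models) + 1):
        for subset in combinations(models, k):   # all 2^L - 1 non-empty subsets
            correct = total = 0
            for s, gold_seq in enumerate(gold):
                pred = vote([cached[m][s] for m in subset])
                correct += sum(p == g for p, g in zip(pred, gold_seq))
                total += len(gold_seq)
            if correct / total > best[0]:
                best = (correct / total, subset)
    return best
```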
DP
                        PTB
                        UAS   LAS
Zhou and Zhao (2019)†   97.2  95.7
Mrini et al. (2020)†    97.4  96.3
Li et al. (2020)        96.6  94.8
Zhang et al. (2020)     96.1  94.5
Wang and Tu (2020)      96.9  95.3
XLNet+Fine-tune         97.0  95.6
ACE+Fine-tune           97.2  95.8

SDP
                        DM            PAS           PSD
                        ID    OOD     ID    OOD     ID    OOD
He and Choi (2020)‡     94.6  90.8    96.1  94.4    86.8  79.5
D & M (2018)            93.7  88.9    93.9  90.6    81.0  79.4
Wang et al. (2019)      94.0  89.7    94.1  91.3    81.4  79.6
Jia et al. (2020)       93.6  89.1    -     -       -     -
F & G (2020)            94.4  91.0    95.1  93.4    82.6  82.0
XLNet+Fine-tune         94.2  90.6    94.8  93.4    82.7  81.8
ACE+Fine-tune           95.6  92.6    95.8  94.6    83.8  83.4

Table 4: Comparison with state-of-the-art approaches in DP and SDP. †: For reference, they additionally used constituency dependencies in training. We also find that the PTB dataset used by Mrini et al. (2020) is not identical to the dataset in previous work such as Zhang et al. (2020) and Wang and Tu (2020). ‡: For reference, we confirmed with the authors of He and Choi (2020) that they used a different data pre-processing script from previous work.

                        DEV     TEST
ACE                     93.18   90.00
No discount (Eq. 5)     92.98   89.90
Simple (Eq. 4)          92.89   89.82

Table 5: Comparison of reward functions.
                 NER   POS   AE    CHK   DP           SDP
                                         UAS   LAS    ID    OOD
All              92.4  90.6  73.2  96.7  96.7  95.1   94.3  90.8
Random           92.6  91.3  74.7  96.7  96.8  95.2   94.4  90.8
ACE              93.0  91.7  75.6  96.8  96.9  95.3   94.5  90.9
All+Weight       92.7  90.4  73.7  96.7  96.7  95.1   94.3  90.7
Ensemble         92.2  90.6  68.1  96.5  96.1  94.3   94.1  90.3
Ensemble_dev     92.2  90.8  70.2  96.7  96.8  95.2   94.3  90.7
Ensemble_test    92.7  91.4  73.9  96.7  96.8  95.2   94.4  90.8

Table 6: A comparison among All, Random, ACE, All+Weight and Ensemble. CHK: chunking.
to search for better embedding concatenations. We
take the change of embedding concatenations into
performs all the settings of these approaches and
the reward function design and show that our new
even Ensembletest , which shows the effective-
reward function is stronger than the simpler ones.
ness of ACE and the limitation of ensemble mod-
Results show that ACE outperforms strong base-
els. All, All+Weight and Ensembledev are
lines. Together with fine-tuned embeddings, ACE
competitive in most of the cases and there is no
achieves state-of-the-art performance in 6 tasks
clear winner of these approaches on all the datasets.
over 21 datasets.
These results show the strength of embedding con-
catenation. Concatenating the embeddings incor- Acknowledgments
porates information from all the embeddings and This work was supported by the National Natu-
forms stronger word representations for the task ral Science Foundation of China (61976139) and
model, while in model ensemble, it is difficult for by Alibaba Group through Alibaba Innovative Re-
the individual task models to affect each other. search Program. We thank Chengyue Jiang for his
comments and suggestions on writing.
References

Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Peter J. Angeline, Gregory M. Saunders, and Jordan B. Pollack. 1994. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54–65.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5360–5369, Hong Kong, China. Association for Computational Linguistics.

Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2017. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, New York City. Association for Computational Linguistics.

Luoxin Chen, Weitong Ruan, Xinyue Liu, and Jianhua Lu. 2020. SeqVAT: Virtual adversarial training for semi-supervised sequence labeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8801–8811, Online. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925, Brussels, Belgium. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Steven J. DeRose. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31–39.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490, Melbourne, Australia. Association for Computational Linguistics.

Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. 2018. Simple and efficient architecture search for convolutional neural networks. In International Conference on Learning Representations Workshop.

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. Journal of Machine Learning Research, 20:1–21.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2020. Transition-based semantic dependency parsing with pointer networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7035–7046, Online. Association for Computational Linguistics.

Dario Floreano, Peter Dürr, and Claudio Mattiussi. 2008. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62.

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. 2019. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 42–47, Portland, Oregon, USA. Association for Computational Linguistics.
David E. Goldberg and Kalyanmoy Deb. 1991. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms, volume 1, pages 69–93. Elsevier.

Tao Gui, Qi Zhang, Jingjing Gong, Minlong Peng, Di Liang, Keyu Ding, and Xuanjing Huang. 2018. Transferring from formal newswire domain with hypernet for Twitter POS tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2540–2549, Brussels, Belgium. Association for Computational Linguistics.

Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng, and Xuanjing Huang. 2017. Part-of-speech tagging for Twitter with adversarial neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2411–2420, Copenhagen, Denmark. Association for Computational Linguistics.

Han He and Jinho Choi. 2020. Establishing strong baselines for the new decade: Sequence tagging, syntactic and semantic parsing with BERT. In The Thirty-Third International FLAIRS Conference.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA. Association for Computing Machinery.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708.

Zixia Jia, Youmi Ma, Jiong Cai, and Kewei Tu. 2020. Semi-supervised semantic dependency parsing using CRF autoencoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6795–6805, Online. Association for Computational Linguistics.

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pages 2342–2350.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one MST parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1744–1753, Austin, Texas. Association for Computational Linguistics.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Liam Li and Ameet Talwalkar. 2020. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence, pages 367–377. PMLR.

Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. 2019. Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 34–41, Hong Kong, China. Association for Computational Linguistics.

Zuchao Li, Hai Zhao, and Kevin Parnow. 2020. Global greedy dependency parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8319–8326.

Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. 2019a. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92.

Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2018a. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations.

Yijia Liu, Yi Zhu, Wanxiang Che, Bing Qin, Nathan Schneider, and Noah A. Smith. 2018b. Parsing tweets into Universal Dependencies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 965–975, New Orleans, Louisiana. Association for Computational Linguistics.
Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2019b. GCDT: A global context enhanced deep transition architecture for sequence labeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2431–2441, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.

Jouni Luoma and Sampo Pyysalo. 2020. Exploring cross-sentence contexts for named entity recognition with BERT. In Proceedings of the 28th International Conference on Computational Linguistics, pages 904–914, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1414, Melbourne, Australia. Association for Computational Linguistics.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523–530, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Geoffrey Miller, Peter Todd, and Shailesh Hegde. 1989. Designing neural networks using genetic algorithms. In 3rd International Conference on Genetic Algorithms, pages 379–384.

Khalil Mrini, Franck Dernoncourt, Quan Hung Tran, Trung Bui, Walter Chang, and Ndapa Nakashole. 2020. Rethinking self-attention: Towards interpretability in neural parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 731–742, Online. Association for Computational Linguistics.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English Tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajic, and Zdenka Uresova. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 915–926.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajic, Angelina Ivanova, and Yi Zhang. 2014. SemEval 2014 task 8: Broad-coverage semantic dependency parsing. SemEval 2014.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–390, Atlanta, Georgia. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018a. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pages 4095–4104.

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018b. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30, San Diego, California. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789.

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. 2017. Large-scale evolution of image classifiers. In International Conference on Machine Learning, pages 2902–2911.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.

David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In International Conference on Machine Learning.

Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127.

Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy. Association for Computational Linguistics.

Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. 2017. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 497–504.

Beth M. Sundheim. 1995. Named entity task definition, version 2.1. In Proceedings of the Sixth Message Understanding Conference, pages 319–332.

Richard S. Sutton and Andrew G. Barto. 1992. Reinforcement Learning: An Introduction. MIT Press.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Editions Klincksieck.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Xinyu Wang, Jingxian Huang, and Kewei Tu. 2019. Second-order semantic dependency parsing with end-to-end neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4609–4618, Florence, Italy. Association for Computational Linguistics.
57th Annual Meeting of the Association for Com- Shijie Wu and Mark Dredze. 2019. Beto, bentz, be-
putational Linguistics, pages 4609–4618, Florence, cas: The surprising cross-lingual effectiveness of
Italy. Association for Computational Linguistics. BERT. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, and the 9th International Joint Conference on Natu-
Fei Huang, and Kewei Tu. 2020a. Structure-level ral Language Processing (EMNLP-IJCNLP), pages
knowledge distillation for multilingual sequence la- 833–844, Hong Kong, China. Association for Com-
beling. In Proceedings of the 58th Annual Meet- putational Linguistics.
ing of the Association for Computational Linguistics,
pages 3317–3330, Online. Association for Computa- L. Xie and A. Yuille. 2017. Genetic cnn. In 2017
tional Linguistics. IEEE International Conference on Computer Vision
(ICCV), pages 1388–1397.
Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang,
Zhongqiang Huang, Fei Huang, and Kewei Tu. Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT
2021a. Improving Named Entity Recognition by Ex- post-training for review reading comprehension and
ternal Context Retrieving and Cooperative Learning. aspect-based sentiment analysis. In Proceedings of
In the Joint Conference of the 59th Annual Meet- the 2019 Conference of the North American Chap-
ing of the Association for Computational Linguistics ter of the Association for Computational Linguistics:
and the 11th International Joint Conference on Natu- Human Language Technologies, Volume 1 (Long
ral Language Processing (ACL-IJCNLP 2021). As- and Short Papers), pages 2324–2335, Minneapolis,
sociation for Computational Linguistics. Minnesota. Association for Computational Linguis-
tics.
Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang,
Huang Zhongqiang, Fei Huang, and Kewei Tu. Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2018. Dou-
2020b. More embeddings, better sequence labelers? ble embeddings and CNN-based sequence labeling
In Findings of EMNLP, Online. for aspect extraction. In Proceedings of the 56th An-
nual Meeting of the Association for Computational
Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia,
Linguistics (Volume 2: Short Papers), pages 592–
Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei
598, Melbourne, Australia. Association for Compu-
Huang, and Kewei Tu. 2021b. Structural Knowl-
tational Linguistics.
edge Distillation: Tractably Distilling Information
for Structured Predictor. In the Joint Conference
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki
of the 59th Annual Meeting of the Association for
Takeda, and Yuji Matsumoto. 2020. LUKE: Deep
Computational Linguistics and the 11th Interna-
contextualized entity representations with entity-
tional Joint Conference on Natural Language Pro-
aware self-attention. In Proceedings of the 2020
cessing (ACL-IJCNLP 2021). Association for Com-
Conference on Empirical Methods in Natural Lan-
putational Linguistics.
guage Processing (EMNLP), pages 6442–6454, On-
Xinyu Wang and Kewei Tu. 2020. Second-order neural line. Association for Computational Linguistics.
dependency parsing with message passing and end-
to-end training. In Proceedings of the 1st Confer- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-
ence of the Asia-Pacific Chapter of the Association bonell, Russ R Salakhutdinov, and Quoc V Le. 2019.
for Computational Linguistics and the 10th Interna- Xlnet: Generalized autoregressive pretraining for
tional Joint Conference on Natural Language Pro- language understanding. In Advances in neural in-
cessing, pages 93–99, Suzhou, China. Association formation processing systems, pages 5753–5763.
for Computational Linguistics.
Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020.
Zhenkai Wei, Yu Hong, Bowei Zou, Meng Cheng, and Named entity recognition as dependency parsing. In
Jianmin Yao. 2020. Don’t eclipse your arts due to Proceedings of the 58th Annual Meeting of the Asso-
small discrepancies: Boundary repositioning with ciation for Computational Linguistics, pages 6470–
a pointer network for aspect extraction. In Pro- 6476, Online. Association for Computational Lin-
ceedings of the 58th Annual Meeting of the Asso- guistics.
ciation for Computational Linguistics, pages 3678–
3684, Online. Association for Computational Lin- Yu Zhang, Zhenghua Li, and Min Zhang. 2020. Effi-
guistics. cient second-order TreeCRF for neural dependency
parsing. In Proceedings of the 58th Annual Meet-
Ronald J Williams. 1992. Simple statistical gradient- ing of the Association for Computational Linguistics,
following algorithms for connectionist reinforce- pages 3295–3305, Online. Association for Computa-
ment learning. Machine learning, 8(3-4):229–256. tional Linguistics.

Martin Wistuba. 2018. Deep learning architecture Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and
search by neuro-cell-based evolution with function- Cheng-Lin Liu. 2018. Practical block-wise neural
preserving mutations. In Joint European Confer- network architecture generation. In Proceedings of
ence on Machine Learning and Knowledge Discov- the IEEE conference on computer vision and pattern
ery in Databases, pages 243–258. Springer. recognition, pages 2423–2432.
Junru Zhou and Hai Zhao. 2019. Head-Driven Phrase Structure Grammar parsing on Penn Treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2396–2408, Florence, Italy. Association for Computational Linguistics.

Wei Zhu, Xiaoling Wang, Xipeng Qiu, Yuan Ni, and Guotong Xie. 2020. AutoTrans: Automating transformer design via reinforced architecture search. arXiv preprint arXiv:2009.02070.

Barret Zoph and Quoc V Le. 2017. Neural architecture search with reinforcement learning. In International Conference on Learning Representations.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710.
A Detailed Configurations

Evaluation  To evaluate our models, we use F1 score for NER, chunking, and AE, accuracy for POS tagging, unlabeled attachment score (UAS) and labeled attachment score (LAS) for DP, and labeled F1 score for SDP.
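For concreteness, the span-level F1 used for NER, chunking, and AE can be computed with the seqeval library; this is our illustration, and the gold and predicted label sequences below are hypothetical:

from seqeval.metrics import f1_score

# Hypothetical gold and predicted BIO label sequences for one sentence.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]

# seqeval scores whole labeled spans rather than individual tags, so the
# missed LOC span lowers recall: precision = 1/1, recall = 1/2, F1 = 0.67.
print(f"span-level F1 = {f1_score(gold, pred):.4f}")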
Task Models and Controller  For sequence-structured tasks (i.e., NER, POS tagging, chunking, and aspect extraction), we use a batch size of 32 sentences and an SGD optimizer with a learning rate of 0.1. We anneal the learning rate by 0.5 when there is no accuracy improvement on the development set for 5 epochs, and set the maximum number of training epochs to 150. For graph-structured tasks (i.e., DP and SDP), we use Adam (Kingma and Ba, 2015) with a learning rate of 0.002, anneal the learning rate by 0.75 every 5000 iterations following Dozat and Manning (2017), and set the maximum number of training epochs to 300. For DP, we run the maximum spanning tree algorithm (McDonald et al., 2005) to output valid trees at test time. We fix the hyper-parameters of the task models.
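As a minimal sketch of this schedule (our illustration, not the authors' released code; the linear module is a stand-in for the actual task model built on the concatenated embeddings):

import torch

task_model = torch.nn.Linear(128, 16)  # stand-in for the task model
optimizer = torch.optim.SGD(task_model.parameters(), lr=0.1)

# Sequence-structured tasks: anneal the learning rate by 0.5 when dev
# accuracy has not improved for 5 epochs. Graph-structured tasks would
# instead use Adam (lr=0.002) annealed by 0.75 every 5000 iterations.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5)

for epoch in range(150):  # maximum of 150 training epochs
    dev_accuracy = 0.0    # placeholder: train one epoch, then evaluate
    scheduler.step(dev_accuracy)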
We tune the learning rate for the controller over {0.1, 0.2, 0.3, 0.4, 0.5} and the discount factor over {0.1, 0.3, 0.5, 0.7, 0.9} on the same dataset as in Section 5.2, using grid search; a learning rate of 0.1 and a discount factor of 0.5 perform best on the development set. The controller's parameters are initialized to all zeros so that each candidate is selected evenly in the first two time steps, and we use Stochastic Gradient Descent (SGD) to optimize the controller. The training time depends on the task and dataset size. Taking the CoNLL English NER dataset as an example, it takes 45 GPU hours to train the controller for 30 steps on a single Tesla P100 GPU, which is an acceptable training time in practice.
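To make the controller side concrete, the following is a minimal sketch (ours, not the authors' code) of a Bernoulli-sampling controller whose parameters start at zero and are updated by SGD with a REINFORCE-style gradient (Williams, 1992); the reward is a placeholder for the development-set feedback:

import torch

num_candidates = 12  # number of candidate embeddings (hypothetical)
params = torch.zeros(num_candidates, requires_grad=True)  # all-zero init
optimizer = torch.optim.SGD([params], lr=0.1)

for step in range(30):                # e.g., 30 controller steps
    probs = torch.sigmoid(params)     # selection probability per candidate
    mask = torch.bernoulli(probs)     # sampled 0/1 concatenation vector
    reward = 0.0                      # placeholder: reward from dev accuracy
    # Increase the log-probability of the sampled mask in proportion
    # to the reward (REINFORCE).
    log_prob = (mask * probs.clamp_min(1e-8).log()
                + (1 - mask) * (1 - probs).clamp_min(1e-8).log()).sum()
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

With all-zero parameters, each candidate is sampled with probability 0.5, which matches the even selection in the first time steps described above.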
Sources of Embeddings  The sources of the embeddings that we use are listed in Table 7.

B Additional Analysis

B.1 Document-Level and Sentence-Level Representations

Recently, models with document-level word representations extracted from transformer-based embeddings have significantly outperformed models with sentence-level word representations in NER (Devlin et al., 2019; Yu et al., 2020; Yamada et al., 2020). However, there are many application scenarios in which document contexts are unavailable. We therefore replace the document-level word representations from transformer-based embeddings (i.e., XLM-R and BERT embeddings) with sentence-level word representations. Results are shown in Table 8. We report the test results of All to show how the gap between ACE and All changes with different kinds of representations, and, following Yamada et al. (2020), we report the test accuracy of the models with the highest development accuracy for a fair comparison. Empirical results show that document-level representations significantly improve the accuracy of ACE. Moreover, the average accuracy gap between ACE and All widens from 0.7 with sentence-level representations to 1.7 with document-level representations, which shows that the advantage of ACE becomes stronger with document-level representations.
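The contrast between the two settings can be sketched with the transformers library (our illustration; the sentence and its surrounding context are hypothetical, and the offset bookkeeping needed to keep only the target sentence's positions is elided):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large")

sentence = "ACE concatenates embeddings ."
document = "Some surrounding document text . " + sentence

with torch.no_grad():
    # Sentence-level: the encoder sees the sentence in isolation.
    sent_repr = model(**tokenizer(sentence, return_tensors="pt")).last_hidden_state
    # Document-level: the encoder also sees surrounding context; one would
    # then keep only the hidden states aligned to the target sentence.
    doc_repr = model(**tokenizer(document, return_tensors="pt")).last_hidden_state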
B.2 Fine-tuned Models Versus ACE

To fine-tune the embeddings, we use the AdamW (Loshchilov and Hutter, 2018) optimizer with a learning rate of 5 × 10⁻⁶ and train the contextualized embeddings on the task for 10 epochs. We use a batch size of 32 for BERT and M-BERT, and a batch size of 4 for XLM-R, RoBERTa, and XLNet. A comparison between ACE and the fine-tuned embeddings that we use in ACE is shown in Tables 9, 10, and 11. The results show that ACE can further improve the accuracy of fine-tuned models.
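A minimal sketch of this fine-tuning configuration (our illustration; the encoder choice and the elided training loop are placeholders):

import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-cased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=5e-6)

for epoch in range(10):  # fine-tune for 10 epochs
    # Iterate over batches (size 32 for BERT/M-BERT; 4 for XLM-R,
    # RoBERTa, and XLNet), compute the task loss, then:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()
    pass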
E MBEDDING R ESOURCE URL
GloVe Pennington et al. (2014) nlp.stanford.edu/projects/glove
fastText Bojanowski et al. (2017) github.com/facebookresearch/fastText
ELMo Peters et al. (2018) github.com/allenai/allennlp
ELMo (Other languages) Schuster et al. (2019) github.com/TalSchuster/CrossLingualContextualEmb
BERT Devlin et al. (2019) huggingface.co/bert-base-cased
M-BERT Devlin et al. (2019) huggingface.co/bert-base-multilingual-cased
BERT (Dutch) wietsedv huggingface.co/wietsedv/bert-base-dutch-cased
BERT (German) dbmdz huggingface.co/bert-base-german-dbmdz-cased
BERT (Spanish) dccuchile huggingface.co/dccuchile/bert-base-spanish-wwm-cased
BERT (Turkish) dbmdz huggingface.co/dbmdz/bert-base-turkish-cased
XLM-R Conneau et al. (2020) huggingface.co/xlm-roberta-large
RoBERTa Liu et al. (2019c) huggingface.co/roberta-large
XLNet Yang et al. (2019) huggingface.co/xlnet-large-cased

Table 7: The embeddings we used in our experiments. The URL is where we downloaded the embeddings.
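As a hedged illustration of how several of the embeddings in Table 7 can be concatenated in practice, the following sketch uses the flair library; it is our example setup, not necessarily the exact configuration used in the experiments:

from flair.data import Sentence
from flair.embeddings import (FlairEmbeddings, StackedEmbeddings,
                              TransformerWordEmbeddings, WordEmbeddings)

# Concatenate a few candidates from Table 7: GloVe, forward Flair, XLM-R.
stacked = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    TransformerWordEmbeddings("xlm-roberta-large"),
])

sentence = Sentence("ACE concatenates embeddings .")
stacked.embed(sentence)
# Each token now carries the concatenation of all three embeddings.
print(sentence[0].embedding.shape)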

de de06 en es nl

All+sent 86.8 90.1 93.3 90.0 94.4
ACE+sent 87.1 90.5 93.6 92.4 94.6
BERT (2019) - - 92.8 - -
Akbik et al. (2019) - 88.3 93.2 - 90.4
Yu et al. (2020) 86.4 90.3 93.5 90.3 94.7
Yamada et al. (2020) - - 94.3 - -
Luoma and Pyysalo (2020) 87.3 - 93.7 88.3 93.5
Wang et al. (2021a) - - 93.9 - -
All+doc 87.5 90.8 94.0 90.7 93.7
ACE+doc 88.3 91.7 94.6 95.9 95.7

Table 8: Comparison of models with and without document contexts on NER. +sent/+doc: models with sentence-/document-level embeddings.

B.3 Retraining

Most work in NAS (Zoph and Le, 2017; Zoph et al., 2018; Pham et al., 2018b; So et al., 2019; Zhu et al., 2020) retrains the searched neural architecture from scratch, so that the hyper-parameters of the searched model can be modified or the model can be trained on larger datasets. To show whether our searched embedding concatenations are helpful to the task, we retrain the task model with the searched embedding concatenations on the same dataset from scratch. For this experiment, we use the same dataset settings as in Section 4.3.1 and train the searched embedding concatenation of each ACE run 3 times (therefore, 9 runs for each dataset).

Table 12 compares the retrained models with the searched embedding concatenations from ACE and All. The results show that the retrained models are competitive with ACE in SDP and in chunking. However, in the other three tasks, the retrained models perform worse than ACE. A possible reason is that in ACE the task model at each step is initialized with the trained model of the previous step, whereas retraining starts from scratch. The retrained models outperform All in all tasks, which shows the effectiveness of the searched embedding concatenations.

B.4 Effect of Embeddings in the Searched Embedding Concatenations

There is no clear conclusion on which concatenation of embeddings is helpful for most tasks. We therefore analyze the best embedding concatenations searched by ACE over different structured outputs, semantic/syntactic types, and monolingual/multilingual tasks. The percentage of each embedding selected in the best concatenations over all ACE experiments is shown in Table 13. The best embedding concatenation varies with the output structure, the syntactic or semantic level of understanding, and the language, which shows that it is essential to select embeddings for each kind of task separately. However, we also find that certain embeddings are strong in specific settings. Comparing sequence-structured and graph-structured tasks, we find that M-BERT and ELMo are frequently selected only in sequence-structured tasks, while XLM-R embeddings are always selected in graph-structured tasks. For Flair embeddings, the forward and backward models are selected evenly; we suspect that one direction of Flair embeddings is strong enough, so concatenating the embeddings of both directions cannot further improve accuracy. For non-contextualized embeddings, pretrained word embeddings are frequently selected in sequence-structured tasks, while character embeddings are not. Digging deeper into the semantic and syntactic types of these two structured outputs, we find that BERT embeddings appear in all best concatenations for syntactic sequence-structured tasks, and that Flair, M-Flair, word, and XLM-R embeddings appear in all best concatenations for syntactic graph-structured tasks. In multilingual tasks, all best concatenations for multilingual NER select M-BERT embeddings, while M-BERT is rarely selected in multilingual AE tasks. The monolingual Flair embeddings are always selected in NER tasks, and XLM-R is selected more frequently in multilingual tasks than in monolingual sequence-structured tasks (SS).
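The percentages in Table 13 can be computed directly from the selection decisions; the following sketch uses made-up 0/1 selection vectors for illustration:

import numpy as np

candidates = ["BERT", "M-BERT", "Char", "ELMo", "F-fw", "F-bw",
              "MF-fw", "MF-bw", "Word", "XLM-R"]
# One 0/1 row per best concatenation found by ACE (hypothetical values).
selections = np.array([
    [1, 1, 0, 1, 1, 0, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1, 1, 1],
])
for name, pct in zip(candidates, selections.mean(axis=0)):
    print(f"{name}: {pct:.2f}")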
NER POS
de de (Revised) en es nl Ritter ARK TB-v2
BERT+Fine-tune 76.9 79.4 89.2 83.3 83.8 91.2 91.7 94.4
MBERT+Fine-tune 81.6 86.7 92.0 87.1 87.2 90.8 91.5 93.9
XLM-R+Fine-tune 87.7 91.4 94.1 89.3 95.3 92.3 93.7 95.4
RoBERTa+Fine-tune - - 93.9 - - 92.0 93.9 95.4
XLNET+Fine-tune - - 93.6 - - 88.4 92.4 94.4
ACE+Fine-tune 88.3 91.7 94.6 95.9 95.7 93.4 94.4 95.8

Table 9: A comparison between ACE and the fine-tuned embeddings that are used in ACE for NER and POS
tagging.

Chunk AE
CoNLL 2000 14Lap 14Res 15Res 16Res es nl ru tr
BERT+Fine-tune 96.7 81.2 87.7 71.8 73.9 76.9 73.1 64.3 75.6
MBERT+Fine-tune 96.6 83.5 85.0 69.5 73.6 74.5 72.6 71.6 58.8
XLM-R+Fine-tune 97.0 85.9 90.5 76.4 78.9 77.0 77.6 77.7 74.1
RoBERTa+Fine-tune 97.2 83.9 90.2 78.5 80.7 - - - -
XLNET+Fine-tune 97.1 84.5 88.9 72.8 73.4 - - - -
ACE+Fine-tune 97.3 87.4 92.0 80.3 81.3 79.9 80.5 79.4 81.9

Table 10: A comparison between ACE and the fine-tuned embeddings we used in ACE for chunking and AE.

DP SDP
PTB DM PAS PSD
UAS LAS ID OOD ID OOD ID OOD
BERT+Fine-tune 96.6 95.1 94.4 91.4 94.4 93.0 82.0 81.3
MBERT+Fine-tune 96.5 94.9 93.9 90.4 93.9 92.1 81.2 80.0
XLM-R+Fine-tune 96.7 95.4 94.2 90.4 94.6 93.2 82.9 81.7
RoBERTa+Fine-tune 96.9 95.6 93.0 89.3 94.3 92.8 82.0 80.6
XLNET+Fine-tune 97.0 95.6 94.2 90.6 94.8 93.4 82.7 81.8
ACE+Fine-tune 97.2 95.7 95.6 92.6 95.8 94.6 83.8 83.4

Table 11: A comparison between ACE and the fine-tuned embeddings that are used in ACE for DP and SDP.

NER POS Chunk AE DP-UAS DP-LAS SDP-ID SDP-OOD


All 92.4 90.6 96.7 73.2 96.7 95.1 94.3 90.8
Retrain 92.6 90.8 96.8 73.6 96.8 95.2 94.5 90.9
ACE 93.0 91.7 96.8 75.6 96.9 95.3 94.5 90.9

Table 12: A comparison among retrained models, All, and ACE. We use one dataset for each task.

BERT M-BERT Char ELMo F F-bw F-fw MF MF-bw MF-fw Word XLM-R
SS 0.81 0.74 0.37 0.85 0.70 0.48 0.59 0.78 0.59 0.41 0.81 0.70
GS 0.75 0.17 0.50 0.25 0.83 0.75 0.42 0.83 0.58 0.58 0.50 1.00
Sem. SS 0.67 0.73 0.40 0.80 0.60 0.40 0.53 0.87 0.60 0.53 0.80 0.60
Syn. SS 1.00 0.75 0.33 0.92 0.83 0.58 0.67 0.67 0.58 0.25 0.83 0.83
Sem. GS 0.78 0.22 0.67 0.33 0.78 0.67 0.56 0.78 0.56 0.67 0.33 1.00
Syn. GS 0.67 0.00 0.00 0.00 1.00 1.00 0.00 1.00 0.67 0.33 1.00 1.00
M-NER 0.67 1.00 0.56 0.83 1.00 0.78 1.00 0.89 0.78 0.44 0.78 0.89
M-AE 1.00 0.33 0.75 0.33 0.58 0.42 0.42 0.75 0.25 0.75 0.50 0.92

Table 13: The percentage of each embedding candidate selected in the best concatenations from ACE. F and MF are the monolingual and multilingual Flair embeddings; we count these two embeddings as selected if either the forward or backward (fw/bw) direction of Flair is selected in the concatenation, and we count the Word embedding as selected if either the fastText or GloVe embedding is selected. SS: sequence-structured tasks. GS: graph-structured tasks. Sem.: semantic-level tasks. Syn.: syntactic-level tasks. M-NER: multilingual NER tasks. M-AE: multilingual AE tasks. We only use English datasets in SS and GS; English datasets are removed for M-NER and M-AE.
