Automated Concatenation of Embeddings for Structured Prediction (2021)
and evolutionary algorithms are the most common approaches. In reinforcement learning, the agent's actions are the generation of neural architectures and the action space is identical to the search space. Previous work usually applies an RNN layer (Zoph and Le, 2017; Zhong et al., 2018; Zoph et al., 2018) or a Markov Decision Process (Baker et al., 2017) to decide the hyper-parameters of each structure and the input order of each structure. Evolutionary algorithms have been applied to architecture search for many decades (Miller et al., 1989; Angeline et al., 1994; Stanley and Miikkulainen, 2002; Floreano et al., 2008; Jozefowicz et al., 2015). The algorithm repeatedly generates new populations through recombination and mutation operations and selects survivors through competition among the population. Recent work with evolutionary algorithms differs in the methods for parent/survivor selection and population generation. For example, Real et al. (2017), Liu et al. (2018a), Wistuba (2018) and Real et al. (2019) applied tournament selection (Goldberg and Deb, 1991) for parent selection, while Xie and Yuille (2017) keep all parents. Suganuma et al. (2017) and Elsken et al. (2018) chose the best model as the survivor, while Real et al. (2019) chose the several latest models as survivors.

where Y(x) represents all possible output structures given the input sentence x. Depending on the structured prediction task, the output structure y can be a label sequence, a tree, a graph or another structure. In this paper, we use sequence-structured and graph-structured outputs as two exemplar structured prediction tasks. We use the BiLSTM-CRF model (Ma and Hovy, 2016; Lample et al., 2016) for sequence-structured outputs and the BiLSTM-Biaffine model (Dozat and Manning, 2017) for graph-structured outputs:

P^seq(y|x) = BiLSTM-CRF(V, y)
P^graph(y|x) = BiLSTM-Biaffine(V, y)

where V = [v_1; · · · ; v_n], V ∈ R^{d×n}, is the matrix of word representations for the input sentence x with n words, and d is the hidden size of the concatenation of all embeddings. The word representation v_i of the i-th word is a concatenation of L types of word embeddings:

v_i^l = embed_i^l(x);   v_i = [v_i^1; v_i^2; . . . ; v_i^L]

where embed^l is the model of the l-th embedding type, v_i ∈ R^d, v_i^l ∈ R^{d_l}, and d_l is the hidden size of embed^l.
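For concreteness, a minimal numpy sketch of this concatenation; the toy dimensions and random matrices below are illustrative stand-ins for the actual embedder outputs, not part of the paper:

```python
import numpy as np

# Toy stand-ins for the outputs of L = 3 embedding models on a 5-word
# sentence; the hidden sizes (4, 3, 6) are arbitrary.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(5, d_l)) for d_l in (4, 3, 6)]

# v_i = [v_i^1; ...; v_i^L]: concatenating along the feature axis gives the
# matrix V of word representations with total hidden size d = 4 + 3 + 6 = 13.
V = np.concatenate(embeddings, axis=-1)
print(V.shape)  # (5, 13)
```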
Figure 1: The main paradigm of our approach is shown in the middle, where an example of reward function is represented in the left and an example of a concatenation action is shown in the right. [Figure omitted: the panels show the reward R, the task model, and a binary choice over embedding candidates such as Flair, ELMo and BERT.]
where r^t is a vector of length L representing the reward of each embedding candidate, and R_t and R_i are the rewards at time steps t and i. When the Hamming distance Hamm(a^t, a^i) between two concatenations gets larger, the changed candidates' contribution to the accuracy becomes less noticeable, so the controller may be misled into rewarding a candidate that is not actually helpful. To alleviate this issue, we apply a discount factor that reduces the reward for two concatenations with a large Hamming distance. Our final reward function is:

r^t = Σ_{i=1}^{t−1} (R_t − R_i) · γ^{Hamm(a^t, a^i) − 1} · |a^t − a^i|    (6)

where γ ∈ (0, 1). Eq. 4 is then reformulated as:

∇_θ J_t(θ) ≈ Σ_{l=1}^{L} ∇_θ log P_l^{ctrl}(a_l^t; θ_l) · r_l^t    (7)
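As a concrete reading of Eq. 6, the following minimal numpy sketch accumulates the per-candidate reward over the history of sampled concatenations; the function and argument names are illustrative, not from the paper:

```python
import numpy as np

def reward(a_t, R_t, history, gamma=0.5):
    """Per-candidate reward vector r^t of Eq. 6.

    a_t     -- current concatenation, binary vector of length L
    R_t     -- development accuracy obtained with a_t
    history -- list of (a_i, R_i) pairs from previous time steps
    gamma   -- discount factor in (0, 1); 0.5 is the value reported in Appendix A
    """
    a_t = np.asarray(a_t, dtype=float)
    r_t = np.zeros_like(a_t)
    for a_i, R_i in history:
        changed = np.abs(a_t - np.asarray(a_i, dtype=float))  # |a^t - a^i|
        hamm = changed.sum()                                   # Hamm(a^t, a^i)
        r_t += (R_t - R_i) * gamma ** (hamm - 1) * changed
    return r_t
```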
3.4 Training

To train the controller, we use a dictionary D to store the concatenations and the corresponding validation scores. At t = 1, we train the task model with all embedding candidates concatenated. From t = 2, we repeat the following steps until a maximum iteration T:

1. Sample a concatenation a^t based on the probability distribution in Eq. 3.

2. Train the task model with a^t following Eq. 1 and evaluate the model on the development set to get the accuracy R_t.

3. Given the concatenation a^t, the accuracy R_t and D, compute the gradient of the controller following Eq. 7 and update the parameters of the controller.

4. Add a^t and R_t to D and set t = t + 1.
When sampling a^t, we avoid selecting the previous concatenation a^{t−1} and the all-zero vector (i.e., selecting no embedding). If a^t is already in the dictionary D, we compare R_t with the value in the dictionary and keep the higher one.
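The following is a minimal, self-contained sketch of this search loop. The controller of Eq. 3 is not shown in this excerpt, so the Bernoulli-with-sigmoid parameterization below, and all names, are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(theta, forbidden):
    """Sample a binary concatenation from independent Bernoulli choices
    (a stand-in for Eq. 3), rejecting the all-zero vector and forbidden actions."""
    probs = 1.0 / (1.0 + np.exp(-theta))                 # sigmoid per candidate
    while True:
        a = (rng.random(len(theta)) < probs).astype(int)
        if a.any() and tuple(a) not in forbidden:
            return a

def search(train_and_eval, L, T=30, lr=0.1, gamma=0.5):
    """train_and_eval(a) trains the task model on concatenation a and
    returns its development accuracy."""
    theta = np.zeros(L)                                  # controller params start at 0
    a_prev = np.ones(L, dtype=int)                       # t = 1: use all candidates
    D = {tuple(a_prev): train_and_eval(a_prev)}          # concatenation -> accuracy
    for _ in range(2, T + 1):
        a_t = sample_action(theta, forbidden={tuple(a_prev)})
        R_t = train_and_eval(a_t)
        # Eq. 6: per-candidate reward accumulated over the history in D.
        r_t = np.zeros(L)
        for key, R_i in D.items():
            changed = np.abs(a_t - np.array(key))
            r_t += (R_t - R_i) * gamma ** (changed.sum() - 1) * changed
        # Eq. 7, REINFORCE-style: d/dtheta log P(a) = a - sigmoid(theta)
        probs = 1.0 / (1.0 + np.exp(-theta))
        theta += lr * (a_t - probs) * r_t
        D[tuple(a_t)] = max(R_t, D.get(tuple(a_t), R_t)) # keep the higher score
        a_prev = a_t
    return max(D, key=D.get), D
```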
4 Experiments

We use ISO 639-1 language codes to represent languages in the tables (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

4.1 Datasets and Configurations

To show ACE's effectiveness, we conduct extensive experiments on a variety of structured prediction tasks ranging from syntactic to semantic tasks. The tasks are named entity recognition (NER), Part-Of-Speech (POS) tagging, Chunking, Aspect Extraction (AE), Syntactic Dependency Parsing (DP) and Semantic Dependency Parsing (SDP). The details of the 6 structured prediction tasks in our experiments are as follows:

• NER: We use the corpora of 4 languages from the CoNLL 2002 and 2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) with the standard splits.

• POS Tagging: We use three datasets: Ritter11-T-POS (Ritter et al., 2011), ARK-Twitter (Gimpel et al., 2011; Owoputi et al., 2013) and Tweebank-v2 (Liu et al., 2018b) (Ritter, ARK and TB-v2 for short). We follow the dataset splits of Nguyen et al. (2020).

• Chunking: We use CoNLL 2000 (Tjong Kim Sang and Buchholz, 2000) for chunking. Since there is no standard development set for the CoNLL 2000 dataset, we split 10% of the training data as the development set.

• Aspect Extraction: Aspect extraction is a subtask of aspect-based sentiment analysis (Pontiki et al., 2014, 2015, 2016). The datasets are from the laptop and restaurant domains of SemEval 14, the restaurant domain of SemEval 15 and the restaurant domain of SemEval 16 (14Lap, 14Res, 15Res and 16Res for short). Additionally, we use another 4 languages from the restaurant domain of SemEval 16 to test our approach in multiple languages. We randomly split 10% of the training data as the development set following Li et al. (2019).
• Syntactic Dependency Parsing: We use Penn Tree Bank (PTB) 3.0 with the same dataset pre-processing as Ma et al. (2018).

• Semantic Dependency Parsing: We use the DM, PAS and PSD datasets for semantic dependency parsing (Oepen et al., 2014) from the SemEval 2015 shared task (Oepen et al., 2015). The three datasets contain the same sentences annotated in different formalisms. We use the standard split for SDP, which provides an in-domain test set and an out-of-domain test set for each dataset.

Among these tasks, NER, POS tagging, chunking and aspect extraction have sequence-structured outputs, while dependency parsing and semantic dependency parsing have graph-structured outputs. POS tagging, chunking and DP are syntactic structured prediction tasks, while NER, AE and SDP are semantic structured prediction tasks.

We train the controller for 30 steps and save the task model with the highest accuracy on the development set as the final model for testing. Please refer to Appendix A for more details of other settings.

              NER                            POS                     AE
              de     en     es     nl        Ritter  ARK    TB-v2   14Lap  14Res  15Res  16Res  es     nl     ru     tr
  All         83.1   92.4   88.9   89.8      90.6    92.1   94.6    82.7   88.5   74.2   73.2   74.6   75.0   67.1   67.5
  Random      84.0   92.6   88.8   91.9      91.3    92.6   94.6    83.6   88.1   73.5   74.7   75.0   73.6   68.0   70.0
  ACE         84.2   93.0   88.9   92.1      91.7    92.8   94.8    83.9   88.6   74.9   75.6   75.7   75.3   70.6   71.1

              Chunk         DP              SDP                                                       Avg
              CoNLL 2000    UAS    LAS      DM-ID  DM-OOD  PAS-ID  PAS-OOD  PSD-ID  PSD-OOD
  All         96.7          96.7   95.1     94.3   90.8    94.6    92.9     82.4    81.7             85.3
  Random      96.7          96.8   95.2     94.4   90.8    94.6    93.0     82.3    81.8             85.7
  ACE         96.8          96.9   95.3     94.5   90.9    94.5    93.1     82.5    82.1             86.2

Table 1: Comparison with concatenating all embeddings and random search baselines on 6 tasks.

4.2 Embeddings

Basic Settings: For the embedding candidates on English datasets, we use the language-specific models for ELMo, Flair, base BERT, GloVe word embeddings, fastText word embeddings, non-contextual character embeddings (Lample et al., 2016), multilingual Flair (M-Flair), M-BERT and XLM-R embeddings. The size of the search space in our experiments is 2^11 − 1 = 2047 (Flair embeddings have two models, forward and backward, for each language). For language-specific models of other languages, please refer to Appendix A for more details. In AE, there are no available Russian-specific BERT, Flair and ELMo embeddings and no available Turkish-specific Flair and ELMo embeddings; we use the corresponding English embeddings instead so that the search spaces of these datasets are almost identical to those of the other datasets. All embeddings are fixed during training, except that the character embeddings are trained with the task. The empirical results are reported in Section 4.3.1.

Embedding Fine-tuning: A usual approach to obtaining better accuracy is fine-tuning transformer-based embeddings. In sequence labeling, most work follows the fine-tuning pipeline of BERT, which connects the BERT model to a linear layer for word-level classification. However, when multiple embeddings are concatenated, fine-tuning a specific group of embeddings becomes difficult because of complicated hyper-parameter settings and massive GPU memory consumption. To alleviate this problem, we first fine-tune the transformer-based embeddings on the task and then concatenate these embeddings with the other embeddings in the basic setting to apply ACE. The empirical results are reported in Section 4.3.2.
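As an illustration of the second step of this pipeline (treating an already fine-tuned transformer as a fixed embedding candidate), here is a hedged sketch with the HuggingFace transformers API; the checkpoint path is hypothetical and subword-to-word alignment is omitted:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical directory holding a transformer already fine-tuned on the task
# (step 1 of the pipeline described above).
finetuned_dir = "path/to/finetuned-bert"

tokenizer = AutoTokenizer.from_pretrained(finetuned_dir)
encoder = AutoModel.from_pretrained(finetuned_dir)

# Step 2: freeze the fine-tuned encoder so it acts as a fixed embedding
# candidate during the ACE search.
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    batch = tokenizer(["ACE concatenates embeddings ."], return_tensors="pt")
    candidate = encoder(**batch).last_hidden_state   # (1, seq_len, hidden_size)
```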
4.3 Results

We use the following abbreviations in our experiments: UAS: Unlabeled Attachment Score; LAS: Labeled Attachment Score; ID: in-domain test set; OOD: out-of-domain test set. We use language codes for languages in NER and AE.
4.3.1 Comparison With Baselines

To show the effectiveness of our approach, we compare it with two strong baselines. For the first one, we let the task model learn by itself the contribution of each embedding candidate that is helpful to the task. We set a to all-ones (i.e., the concatenation of all the embeddings) and train the task model (All). The linear layer weight W in Eq. 2 reflects the contribution of each candidate. For the second one, we use random search (Random), a strong baseline in NAS (Li and Talwalkar, 2020). For Random, we run the same maximum number of iterations as in ACE. For these experiments, we report the averaged accuracy of 3 runs. Table 1 shows that ACE outperforms both baselines on the 6 tasks over 23 test sets with only two exceptions. Comparing Random with All, Random outperforms All by 0.4 on average and surpasses the accuracy of All on 14 out of 23 test sets, which shows that concatenating all embeddings may not be the best solution for most structured prediction tasks. In general, searching for the concatenation for the word representation is essential in most cases, and our search design usually leads to better results than both baselines.

4.3.2 Comparison With State-of-the-Art Approaches

As we have shown, ACE has an advantage in searching for better embedding concatenations. We further show that ACE is competitive with or even stronger than state-of-the-art approaches. We additionally use XLNet (Yang et al., 2019) and RoBERTa as candidates of ACE. In some tasks, we have several additional settings to better compare with previous work. In NER, we also conduct a comparison on the revised version of the German datasets in the CoNLL 2006 shared task (Buchholz and Marsi, 2006). Recent work such as Yu et al. (2020) and Yamada et al. (2020) utilizes document contexts in the datasets. We follow their work and extract document embeddings for the transformer-based embeddings. Specifically, we follow the fine-tuning process of Yamada et al. (2020) to fine-tune the transformer-based embeddings over the document, except for BERT and M-BERT embeddings. For BERT and M-BERT, we follow the document extraction process of Yu et al. (2020), because we find that the model with such document embeddings is significantly stronger than the model trained with the fine-tuning process of Yamada et al. (2020). In SDP, the state-of-the-art approaches used POS tags and lemmas as additional word features to the network. We add these two features to the embedding candidates and train the embeddings together with the task. We use the fine-tuned transformer-based embeddings on each task instead of the pretrained versions of these embeddings as the candidates (please refer to the Appendix for more details about the embeddings).

We additionally compare with a fine-tuned XLM-R model for NER, POS tagging, chunking and AE, and with a fine-tuned XLNet model for DP and SDP, which are strong fine-tuned models in most of the experiments. Results are shown in Tables 2, 3 and 4. They show that ACE with fine-tuned embeddings achieves state-of-the-art performance on all test sets, which shows that finding a good embedding concatenation helps structured prediction tasks. We also find that ACE is stronger than the fine-tuned models, which shows the effectiveness of concatenating the fine-tuned embeddings (we compare ACE with other fine-tuned embeddings in the Appendix).

5 Analysis

5.1 Efficiency of Search Methods

To show how efficient our approach is compared with the random search algorithm, we compare the two algorithms in two aspects on the CoNLL English NER dataset. The first aspect is the best development accuracy during training. The left part of Figure 2 shows that ACE is consistently stronger than the random search algorithm on this task. The second aspect is the searched concatenation at each time step. The right part of Figure 2 shows that the accuracy of ACE gradually increases and becomes stable as more concatenations are sampled.

5.2 Ablation Study on Reward Function Design

To show the effectiveness of the designed reward function, we compare our reward function (Eq. 6) with the reward function without the discount factor (Eq. 5) and the traditional reward function (the reward term in Eq. 4). We sample 2000 training sentences from the CoNLL English NER dataset for faster training and train the controller for 50 steps. Table 5 shows that both the discount factor and the binary vector |a^t − a^i| are helpful on both the development and test sets.
NER
                           de     de06   en     es     nl
  Baevski et al. (2019)    -      -      93.5   -      -
  Straková et al. (2019)   85.1   -      93.4   88.8   92.7
  Yu et al. (2020)         86.4   90.3   93.5   90.3   93.7
  Yamada et al. (2020)     -      -      94.3   -      -
  XLM-R+Fine-tune          87.7   91.4   94.1   89.3   95.3
  ACE+Fine-tune            88.3   91.7   94.6   95.9   95.7

POS
                           Ritter  ARK    TB-v2
  Owoputi et al. (2013)    90.4    93.2   94.6
  Gui et al. (2017)        90.9    -      92.8
  Gui et al. (2018)        91.2    92.4   -
  Nguyen et al. (2020)     90.1    94.1   95.2
  XLM-R+Fine-tune          92.3    93.7   95.4
  ACE+Fine-tune            93.4    94.4   95.8

Table 2: Comparison with state-of-the-art approaches in NER and POS tagging. †: Models are trained on both the train and development sets.
Chunk
                           CoNLL 2000
  Akbik et al. (2018)      96.7
  Clark et al. (2018)      97.0
  Liu et al. (2019b)       97.3
  Chen et al. (2020)       95.5
  XLM-R+Fine-tune          97.0
  ACE+Fine-tune            97.3

AE
                           14Lap  14Res  15Res  16Res  es     nl     ru     tr
  Xu et al. (2018)†        84.2   84.6   72.0   75.4   -      -      -      -
  Xu et al. (2019)         84.3   -      -      78.0   -      -      -      -
  Wang et al. (2020a)      -      -      -      72.8   74.3   72.9   71.8   59.3
  Wei et al. (2020)        82.7   87.1   72.7   77.7   -      -      -      -
  XLM-R+Fine-tune          85.9   90.5   76.4   78.9   77.0   77.6   77.7   74.1
  ACE+Fine-tune            87.4   92.0   80.3   81.3   79.9   80.5   79.4   81.9

Table 3: Comparison with state-of-the-art approaches in chunking and aspect extraction. †: We report the results reproduced by Wei et al. (2020).
Figure 2: Comparing the efficiency of random search (Random) and ACE. The x-axis is the number of time steps. The left y-axis is the averaged best validation accuracy on the CoNLL English NER dataset. The right y-axis is the averaged validation accuracy of the current selection. [Plot omitted: two panels, "Best Accuracy" and "Sample Accuracy", each comparing ACE and Random over 30 time steps.]
5.3 Comparison with Embedding Weighting & Ensemble Approaches

We compare ACE with two more approaches to further show its effectiveness. The first is a variant of All, which uses a weighting parameter b = [b_1, · · · , b_l, · · · , b_L] passed through a sigmoid function to weight each embedding candidate. Such an approach explicitly learns the weight of each embedding during training instead of a binary mask. We call this approach All+Weight.
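A minimal PyTorch sketch of this weighted concatenation as we read the description; the module and argument names are illustrative, not from the paper:

```python
import torch
from torch import nn

class WeightedConcat(nn.Module):
    """Sketch of the All+Weight variant: each candidate embedding is scaled by
    sigmoid(b_l) before concatenation, so the weights are learned jointly with
    the task model instead of being a binary selection mask."""

    def __init__(self, num_candidates: int):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(num_candidates))  # one scalar per candidate

    def forward(self, embeddings):
        # embeddings: list of (batch, n_words, d_l) tensors, one per candidate
        weights = torch.sigmoid(self.b)
        return torch.cat([w * e for w, e in zip(weights, embeddings)], dim=-1)
```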
The second is model ensemble, which trains the task model with each embedding candidate individually and uses the trained models to make a joint prediction on the test set. We use voting for the ensemble as it is simple and fast. For sequence labeling tasks, the models vote for the predicted label at each position. For DP, the models vote for the tree of each sentence. For SDP, the models vote for each potential labeled arc. We use the confidence of the model predictions to break ties when more than one label receives the same number of votes. We call this approach Ensemble. One benefit of voting is that it combines the predictions of the task models efficiently without any training process. We can search all possible 2^L − 1 model ensembles in a short period of time by caching the outputs of the models. Therefore, we search for the best ensemble of models on the development set and then evaluate that ensemble on the test set (Ensemble_dev). Moreover, we additionally search for the best ensemble on the test set for reference (Ensemble_test), which is the upper bound of the approach. We use the same setting as in Section 4.3.1 and select one dataset from each task. For NER, POS tagging, AE, and SDP, we use the CoNLL 2003 English, Ritter, 16Res, and DM datasets, respectively. The results are shown in Table 6. Empirical results show that ACE outperforms these approaches.
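Because every individual model's predictions can be cached once, enumerating all 2^L − 1 ensembles reduces to re-running the vote. A sketch of that search, using token-level majority voting and plain accuracy as a stand-in metric (the paper uses the task metric and confidence-based tie-breaking); all names are illustrative:

```python
from itertools import combinations
from collections import Counter
import numpy as np

def vote(predictions):
    """Majority vote over cached label predictions.
    predictions: list of arrays, each of shape (n_tokens,) with label ids."""
    stacked = np.stack(predictions)                 # (n_models, n_tokens)
    voted = [Counter(column).most_common(1)[0][0] for column in stacked.T]
    return np.array(voted)

def best_ensemble(cached_preds, gold):
    """Enumerate all 2^L - 1 non-empty model subsets and return the subset
    whose voted prediction scores highest against `gold`."""
    L = len(cached_preds)
    best_subset, best_score = None, -1.0
    for k in range(1, L + 1):
        for subset in combinations(range(L), k):
            voted = vote([cached_preds[i] for i in subset])
            score = float((voted == gold).mean())
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```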
DP (PTB)
                           UAS    LAS
  Zhou and Zhao (2019)†    97.2   95.7
  Mrini et al. (2020)†     97.4   96.3
  Li et al. (2020)         96.6   94.8
  Zhang et al. (2020)      96.1   94.5
  Wang and Tu (2020)       96.9   95.3
  XLNet+Fine-tune          97.0   95.6
  ACE+Fine-tune            97.2   95.8

SDP
                           DM-ID  DM-OOD  PAS-ID  PAS-OOD  PSD-ID  PSD-OOD
  He and Choi (2020)‡      94.6   90.8    96.1    94.4     86.8    79.5
  D & M (2018)             93.7   88.9    93.9    90.6     81.0    79.4
  Wang et al. (2019)       94.0   89.7    94.1    91.3     81.4    79.6
  Jia et al. (2020)        93.6   89.1    -       -        -       -
  F & G (2020)             94.4   91.0    95.1    93.4     82.6    82.0
  XLNet+Fine-tune          94.2   90.6    94.8    93.4     82.7    81.8
  ACE+Fine-tune            95.6   92.6    95.8    94.6     83.8    83.4

Table 4: Comparison with state-of-the-art approaches in DP and SDP. †: For reference, they additionally used constituency dependencies in training. We also find that the PTB dataset used by Mrini et al. (2020) is not identical to the dataset in previous work such as Zhang et al. (2020) and Wang and Tu (2020). ‡: For reference, we confirmed with the authors of He and Choi (2020) that they used a different data pre-processing script from previous work.
References

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Geoffrey Miller, Peter Todd, and Shailesh Hegde. 1989. Designing neural networks using genetic algorithms. In 3rd International Conference on Genetic Algorithms, pages 379–384.

Khalil Mrini, Franck Dernoncourt, Quan Hung Tran, Trung Bui, Walter Chang, and Ndapa Nakashole. 2020. Rethinking self-attention: Towards interpretability in neural parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020.

Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018a. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pages 4095–4104.

Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. 2018b. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30, San Diego, California. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789.

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. 2017. Large-scale evolution of image classifiers. In International Conference on Machine Learning, pages 2902–2911.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Edinburgh, Scotland, UK. Association for Computational Linguistics.

David R So, Chen Liang, and Quoc V Le. 2019. The evolved transformer. In International Conference on Machine Learning.

Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127.

Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy. Association for Computational Linguistics.

Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. 2017. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 497–504.

Beth M. Sundheim. 1995. Named entity task definition, version 2.1. In Proceedings of the Sixth Message Understanding Conference, pages 319–332.

Richard S Sutton and Andrew G Barto. 1992. Reinforcement Learning: An Introduction. MIT Press.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Editions Klincksieck.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop.

Martin Wistuba. 2018. Deep learning architecture search by neuro-cell-based evolution with function-preserving mutations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 243–258. Springer.

Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. 2018. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423–2432.
Junru Zhou and Hai Zhao. 2019. Head-Driven Phrase Structure Grammar parsing on Penn Treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2396–2408, Florence, Italy. Association for Computational Linguistics.

Wei Zhu, Xiaoling Wang, Xipeng Qiu, Yuan Ni, and Guotong Xie. 2020. Autotrans: Automating transformer design via reinforced architecture search. arXiv preprint arXiv:2009.02070.

Barret Zoph and Quoc V Le. 2017. Neural architecture search with reinforcement learning. In International Conference on Learning Representations.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710.

A Detailed Configurations

Evaluation: To evaluate our models, we use the F1 score for NER, chunking and AE, accuracy for POS tagging, unlabeled attachment score (UAS) and labeled attachment score (LAS) for DP, and labeled F1 score for SDP.

Task Models and Controller: For sequence-structured tasks (i.e., NER, POS tagging, chunking and aspect extraction), we use a batch size of 32 sentences and an SGD optimizer with a learning rate of 0.1. We anneal the learning rate by 0.5 when there is no accuracy improvement on the development set for 5 epochs. We set the maximum number of training epochs to 150. For graph-structured tasks (i.e., DP and SDP), we use Adam (Kingma and Ba, 2015) to optimize the model with a learning rate of 0.002. We anneal the learning rate by 0.75 every 5000 iterations following Dozat and Manning (2017). We set the maximum number of training epochs to 300. For DP, we run the maximum spanning tree algorithm (McDonald et al., 2005) to output valid trees at test time. We fix the hyper-parameters of the task models.

We tune the learning rate for the controller among {0.1, 0.2, 0.3, 0.4, 0.5} and the discount factor among {0.1, 0.3, 0.5, 0.7, 0.9} on the same dataset as in Section 5.2. We search for these hyper-parameters through grid search and find that a learning rate of 0.1 and a discount factor of 0.5 perform best on the development set. The controller's parameters are initialized to all zeros so that each candidate is selected evenly in the first two time steps. We use Stochastic Gradient Descent (SGD) to optimize the controller. The training time depends on the task and dataset size. Taking the CoNLL English NER dataset as an example, it takes 45 GPU hours to train the controller for 30 steps on a single Tesla P100 GPU, which is an acceptable training time in practice.
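For concreteness, the task-model optimizer and annealing schedules described above map onto standard PyTorch schedulers roughly as follows; the placeholder model and the specific scheduler classes are our choice for illustration, not necessarily the authors' implementation:

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(868, 17)   # placeholder for the task model (toy sizes)

# Sequence-structured tasks: SGD with lr 0.1, halved after 5 epochs without
# development-set improvement (ReduceLROnPlateau mirrors this rule).
seq_optimizer = optim.SGD(model.parameters(), lr=0.1)
seq_scheduler = lr_scheduler.ReduceLROnPlateau(seq_optimizer, mode="max",
                                               factor=0.5, patience=5)
# call seq_scheduler.step(dev_accuracy) once per epoch

# Graph-structured tasks: Adam with lr 0.002, multiplied by 0.75 every
# 5000 iterations (StepLR stepped once per iteration reproduces this schedule).
graph_optimizer = optim.Adam(model.parameters(), lr=0.002)
graph_scheduler = lr_scheduler.StepLR(graph_optimizer, step_size=5000, gamma=0.75)
# call graph_scheduler.step() once per training iteration
```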
vlin et al., 2019; Yu et al., 2020; Yamada et al.,
Evaluation To evaluate our models, We use F1 2020). However, there are a lot of application sce-
score to evaluate NER, Chunking and AE, use ac- narios that document contexts are unavailable. We
curacy to evaluate POS Tagging, use unlabeled replace the document-level word representations
attachment score (UAS) and labeled attachment from transformer-based embeddings (i.e., XLM-
score (LAS) to evaluate DP, and use labeled F1 R and BERT embeddings) with the sentence-level
score to evaluate SDP. word representations. Results are shown in Table
Task Models and Controller For sequence- 8. We report the test results of All to show how
structured tasks (i.e., NER, POS tagging, chunking, the gap between ACE and All changes with dif-
aspect extraction), we use a batch size of 32 sen- ferent kinds of representations. We report the test
tences and an SGD optimizer with a learning rate of accuracy of the models with the highest develop-
0.1. We anneal the learning rate by 0.5 when there ment accuracy following Yamada et al. (2020) for
is no accuracy improvement on the development a fair comparison. Empirical results show that the
set for 5 epochs. We set the maximum training document-level representations can significantly
epoch to 150. For graph-structured tasks (i.e., DP improve the accuracy of ACE. Comparing with
and SDP), we use Adam (Kingma and Ba, 2015) models with sentence-level representations, the av-
to optimize the model with a learning rate of 0.002. eraged accuracy gap between ACE and All is en-
We anneal the learning rate by 0.75 for every 5000 hanced from 0.7 to 1.7 with document-level repre-
iterations following Dozat and Manning (2017). sentations, which shows that the advantage of ACE
We set the maximum training epoch to 300. For becomes stronger with document-level representa-
DP, we run the maximum spanning tree (McDon- tions.
ald et al., 2005) algorithm to output valid trees in
testing. We fix the hyper-parameters of the task B.2 Fine-tuned Models Versus ACE
models. To fine-tune the embeddings, we use AdamW
We tune the learning rate for the controller (Loshchilov and Hutter, 2018) optimizer with a
among {0.1, 0.2, 0.3, 0.4, 0.5} and the discount learning rate of 5 × 10−6 and trained the contex-
factor among {0.1, 0.3, 0.5, 0.7, 0.9} on the same tualized embeddings with the task for 10 epochs.
dataset in Section 5.2. We search for the hyper- We use a batch size of 32 for BERT, M-BERT and
parameter through grid search and find a learning use a batch size of 4 for XLM-R, RoBERTa and
rate of 0.1 and a discount factor of 0.5 performs XLNet. A comparison between ACE and the fine-
the best on the development set. The controller’s tuned embeddings that we used in ACE is shown
parameters are initialized to all 0 so that each can- in Table 9, 10. Results show that ACE can further
didate is selected evenly in the first two time steps. improve the accuracy of fine-tuned models.
Embedding    Resource    URL
GloVe Pennington et al. (2014) nlp.stanford.edu/projects/glove
fastText Bojanowski et al. (2017) github.com/facebookresearch/fastText
ELMo Peters et al. (2018) github.com/allenai/allennlp
ELMo (Other languages) Schuster et al. (2019) github.com/TalSchuster/CrossLingualContextualEmb
BERT Devlin et al. (2019) huggingface.co/bert-base-cased
M-BERT Devlin et al. (2019) huggingface.co/bert-base-multilingual-cased
BERT (Dutch) wietsedv huggingface.co/wietsedv/bert-base-dutch-cased
BERT (German) dbmdz huggingface.co/bert-base-german-dbmdz-cased
BERT (Spanish) dccuchile huggingface.co/dccuchile/bert-base-spanish-wwm-cased
BERT (Turkish) dbmdz huggingface.co/dbmdz/bert-base-turkish-cased
XLM-R Conneau et al. (2020) huggingface.co/xlm-roberta-large
RoBERTa Liu et al. (2019c) huggingface.co/roberta-large
XLNet Yang et al. (2019) huggingface.co/xlnet-large-cased
Table 7: The embeddings we used in our experiments. The URL is where we downloaded the embeddings.
Table 9: A comparison between ACE and the fine-tuned embeddings that are used in ACE for NER and POS
tagging.
                       Chunk         AE
                       CoNLL 2000    14Lap  14Res  15Res  16Res  es     nl     ru     tr
  BERT+Fine-tune       96.7          81.2   87.7   71.8   73.9   76.9   73.1   64.3   75.6
  M-BERT+Fine-tune     96.6          83.5   85.0   69.5   73.6   74.5   72.6   71.6   58.8
  XLM-R+Fine-tune      97.0          85.9   90.5   76.4   78.9   77.0   77.6   77.7   74.1
  RoBERTa+Fine-tune    97.2          83.9   90.2   78.5   80.7   -      -      -      -
  XLNet+Fine-tune      97.1          84.5   88.9   72.8   73.4   -      -      -      -
  ACE+Fine-tune        97.3          87.4   92.0   80.3   81.3   79.9   80.5   79.4   81.9
Table 10: A comparison between ACE and the fine-tuned embeddings we used in ACE for chunking and AE.
                       DP (PTB)       SDP
                       UAS    LAS     DM-ID  DM-OOD  PAS-ID  PAS-OOD  PSD-ID  PSD-OOD
  BERT+Fine-tune       96.6   95.1    94.4   91.4    94.4    93.0     82.0    81.3
  M-BERT+Fine-tune     96.5   94.9    93.9   90.4    93.9    92.1     81.2    80.0
  XLM-R+Fine-tune      96.7   95.4    94.2   90.4    94.6    93.2     82.9    81.7
  RoBERTa+Fine-tune    96.9   95.6    93.0   89.3    94.3    92.8     82.0    80.6
  XLNet+Fine-tune      97.0   95.6    94.2   90.6    94.8    93.4     82.7    81.8
  ACE+Fine-tune        97.2   95.7    95.6   92.6    95.8    94.6     83.8    83.4
Table 11: A comparison between ACE and the fine-tuned embeddings that are used in ACE for DP and SDP.
Table 12: A comparison among retrained models, All and ACE. We use one dataset for each task.
BERT M-BERT Char ELMo F F-bw F-fw MF MF-bw MF-fw Word XLM-R
SS 0.81 0.74 0.37 0.85 0.70 0.48 0.59 0.78 0.59 0.41 0.81 0.70
GS 0.75 0.17 0.50 0.25 0.83 0.75 0.42 0.83 0.58 0.58 0.50 1.00
Sem. SS 0.67 0.73 0.40 0.80 0.60 0.40 0.53 0.87 0.60 0.53 0.80 0.60
Syn. SS 1.00 0.75 0.33 0.92 0.83 0.58 0.67 0.67 0.58 0.25 0.83 0.83
Sem. GS 0.78 0.22 0.67 0.33 0.78 0.67 0.56 0.78 0.56 0.67 0.33 1.00
Syn. GS 0.67 0.00 0.00 0.00 1.00 1.00 0.00 1.00 0.67 0.33 1.00 1.00
M-NER 0.67 1.00 0.56 0.83 1.00 0.78 1.00 0.89 0.78 0.44 0.78 0.89
M-AE 1.00 0.33 0.75 0.33 0.58 0.42 0.42 0.75 0.25 0.75 0.50 0.92
Table 13: The percentage of each embedding candidate selected in the best concatenations from ACE. F and MF are monolingual and multilingual Flair embeddings. We count these two embeddings as selected if one of the forward/backward (fw/bw) directions of Flair is selected in the concatenation. We count the Word embedding as selected if one of the fastText/GloVe embeddings is selected. SS: sequence-structured tasks. GS: graph-structured tasks. Sem.: semantic-level tasks. Syn.: syntactic-level tasks. M-NER: multilingual NER tasks. M-AE: multilingual AE tasks. We only use English datasets in SS and GS. English datasets are removed for M-NER and M-AE.