
What Would Elsa Do?

Freezing Layers During Transformer Fine-Tuning

Jaejun Lee, Raphael Tang, and Jimmy Lin


David R. Cheriton School of Computer Science
University of Waterloo

arXiv:1911.03090v1 [cs.CL] 8 Nov 2019

Abstract

Pretrained transformer-based language models have achieved state of the art across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality. Surprisingly, we also find that fine-tuning all layers does not always help.

1 Introduction

Transformer-based pretrained language models are a battle-tested solution to a plethora of natural language processing tasks. In this paradigm, a transformer-based language model is first trained on copious amounts of text, then fine-tuned on task-specific data. BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) are some of the most well-known ones, representing the current state of the art in natural language inference, question answering, and sentiment classification, to list a few. These models are extremely expressive, consisting of at least a hundred million parameters, a hundred attention heads, and a dozen layers.

An emerging line of work questions the need for such a parameter-loaded model, especially on a single downstream task. Michel et al. (2019), for example, note that only a few attention heads need to be retained in each layer for acceptable effectiveness. Kovaleva et al. (2019) find that, on many tasks, just the last few layers change the most after the fine-tuning process. We take these observations as evidence that only the last few layers necessarily need to be fine-tuned.

The central objective of our paper is, then, to determine how many of the last layers actually need fine-tuning. Why is this an important subject of study? Pragmatically, a reasonable cutoff point saves computational memory across fine-tuning multiple tasks, which bolsters the effectiveness of existing parameter-saving methods (Houlsby et al., 2019). Pedagogically, understanding the relationship between the number of fine-tuned layers and the resulting model quality may guide future works in modeling.

Our research contribution is a comprehensive evaluation, across multiple pretrained transformers and datasets, of the number of final layers needed for fine-tuning. We show that, on most tasks, we need to fine-tune only one fourth of the final layers to achieve within 10% parity with the full model. Surprisingly, on SST-2, a sentiment classification dataset, we find that not fine-tuning all of the layers leads to improved quality.

2 Background and Related Work

2.1 Pretrained Language Models

In the pretrained language modeling paradigm, a language model (LM) is trained on vast amounts of text, then fine-tuned on a specific downstream task. Peters et al. (2018) are one of the first to successfully apply this idea, outperforming the state of the art in question answering, textual entailment, and sentiment classification. Their model, dubbed ELMo, comprises a two-layer BiLSTM pretrained on the Billion Word Corpus (Chelba et al., 2014).

Furthering this approach with more data and improved modeling, Devlin et al. (2019) pretrain deep 12- and 24-layer bidirectional transformers (Vaswani et al., 2017) on the entirety of Wikipedia and BooksCorpus (Zhu et al., 2015). Their approach, called BERT, achieves state of the art across all tasks in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), as well as the Stanford Question Answering Dataset (Rajpurkar et al., 2016).

As a result of this development, a flurry of recent papers has followed this more-data-plus-better-models principle. Two prominent examples include XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), both of which contest the present state of the art. XLNet proposes to pretrain two-stream attention-augmented transformers on an autoregressive LM objective, instead of the original cloze and next sentence prediction (NSP) tasks from BERT. RoBERTa primarily argues for pretraining longer, using more data, and removing the NSP task from BERT.

2.2 Layerwise Interpretability

The prevailing evidence in the neural network literature suggests that earlier layers extract universal features, while later ones perform task-specific modeling. Zeiler and Fergus (2014) visualize the per-layer activations in image classification networks, finding that the first few layers function as corner and edge detectors, and the final layers as class-specific feature extractors. Gatys et al. (2016) demonstrate that the low- and high-level notions of content and style are separable in convolutional neural networks, with lower layers capturing content and higher layers style.

Pretrained transformers. In the NLP literature, similar observations have been made for pretrained language models. Clark et al. (2019) analyze BERT's attention and observe that the bottom layers attend broadly, while the top layers capture linguistic syntax. Kovaleva et al. (2019) find that the last few layers of BERT change the most after task-specific fine-tuning. Similar to our work, Houlsby et al. (2019) fine-tune the top layers of BERT as part of the baseline comparison for their model compression approach. However, none of these studies comprehensively examine the number of necessary final layers across multiple pretrained transformers and datasets.

Model           Embedding   Per-Layer   Output        Total
BERT-BASE       24M (22%)   7M (7%)     0.6M (0.5%)   110M
RoBERTa-BASE    39M (31%)   7M (6%)     0.6M (0.5%)   125M
BERT-LARGE      32M (10%)   13M (4%)    1M (0.3%)     335M
RoBERTa-LARGE   52M (15%)   13M (4%)    1M (0.3%)     355M

Table 1: Parameter statistics for the base and large variants of BERT and RoBERTa. Note that "per-layer" indicates the number of parameters in one intermediate layer, which is more relevant to our study.

Model           CoLA   SST-2   MRPC   STS-B   QQP    MNLI   QNLI   RTE
                MCC    Acc.    F1     ρ       F1     Acc.   Acc.   Acc.
BERT-BASE       58.8   92.7    90.4   89.5    87.8   84.3   91.3   68.2
RoBERTa-BASE    59.9   94.6    92.8   90.8    88.8   87.4   92.7   78.2
BERT-LARGE      61.8   93.4    90.6   89.7    88.3   86.4   92.2   71.1
RoBERTa-LARGE   66.0   95.5    92.8   91.9    89.1   89.9   94.3   84.5

Table 2: Reproduced results of BERT and RoBERTa on the development sets.

3 Experimental Setup

We conduct our experiments on NVIDIA Tesla V100 GPUs with CUDA v10.1. We run the models from the Transformers library (v2.1.1; Wolf et al., 2019) using PyTorch v1.2.0.

3.1 Models and Datasets

We choose BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as the subjects of our study, since they represent the state of the art and share the same architecture. XLNet (Yang et al., 2019) is another alternative; however, it uses a slightly different attention structure, and our preliminary experiments encountered difficulties in reproducibility with the Transformers library. Each model has base and large variants that contain 12 and 24 layers, respectively. We denote them by appending the variant name as a subscript to the model name (rendered here as BERT-BASE, RoBERTa-LARGE, and so on).

Within each variant, the two models display slight variability in parameter count: 110 and 125 million in the base variant, and 335 and 355 million in the large one. These differences are mostly attributed to RoBERTa using many more embedding parameters, exactly 63% more for both variants. For in-depth, layerwise statistics, see Table 1.

For our datasets, we use the GLUE benchmark, which comprises tasks in natural language inference, sentiment classification, linguistic acceptability, and semantic similarity. Specifically, for natural language inference (NLI), it provides the Multi-Genre NLI (MNLI; Williams et al., 2018), Question NLI (QNLI; Wang et al., 2018), Recognizing Textual Entailment (RTE; Bentivogli et al., 2009), and Winograd NLI (Levesque et al., 2012) datasets. For semantic textual similarity and paraphrasing, it contains the Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), and Quora Question Pairs (QQP; Iyer et al.). Finally, its single-sentence tasks consist of the binary-polarity Stanford Sentiment Treebank (SST-2; Socher et al., 2013) and the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018).
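As a concrete companion to the parameter statistics in Table 1, the following minimal sketch buckets the parameters of BERT-BASE into embedding, per-layer, and output groups using the Hugging Face Transformers library. It is illustrative rather than the exact accounting script behind Table 1, and the attribute names assume a current BertModel release rather than the v2.1.1 version cited above.

```python
# Minimal sketch (not the exact script behind Table 1): bucket BERT-BASE
# parameters into embedding, per-layer, and output groups.
# Assumes a recent Hugging Face Transformers release.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

embedding = sum(p.numel() for p in model.embeddings.parameters())
# Every encoder layer has the same shape, so one layer is representative.
per_layer = sum(p.numel() for p in model.encoder.layer[0].parameters())
# The pooler appears to correspond to the "Output" column in Table 1.
output = sum(p.numel() for p in model.pooler.parameters())
total = sum(p.numel() for p in model.parameters())

print(f"embedding: {embedding / 1e6:.1f}M ({embedding / total:.0%})")
print(f"per layer: {per_layer / 1e6:.1f}M ({per_layer / total:.0%})")
print(f"output:    {output / 1e6:.1f}M ({output / total:.1%})")
print(f"total:     {total / 1e6:.1f}M")
```

Summing the embedding bucket, twelve copies of the per-layer bucket, and the output bucket recovers roughly the 110M total that Table 1 reports for BERT-BASE.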
3.2 Fine-Tuning Procedure

Our fine-tuning procedure closely resembles those of BERT and RoBERTa. We choose the Adam optimizer (Kingma and Ba, 2014) with a batch size of 16 and fine-tune BERT for 3 epochs and RoBERTa for 10, following the original papers. For hyperparameter tuning, the best learning rate is different for each task, and all of the original authors choose one between 1 × 10^-5 and 5 × 10^-5; thus, we perform line search over the interval with a step size of 1 × 10^-5. We report the best results in Table 2.

On each model, we freeze the embeddings and the weights of the first N layers, then fine-tune the rest using the best hyperparameters of the full model. Specifically, if L is the number of layers, we explore N = L/2, L/2 + 1, ..., L. Due to computational limitations, we set half as the cutoff point. Additionally, we restrict our comprehensive all-datasets exploration to the base variant of BERT, since the large model variants and RoBERTa are much more computationally intensive. On the smaller CoLA, SST-2, MRPC, and STS-B datasets, we comprehensively evaluate both models. These choices do not substantially affect our analysis.
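The freezing step in this procedure reduces to disabling gradients on the embedding module and on the first N encoder layers before handing the remaining parameters to the optimizer. The snippet below is a minimal sketch of that idea for the Hugging Face Transformers BERT implementation; it is illustrative rather than the exact training code used here, and the learning rate shown is simply one point on the search grid described above.

```python
# Minimal sketch of the freezing procedure in Section 3.2: disable gradients
# for the embeddings and the first N encoder layers, then fine-tune the rest
# together with the task-specific head. Illustrative only; assumes the
# Hugging Face Transformers BERT implementation.
import torch
from transformers import BertForSequenceClassification

def freeze_first_n_layers(model, n):
    """Freeze the embeddings and encoder layers 0..n-1 of a BERT model."""
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:n]:
        for param in layer.parameters():
            param.requires_grad = False
    return model

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model = freeze_first_n_layers(model, n=9)  # e.g., frozen up to the 9th layer

# Only parameters that still require gradients are passed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=3e-5)
```

The learning-rate line search is performed on the full model, repeating the fine-tuning run for each value in {1e-5, 2e-5, 3e-5, 4e-5, 5e-5} and keeping the best development-set score; the frozen configurations then reuse those hyperparameters.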

Model        Frozen   CoLA   SST-2   MRPC   STS-B   QQP    MNLI   MNLI-mm   QNLI   RTE
             up to    MCC    Acc.    F1     ρ       F1     Acc.   Acc.      Acc.   Acc.
BERT-BASE    0th      58.3   92.7    90.3   88.8    87.9   84.2   84.8      91.4   67.6
             9th      47.5   90.8    85.4   88.0    85.3   82.0   82.4      89.5   62.3
             12th     29.4   84.9    81.5   78.1    72.0   56.4   57.1      74.5   57.5

Table 3: Development set results of BERT, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

Model          Frozen   CoLA   SST-2   MRPC   STS-B
               up to    MCC    Acc.    F1     ρ
BERT-BASE      0th      58.3   92.7    90.3   88.9
               9th      47.5   90.8    85.4   88.0
               12th     29.4   84.9    81.5   78.1
RoBERTa-BASE   0th      59.4   94.3    92.3   90.6
               7th      58.6   93.3    89.5   87.7
               12th      0.0   80.2    81.2   20.0

Table 4: Development set results of all base models, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

Model           Frozen   CoLA   SST-2   MRPC   STS-B
                up to    MCC    Acc.    F1     ρ
BERT-LARGE      0th      61.9   93.4    90.3   89.8
                18th     51.6   92.7    85.4   88.0
                24th     24.4   87.8    81.3   71.7
RoBERTa-LARGE   0th      66.1   95.1    92.2   92.0
                17th     60.5   95.1    91.3   89.6
                24th      0.0   79.2    81.2   11.2

Table 5: Development set results of all large models, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

4 Analysis

4.1 Operating Points

We report three relevant operating points in Tables 3–5: two extreme operating points and an intermediate one. The former are self-explanatory, indicating fine-tuning all or none of the nonoutput layers. The latter denotes the number of necessary layers for reaching at least 90% of the full model quality, excluding CoLA, which is an outlier.

From the reported results in Tables 3–5, fine-tuning only the last output layer and the task-specific layers is insufficient for all tasks; see the rows corresponding to 0, 12, and 24 frozen layers. However, we find that fine-tuning the first half of the model is unnecessary: the base models, for example, need fine-tuning of only 3–5 layers out of the 12 to reach 90% of the original quality (see Table 4, middle subrow of each row group). Similarly, fine-tuning only a fourth of the layers is sufficient for the large models (see Table 5): only 6 layers out of 24 for BERT and 7 for RoBERTa.
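To make the 90% criterion concrete, consider BERT-BASE on SST-2 in Table 4: the configuration with nine frozen layers scores 90.8 accuracy against 92.7 for the full model, retaining about 98% of the original quality. A one-line check of this criterion:

```python
# The 90%-parity criterion from Section 4.1: a frozen configuration counts as
# acceptable when it retains at least 90% of the full model's metric.
def reaches_parity(frozen_score, full_score, threshold=0.9):
    return frozen_score >= threshold * full_score

# Example from Table 4: BERT-BASE on SST-2, frozen up to the 9th layer.
print(reaches_parity(90.8, 92.7))  # True; 90.8 / 92.7 is roughly 0.98
```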
[Figure 1: eight line plots, one per task (CoLA, SST-2, MRPC, STS-B) for the base (top row) and large (bottom row) variants, showing the relative change in MCC, accuracy, F1, and ρ for BERT and RoBERTa as the number of frozen initial layers varies.]

Figure 1: Relative change in quality compared to the full models, with respect to the number of frozen initial layers, represented by the x-axes.

4.2 Per-Layer Study

In Figure 1, we examine how the relative quality changes with the number of frozen layers. To compute a relative score, we subtract each frozen model's results from those of its corresponding full model. The relative score aligns the two baselines at zero, allowing a fair comparison of the transformers. The graphs report the average of five trials to reduce the effects of outliers.
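The relative score is simple bookkeeping; the sketch below mirrors the definition above and is not the plotting code used for Figure 1.

```python
# Illustrative sketch of the relative score in Section 4.2: average each
# configuration over its five runs, then subtract the frozen model's mean
# from the full model's mean, so that the full-model baseline sits at zero.
from statistics import mean

def relative_score(frozen_runs, full_runs):
    """frozen_runs, full_runs: per-run development-set metrics (five runs each)."""
    return mean(full_runs) - mean(frozen_runs)
```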
When every component except the output layer and the task-specific layer is frozen, the fine-tuned model achieves only 64% of the original quality, on average. As more layers are fine-tuned, the model effectiveness often improves drastically; see CoLA and STS-B, the first and fourth vertical pairs of subfigures from the left. This demonstrates that gains decompose nonadditively with respect to the number of frozen initial layers. Fine-tuning subsequent layers shows diminishing returns, with every model rapidly approaching the baseline quality once half of the network is fine-tuned; hence, we believe that half is a reasonable cutoff point for characterizing the models.

Finally, for the large variants of BERT and RoBERTa on SST-2 (second subfigure from both the top and the left), we observe a surprisingly consistent increase in quality when freezing 12–16 layers. This finding suggests that these models may be overparameterized for SST-2.

5 Conclusions and Future Work

In this paper, we present a comprehensive evaluation of the number of final layers that need to be fine-tuned for pretrained transformer-based language models. We find that only a fourth of the layers necessarily need to be fine-tuned to obtain 90% of the original quality. One line of future work is to conduct a similar, more fine-grained analysis of the contributions of the attention heads.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and enabled by computational resources provided by Compute Ontario and Compute Canada.

References

Luisa Bentivogli, Ido Kalman Dagan, Dang Hoa, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC 2009 Workshop.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv:1906.04341.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv:1908.08593.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv:1905.10650.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.
