Abstract

Pretrained transformer-based language models have achieved state of the art across many tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality. Surprisingly, we also find that fine-tuning all layers does not always help.

1 Introduction

Transformer-based pretrained language models are a battle-tested solution to a plethora of natural language processing tasks. In this paradigm, a transformer-based language model is first trained on copious amounts of text, then fine-tuned on task-specific data. BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) are some of the most well-known ones, representing the current state of the art in natural language inference, question answering, and sentiment classification, to list a few. These models are extremely expressive, consisting of at least a hundred million parameters, a hundred attention heads, and a dozen layers.

An emerging line of work questions the need for such a parameter-loaded model, especially on a single downstream task. Michel et al. (2019), for example, show that many attention heads can be pruned with little loss in quality, and Kovaleva et al. (2019) observe that, on many tasks, just the last few layers change the most after the fine-tuning process. We take these observations as evidence that only the last few layers necessarily need to be fine-tuned.

The central objective of our paper is, then, to determine how many of the last layers actually need fine-tuning. Why is this an important subject of study? Pragmatically, a reasonable cutoff point saves computational memory across fine-tuning multiple tasks, which bolsters the effectiveness of existing parameter-saving methods (Houlsby et al., 2019). Pedagogically, understanding the relationship between the number of fine-tuned layers and the resulting model quality may guide future works in modeling.

Our research contribution is a comprehensive evaluation, across multiple pretrained transformers and datasets, of the number of final layers needed for fine-tuning. We show that, on most tasks, we need to fine-tune only a fourth of the final layers to achieve within 10% parity with the full model. Surprisingly, on SST-2, a sentiment classification dataset, we find that not fine-tuning all of the layers leads to improved quality.

2 Background and Related Work

2.1 Pretrained Language Models

In the pretrained language modeling paradigm, a language model (LM) is trained on vast amounts of text, then fine-tuned on a specific downstream task. Peters et al. (2018) are among the first to successfully apply this idea, outperforming the state of the art in question answering, textual entailment, and sentiment classification. Their model, dubbed ELMo, comprises a two-layer BiLSTM pretrained on the Billion Word Corpus (Chelba et al., 2014).
Furthering this approach with more data and improved modeling, Devlin et al. (2019) pretrain deep 12- and 24-layer bidirectional transformers (Vaswani et al., 2017) on the entirety of Wikipedia and BooksCorpus (Zhu et al., 2015). Their approach, called BERT, achieves state of the art across all tasks in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), as well as the Stanford Question Answering Dataset (Rajpurkar et al., 2016).

As a result of this development, a flurry of recent papers has followed this more-data-plus-better-models principle. Two prominent examples include XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019).

Model           Embedding    Per-Layer   Output        Total
BERT-base       24M (22%)    7M (7%)     0.6M (0.5%)   110M
RoBERTa-base    39M (31%)    7M (6%)     0.6M (0.5%)   125M
BERT-large      32M (10%)    13M (4%)    1M (0.3%)     335M
RoBERTa-large   52M (15%)    13M (4%)    1M (0.3%)     355M

Table 1: Parameter statistics for the base and large variants of BERT and RoBERTa. Note that "per-layer" indicates the number of parameters in one intermediate layer, which is more relevant to our study.

                CoLA   SST-2   MRPC   STS-B   QQP    MNLI   QNLI   RTE
Model           MCC    Acc.    F1     ρ       F1     Acc.   Acc.   Acc.
BERT-base       58.8   92.7    90.4   89.5    87.8   84.3   91.3   68.2
RoBERTa-base    59.9   94.6    92.8   90.8    88.8   87.4   92.7   78.2
BERT-large      61.8   93.4    90.6   89.7    88.3   86.4   92.2   71.1
RoBERTa-large   66.0   95.5    92.8   91.9    89.1   89.9   94.3   84.5

Table 3: Development set results of BERT, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.
Model           Frozen up to   CoLA (MCC)   SST-2 (Acc.)   MRPC (F1)   STS-B (ρ)
BERT-base       0th            58.3         92.7           90.3        88.9
                9th            47.5         90.8           85.4        88.0
                12th           29.4         84.9           81.5        78.1
RoBERTa-base    0th            59.4         94.3           92.3        90.6
                7th            58.6         93.3           89.5        87.7
                12th           0.0          80.2           81.2        20.0

Table 4: Development set results of all base models, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

Model           Frozen up to   CoLA (MCC)   SST-2 (Acc.)   MRPC (F1)   STS-B (ρ)
BERT-large      0th            61.9         93.4           90.3        89.8
                18th           51.6         92.7           85.4        88.0
                24th           24.4         87.8           81.3        71.7
RoBERTa-large   0th            66.1         95.1           92.2        92.0
                17th           60.5         95.1           91.3        89.6
                24th           0.0          79.2           81.2        11.2

Table 5: Development set results of all large models, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.
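To make the freezing scheme behind Tables 4 and 5 concrete, here is a minimal sketch using the HuggingFace Transformers library (Wolf et al., 2019); the helper freeze_up_to, the classification head, and the hyperparameters are illustrative assumptions rather than the exact training setup.

# Sketch: freeze the embeddings and the first k encoder layers of BERT,
# leaving the remaining layers and the task-specific head trainable.
import torch
from transformers import BertForSequenceClassification

def freeze_up_to(model, k):
    # Freeze the embeddings and encoder layers 1..k; everything after
    # layer k, plus the pooler and classifier, keeps requires_grad=True.
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:k]:
        for param in layer.parameters():
            param.requires_grad = False

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
freeze_up_to(model, k=9)  # e.g., the "frozen up to 9th" setting of Table 4

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=2e-5)  # illustrative hyperparameters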
[Figure 1 comprises a 2x4 grid of subplots: one column per task (CoLA, SST-2, MRPC, STS-B), with the base models in the top row and the large models (CoLA-large, SST-2-large, MRPC-large, STS-B-large) in the bottom row. The y-axes show the relative change in MCC, accuracy, F1, and ρ for BERT and RoBERTa; the x-axes show the number of frozen initial layers.]

Figure 1: Relative change in quality compared to the full models, with respect to the number of frozen initial layers, represented by the x-axes.
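The y-axis of Figure 1 is the relative score described in Section 4.2 below; a minimal sketch of how such a value could be computed for one task, with purely hypothetical run scores, follows.

import numpy as np

# Hypothetical development-set scores (0-1 scale) over five runs for one task.
frozen_runs = np.array([0.479, 0.471, 0.476, 0.473, 0.476])  # partially frozen model
full_runs = np.array([0.585, 0.589, 0.587, 0.591, 0.588])    # fully fine-tuned model

# Relative score: frozen minus full, each averaged across the five runs.
relative = frozen_runs.mean() - full_runs.mean()
print(f"relative change: {relative:+.3f}")  # negative means below the full baseline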
subrow of each row group. Similarly, fine-tuning only a fourth of the layers is sufficient for the large models (see Table 5); only 6 layers out of 24 for BERT and 7 for RoBERTa.

4.2 Per-Layer Study

In Figure 1, we examine how the relative quality changes with the number of frozen layers. To compute a relative score, we subtract each frozen model's results from its corresponding full model. The relative score aligns the two baselines at zero, allowing the fair comparison of the transformers. The graphs report the average of five trials to reduce the effects of outliers.

When every component except the output layer and the task-specific layer is frozen, the fine-tuned model achieves only 64% of the original quality, on average. As more layers are fine-tuned, the model effectiveness often improves drastically; see CoLA and STS-B, the first and fourth vertical pairs of subfigures from the left. This demonstrates that gains decompose nonadditively with respect to the number of frozen initial layers. Fine-tuning subsequent layers shows diminishing returns, with every model rapidly approaching the baseline quality when fine-tuning half of the network; hence, we believe that half is a reasonable cutoff point for characterizing the models.

Finally, for the large variants of BERT and RoBERTa on SST-2 (second subfigure from both the top and the left), we observe a surprisingly consistent increase in quality when freezing 12–16 layers. This finding suggests that these models may be overparameterized for SST-2.

5 Conclusions and Future Work

In this paper, we present a comprehensive evaluation of the number of final layers that need to be fine-tuned for pretrained transformer-based language models. We find that only a fourth of the layers necessarily need to be fine-tuned to obtain 90% of the original quality. One line of future work is to conduct a similar, more fine-grained analysis of the contributions of the attention heads.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and enabled by computational resources provided by Compute Ontario and Compute Canada.

References

Luisa Bentivogli, Ido Kalman Dagan, Dang Hoa, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC 2009 Workshop.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv:1906.04341.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv:1908.08593.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv:1905.10650.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.