
What Would Elsa Do?

Freezing Layers During Transformer Fine-Tuning

Jaejun Lee, Raphael Tang, and Jimmy Lin


David R. Cheriton School of Computer Science
University of Waterloo

arXiv:1911.03090v1 [cs.CL] 8 Nov 2019

Abstract

Pretrained transformer-based language models have achieved state of the art across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality. Surprisingly, we also find that fine-tuning all layers does not always help.

1 Introduction

Transformer-based pretrained language models are a battle-tested solution to a plethora of natural language processing tasks. In this paradigm, a transformer-based language model is first trained on copious amounts of text, then fine-tuned on task-specific data. BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) are some of the most well-known ones, representing the current state of the art in natural language inference, question answering, and sentiment classification, to list a few. These models are extremely expressive, consisting of at least a hundred million parameters, a hundred attention heads, and a dozen layers.

An emerging line of work questions the need for such a parameter-loaded model, especially on a single downstream task. Michel et al. (2019), for example, note that only a few attention heads need to be retained in each layer for acceptable effectiveness. Kovaleva et al. (2019) find that, on many tasks, just the last few layers change the most after the fine-tuning process. We take these observations as evidence that only the last few layers necessarily need to be fine-tuned.

The central objective of our paper is, then, to determine how many of the last layers actually need fine-tuning. Why is this an important subject of study? Pragmatically, a reasonable cutoff point saves computational memory across fine-tuning multiple tasks, which bolsters the effectiveness of existing parameter-saving methods (Houlsby et al., 2019). Pedagogically, understanding the relationship between the number of fine-tuned layers and the resulting model quality may guide future works in modeling.

Our research contribution is a comprehensive evaluation, across multiple pretrained transformers and datasets, of the number of final layers needed for fine-tuning. We show that, on most tasks, we need to fine-tune only one fourth of the final layers to achieve within 10% parity with the full model. Surprisingly, on SST-2, a sentiment classification dataset, we find that not fine-tuning all of the layers leads to improved quality.

2 Background and Related Work

2.1 Pretrained Language Models

In the pretrained language modeling paradigm, a language model (LM) is trained on vast amounts of text, then fine-tuned on a specific downstream task. Peters et al. (2018) are one of the first to successfully apply this idea, outperforming the state of the art in question answering, textual entailment, and sentiment classification. Their model, dubbed ELMo, comprises a two-layer BiLSTM pretrained on the Billion Word Corpus (Chelba et al., 2014).

Furthering this approach with more data and improved modeling, Devlin et al. (2019) pretrain deep 12- and 24-layer bidirectional transformers (Vaswani et al., 2017) on the entirety of Wikipedia and BooksCorpus (Zhu et al., 2015). Their approach, called BERT, achieves state of the art across all tasks in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), as well as the Stanford Question Answering Dataset (Rajpurkar et al., 2016).

As a result of this development, a flurry of recent papers has followed this more-data-plus-better-models principle. Two prominent examples include XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), both of which contest the present state of the art. XLNet proposes to pretrain two-stream attention-augmented transformers on an autoregressive LM objective, instead of the original cloze and next sentence prediction (NSP) tasks from BERT. RoBERTa primarily argues for pretraining longer, using more data, and removing the NSP task from BERT.

2.2 Layerwise Interpretability

The prevailing evidence in the neural network literature suggests that earlier layers extract universal features, while later ones perform task-specific modeling. Zeiler and Fergus (2014) visualize the per-layer activations in image classification networks, finding that the first few layers function as corner and edge detectors, and the final layers as class-specific feature extractors. Gatys et al. (2016) demonstrate that the low- and high-level notions of content and style are separable in convolutional neural networks, with lower layers capturing content and higher layers style.

Pretrained transformers. In the NLP literature, similar observations have been made for pretrained language models. Clark et al. (2019) analyze BERT's attention and observe that the bottom layers attend broadly, while the top layers capture linguistic syntax. Kovaleva et al. (2019) find that the last few layers of BERT change the most after task-specific fine-tuning. Similar to our work, Houlsby et al. (2019) fine-tune the top layers of BERT as part of the baseline comparison for their model compression approach. However, none of these studies comprehensively examine the number of necessary final layers across multiple pretrained transformers and datasets.

Model           Embedding   Per-Layer   Output        Total
BERT-BASE       24M (22%)   7M (7%)     0.6M (0.5%)   110M
RoBERTa-BASE    39M (31%)   7M (6%)     0.6M (0.5%)   125M
BERT-LARGE      32M (10%)   13M (4%)    1M (0.3%)     335M
RoBERTa-LARGE   52M (15%)   13M (4%)    1M (0.3%)     355M

Table 1: Parameter statistics for the base and large variants of BERT and RoBERTa. Note that "per-layer" indicates the number of parameters in one intermediate layer, which is more relevant to our study.

Model           CoLA   SST-2   MRPC   STS-B   QQP    MNLI   QNLI   RTE
                MCC    Acc.    F1     ρ       F1     Acc.   Acc.   Acc.
BERT-BASE       58.8   92.7    90.4   89.5    87.8   84.3   91.3   68.2
RoBERTa-BASE    59.9   94.6    92.8   90.8    88.8   87.4   92.7   78.2
BERT-LARGE      61.8   93.4    90.6   89.7    88.3   86.4   92.2   71.1
RoBERTa-LARGE   66.0   95.5    92.8   91.9    89.1   89.9   94.3   84.5

Table 2: Reproduced results of BERT and RoBERTa on the development sets.

3 Experimental Setup

We conduct our experiments on NVIDIA Tesla V100 GPUs with CUDA v10.1. We run the models from the Transformers library (v2.1.1; Wolf et al., 2019) using PyTorch v1.2.0.

3.1 Models and Datasets

We choose BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as the subjects of our study, since they represent the state of the art and share the same architecture. XLNet (Yang et al., 2019) is another alternative; however, it uses a slightly different attention structure, and our preliminary experiments encountered difficulties in reproducibility with the Transformers library. Each model has base and large variants that contain 12 and 24 layers, respectively. We denote them by appending the variant name as a subscript to the model name (rendered here as BERT-BASE, RoBERTa-LARGE, and so on).

Within each variant, the two models display slight variability in parameter count: 110 and 125 million in the base variant, and 335 and 355 million in the large one. These differences are mostly attributed to RoBERTa using many more embedding parameters, exactly 63% more for both variants. For in-depth, layerwise statistics, see Table 1.

For our datasets, we use the GLUE benchmark, which comprises tasks in natural language inference, sentiment classification, linguistic acceptability, and semantic similarity. Specifically, for natural language inference (NLI), it provides the Multi-Genre NLI (MNLI; Williams et al., 2018), Question NLI (QNLI; Wang et al., 2018), Recognizing Textual Entailment (RTE; Bentivogli et al., 2009), and Winograd NLI (Levesque et al., 2012) datasets. For semantic textual similarity and paraphrasing, it contains the Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), and Quora Question Pairs (QQP; Iyer et al.). Finally, its single-sentence tasks consist of the binary-polarity Stanford Sentiment Treebank (SST-2; Socher et al., 2013) and the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018).
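As a concrete companion to the parameter statistics in Table 1, the following minimal sketch buckets the parameters of BERT-BASE into embedding, per-layer, and output groups using the Hugging Face Transformers library. It is illustrative rather than the exact accounting script behind Table 1, and the attribute names assume a current BertModel release rather than the v2.1.1 version cited above.

```python
# Minimal sketch (not the exact script behind Table 1): bucket BERT-BASE
# parameters into embedding, per-layer, and output groups.
# Assumes a recent Hugging Face Transformers release.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

embedding = sum(p.numel() for p in model.embeddings.parameters())
# Every encoder layer has the same shape, so one layer is representative.
per_layer = sum(p.numel() for p in model.encoder.layer[0].parameters())
# The pooler appears to correspond to the "Output" column in Table 1.
output = sum(p.numel() for p in model.pooler.parameters())
total = sum(p.numel() for p in model.parameters())

print(f"embedding: {embedding / 1e6:.1f}M ({embedding / total:.0%})")
print(f"per layer: {per_layer / 1e6:.1f}M ({per_layer / total:.0%})")
print(f"output:    {output / 1e6:.1f}M ({output / total:.1%})")
print(f"total:     {total / 1e6:.1f}M")
```

Summing the embedding bucket, twelve copies of the per-layer bucket, and the output bucket recovers roughly the 110M total that Table 1 reports for BERT-BASE.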
3.2 Fine-Tuning Procedure

Our fine-tuning procedure closely resembles those of BERT and RoBERTa. We choose the Adam optimizer (Kingma and Ba, 2014) with a batch size of 16 and fine-tune BERT for 3 epochs and RoBERTa for 10, following the original papers. For hyperparameter tuning, the best learning rate is different for each task, and all of the original authors choose one between 1 × 10^-5 and 5 × 10^-5; thus, we perform line search over the interval with a step size of 1 × 10^-5. We report the best results in Table 2.

On each model, we freeze the embeddings and the weights of the first N layers, then fine-tune the rest using the best hyperparameters of the full model. Specifically, if L is the number of layers, we explore N = L/2, L/2 + 1, ..., L. Due to computational limitations, we set half as the cutoff point. Additionally, we restrict our comprehensive all-datasets exploration to the base variant of BERT, since the large model variants and RoBERTa are much more computationally intensive. On the smaller CoLA, SST-2, MRPC, and STS-B datasets, we comprehensively evaluate both models. These choices do not substantially affect our analysis.
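The freezing step in this procedure reduces to disabling gradients on the embedding module and on the first N encoder layers before handing the remaining parameters to the optimizer. The snippet below is a minimal sketch of that idea for the Hugging Face Transformers BERT implementation; it is illustrative rather than the exact training code used here, and the learning rate shown is simply one point on the search grid described above.

```python
# Minimal sketch of the freezing procedure in Section 3.2: disable gradients
# for the embeddings and the first N encoder layers, then fine-tune the rest
# together with the task-specific head. Illustrative only; assumes the
# Hugging Face Transformers BERT implementation.
import torch
from transformers import BertForSequenceClassification

def freeze_first_n_layers(model, n):
    """Freeze the embeddings and encoder layers 0..n-1 of a BERT model."""
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:n]:
        for param in layer.parameters():
            param.requires_grad = False
    return model

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model = freeze_first_n_layers(model, n=9)  # e.g., frozen up to the 9th layer

# Only parameters that still require gradients are passed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=3e-5)
```

The learning-rate line search is performed on the full model, repeating the fine-tuning run for each value in {1e-5, 2e-5, 3e-5, 4e-5, 5e-5} and keeping the best development-set score; the frozen configurations then reuse those hyperparameters.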

Model        Frozen   CoLA   SST-2   MRPC   STS-B   QQP    MNLI   MNLI-mm   QNLI   RTE
             up to    MCC    Acc.    F1     ρ       F1     Acc.   Acc.      Acc.   Acc.
BERT-BASE    0th      58.3   92.7    90.3   88.8    87.9   84.2   84.8      91.4   67.6
             9th      47.5   90.8    85.4   88.0    85.3   82.0   82.4      89.5   62.3
             12th     29.4   84.9    81.5   78.1    72.0   56.4   57.1      74.5   57.5

Table 3: Development set results of BERT, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

Model          Frozen   CoLA   SST-2   MRPC   STS-B
               up to    MCC    Acc.    F1     ρ
BERT-BASE      0th      58.3   92.7    90.3   88.9
               9th      47.5   90.8    85.4   88.0
               12th     29.4   84.9    81.5   78.1
RoBERTa-BASE   0th      59.4   94.3    92.3   90.6
               7th      58.6   93.3    89.5   87.7
               12th      0.0   80.2    81.2   20.0

Table 4: Development set results of all base models, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

Model           Frozen   CoLA   SST-2   MRPC   STS-B
                up to    MCC    Acc.    F1     ρ
BERT-LARGE      0th      61.9   93.4    90.3   89.8
                18th     51.6   92.7    85.4   88.0
                24th     24.4   87.8    81.3   71.7
RoBERTa-LARGE   0th      66.1   95.1    92.2   92.0
                17th     60.5   95.1    91.3   89.6
                24th      0.0   79.2    81.2   11.2

Table 5: Development set results of all large models, with none, some, and all of the nonoutput layer weights fine-tuned. Results are averaged across five runs.

4 Analysis

4.1 Operating Points

We report three relevant operating points in Tables 3–5: two extreme operating points and an intermediate one. The former are self-explanatory, indicating fine-tuning all or none of the nonoutput layers. The latter denotes the number of necessary layers for reaching at least 90% of the full model quality, excluding CoLA, which is an outlier.

From the reported results in Tables 3–5, fine-tuning only the last output layer and the task-specific layers is insufficient for all tasks; see the rows corresponding to 0, 12, and 24 frozen layers. However, we find that fine-tuning the first half of the model is unnecessary: the base models, for example, need fine-tuning of only 3–5 layers out of the 12 to reach 90% of the original quality (see Table 4, middle subrow of each row group). Similarly, fine-tuning only a fourth of the layers is sufficient for the large models (see Table 5): only 6 layers out of 24 for BERT and 7 for RoBERTa.
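To make the 90% criterion concrete, consider BERT-BASE on SST-2 in Table 4: the configuration with nine frozen layers scores 90.8 accuracy against 92.7 for the full model, retaining about 98% of the original quality. A one-line check of this criterion:

```python
# The 90%-parity criterion from Section 4.1: a frozen configuration counts as
# acceptable when it retains at least 90% of the full model's metric.
def reaches_parity(frozen_score, full_score, threshold=0.9):
    return frozen_score >= threshold * full_score

# Example from Table 4: BERT-BASE on SST-2, frozen up to the 9th layer.
print(reaches_parity(90.8, 92.7))  # True; 90.8 / 92.7 is roughly 0.98
```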
[Figure 1: eight line plots, one per task (CoLA, SST-2, MRPC, STS-B) for the base (top row) and large (bottom row) variants, showing the relative change in MCC, accuracy, F1, and ρ for BERT and RoBERTa as the number of frozen initial layers varies.]

Figure 1: Relative change in quality compared to the full models, with respect to the number of frozen initial layers, represented by the x-axes.

4.2 Per-Layer Study

In Figure 1, we examine how the relative quality changes with the number of frozen layers. To compute a relative score, we subtract each frozen model's results from those of its corresponding full model. The relative score aligns the two baselines at zero, allowing a fair comparison of the transformers. The graphs report the average of five trials to reduce the effects of outliers.
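The relative score is simple bookkeeping; the sketch below mirrors the definition above and is not the plotting code used for Figure 1.

```python
# Illustrative sketch of the relative score in Section 4.2: average each
# configuration over its five runs, then subtract the frozen model's mean
# from the full model's mean, so that the full-model baseline sits at zero.
from statistics import mean

def relative_score(frozen_runs, full_runs):
    """frozen_runs, full_runs: per-run development-set metrics (five runs each)."""
    return mean(full_runs) - mean(frozen_runs)
```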
When every component except the output layer and the task-specific layer is frozen, the fine-tuned model achieves only 64% of the original quality, on average. As more layers are fine-tuned, the model effectiveness often improves drastically; see CoLA and STS-B, the first and fourth vertical pairs of subfigures from the left. This demonstrates that gains decompose nonadditively with respect to the number of frozen initial layers. Fine-tuning subsequent layers shows diminishing returns, with every model rapidly approaching the baseline quality once half of the network is fine-tuned; hence, we believe that half is a reasonable cutoff point for characterizing the models.

Finally, for the large variants of BERT and RoBERTa on SST-2 (second subfigure from both the top and the left), we observe a surprisingly consistent increase in quality when freezing 12–16 layers. This finding suggests that these models may be overparameterized for SST-2.

5 Conclusions and Future Work

In this paper, we present a comprehensive evaluation of the number of final layers that need to be fine-tuned for pretrained transformer-based language models. We find that only a fourth of the layers necessarily need to be fine-tuned to obtain 90% of the original quality. One line of future work is to conduct a similar, more fine-grained analysis of the contributions of the attention heads.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and enabled by computational resources provided by Compute Ontario and Compute Canada.

References

Luisa Bentivogli, Ido Kalman Dagan, Dang Hoa, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC 2009 Workshop.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv:1906.04341.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv:1908.08593.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv:1905.10650.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.
