LLMs: P-Tuning v2 — Translation and Interpretation of 《P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks》
Table of Contents
Prompt tuning: P-tuning, Prompt Tuning
NLU tasks: two families, simple classification tasks and hard sequence labeling tasks (e.g., named entity recognition and extractive question answering)
Prompt tuning: methods such as P-tuning and Prompt Tuning introduce trainable continuous prompts to replace natural language prompts
3.1 Lack of Universality: existing prompt tuning methods such as P-tuning and Prompt Tuning are effective, yet still cannot replace fine-tuning
Lack of universality across scales: on medium-sized models, prompt tuning performs much worse than fine-tuning
Lack of universality across tasks: P-tuning and Prompt Tuning do well on NLU benchmarks, but their effectiveness on hard sequence labeling tasks is unverified
This paper proposes P-tuning v2: a universal solution across scales and NLU tasks
Two drawbacks of earlier prompt tuning (e.g., P-tuning, Prompt Tuning): limited tunable parameters, and input embeddings have only an indirect impact on model predictions
The idea behind P-tuning v2: deep prompt tuning, with two improvements (more tunable task-specific parameters; prompts in deeper layers have a more direct impact on predictions)
3.3 Optimization and Implementation
Reparameterization: earlier reparameterization encoders, e.g., Prefix Tuning uses an MLP to transform the trainable embeddings
Prompt length: different NLU tasks usually reach their best performance with different prompt lengths
Multi-task learning: jointly optimize multiple tasks through shared continuous prompts, then fine-tune for each task
Comparison of prompt tuning methods: P-tuning, Prompt Tuning, Prefix Tuning, Soft Prompts, P-tuning v2
NLU tasks: datasets from SuperGLUE, plus sequence labeling tasks including named entity recognition, extractive question answering, and semantic role labeling
Pre-trained models: BERT-large, RoBERTa-large, DeBERTa-xlarge, GLM-xlarge/xxlarge
Multi-task learning: a separate linear classifier per dataset while sharing the continuous prompts
4.1 P-tuning v2: Across Scales
P-tuning and Prompt Tuning can perform quite poorly at smaller scales
4.2 P-tuning v2: Across Tasks
Translation and Interpretation of 《P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks》
Link | Paper: https://arxiv.org/abs/2110.07602 |
Date | October 18, 2021 |
Authors | Tsinghua University, BAAI, et al. |
Summary | This paper proposes P-tuning v2, a new, universally applicable and efficient prompt tuning technique. Core points: Background: conventional fine-tuning updates the entire set of model parameters for every task; it performs well but is expensive to train and requires storing a full copy of the model per task, consuming a lot of memory. Prompt tuning, which only adjusts continuous prompts while freezing the model parameters, greatly reduces storage and compute overhead, but earlier prompt tuning techniques performed poorly on small and medium-sized models and on hard sequence labeling tasks. Problem analysis: prior prompt tuning work lacks universality across model scales and NLU tasks, and is particularly weak on sequence labeling. Method: P-tuning v2 optimizes and adapts the idea of deep prompt tuning by adding trainable prompts at every layer of the pre-trained model, which increases the expressive capacity of the prompts; only the continuous prompts are tuned, and they are applied to every Transformer layer rather than the input layer alone. P-tuning v2 also refines a series of key details, such as whether to use a reparameterization encoder, the choice of prompt length, and multi-task learning; it drops the verbalizer and instead uses a randomly initialized classification head, which suits more complex tasks such as sequence labeling. Results: experiments show that P-tuning v2 matches fine-tuning across model scales (300M to 10B parameters) and across NLU task types while tuning only 0.1%-3% of the parameters, greatly reducing training and storage costs; on small and medium-sized models and on sequence labeling tasks it clearly outperforms previous prompt tuning methods. Advantages: P-tuning v2 is broadly applicable and simple, and can serve both as an alternative to fine-tuning and as a strong baseline for future research. In short, through deep prompt tuning and key optimizations, P-Tuning v2 overcomes the limited universality of earlier prompt tuning methods across model scales and tasks, achieving performance comparable to fine-tuning while drastically reducing parameter overhead. |
Abstract
Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks. It matches the performance of finetuning while having only 0.1%-3% tuned parameters. Our method P-Tuning v2 is an implementation of Deep Prompt Tuning (Li and Liang, 2021; Qin and Eisner, 2021) optimized and adapted for NLU. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to finetuning and a strong baseline for future research.
1 Introduction
Pretrained language models (Radford et al., 2019; Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019) improve performance on a wide range of natural language understanding (NLU) tasks. A widely-used method, fine-tuning, updates the entire set of model parameters for a target task. While fine-tuning obtains good performance, it is memory-consuming during training because gradients and optimizer states for all parameters must be stored. Moreover, keeping a copy of model parameters for each task during inference is inconvenient since pre-trained models are usually large.
Prompting: requires no training at all
Prompting, on the other hand, freezes all parameters of a pre-trained model and uses a natural language prompt to query a language model (Brown et al., 2020). For example, for sentiment analysis, we can concatenate a sample (e.g., "Amazing movie!") with a prompt "This movie is [MASK]" and ask the pre-trained language model to predict the probabilities of the masked token being "good" and "bad" to decide the sample's label. Prompting requires no training at all and stores one single copy of model parameters. However, discrete prompting (Shin et al., 2020; Gao et al., 2020) can lead to suboptimal performance in many cases compared to fine-tuning.
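To make the prompting workflow concrete, here is a minimal sketch (not taken from the paper) of zero-training sentiment classification with a frozen masked language model via the Hugging Face transformers library; the checkpoint `bert-base-uncased` and the verbalizer words "good"/"bad" are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative checkpoint; any masked LM can be queried the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # every parameter stays frozen; prompting involves no training

sample = "Amazing movie!"
prompt = f"{sample} This movie is {tokenizer.mask_token}."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Compare the verbalizer words at the [MASK] position to decide the label.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
good_id = tokenizer.convert_tokens_to_ids("good")
bad_id = tokenizer.convert_tokens_to_ids("bad")
print("positive" if probs[0, good_id] > probs[0, bad_id] else "negative")
```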
Prompt tuning: P-tuning, Prompt Tuning
Prompt tuning is the idea of tuning only the continuous prompts. Specifically, Liu et al. (2021) (P-tuning) and Lester et al. (2021) (Prompt Tuning) proposed to add trainable continuous embeddings (also called continuous prompts) to the original sequence of input word embeddings. Only the continuous prompts are updated during training. While prompt tuning improves over prompting on many tasks (Liu et al., 2021; Lester et al., 2021; Zhong et al., 2021), it still underperforms fine-tuning when the model size is not large, specifically less than 10 billion parameters (Lester et al., 2021). Moreover, as shown in our experiments, prompt tuning performs poorly compared to fine-tuning on several hard sequence labeling tasks such as extractive question answering (Cf. Section 4.2).
This paper proposes P-tuning v2:
Our main contribution in this paper is a novel empirical finding that properly optimized prompt tuning can be comparable to fine-tuning universally across various model scales and NLU tasks. In contrast to observations in prior work, our discovery reveals the universality and potential of prompt tuning for NLU. Technically, our approach P-tuning v2 is not conceptually novel. It can be viewed as an optimized and adapted implementation of Deep Prompt Tuning (Li and Liang, 2021; Qin and Eisner, 2021), which was designed for generation and knowledge probing. The most significant improvement originates from applying continuous prompts for every layer of the pretrained model, instead of the mere input layer. Deep prompt tuning increases the capacity of continuous prompts and closes the gap to fine-tuning across various settings, especially for small models and hard tasks. Moreover, we present a series of critical details of optimization and implementation to ensure finetuning-comparable performance.
Experimental results show that P-tuning v2 matches the performance of fine-tuning at different model scales ranging from 300M to 10B parameters and on various hard sequence tagging tasks such as extractive question answering and named entity recognition. P-tuning v2 has 0.1% to 3% trainable parameters per task compared to fine-tuning, which substantially reduces training-time memory cost and per-task storage cost.
Figure 1: Average scores on RTE, BoolQ and CB of SuperGLUE dev. With 0.1% task-specific parameters, P-tuning v2 can match fine-tuning across wide scales of pre-trained models, while Lester et al. (2021) & P-tuning can only do so conditionally at the 10B scale.
2 Preliminaries
NLU tasks: two families, simple classification tasks and hard sequence labeling tasks (e.g., named entity recognition and extractive question answering)
NLU Tasks. In this work, we categorize NLU challenges into two families: simple classification tasks and hard sequence labeling tasks. Simple classification tasks involve classification over a label space. Most datasets from GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) are in this category. Hard sequence labeling tasks involve classification over a sequence of tokens, such as named entity recognition and extractive question answering.
Prompt tuning: methods such as P-tuning and Prompt Tuning introduce trainable continuous prompts to replace natural language prompts
Prompt Tuning. Let V be the vocabulary of a language model M and let e be the embedding layer of M. In the case of discrete prompting (Schick and Schütze, 2020), prompt tokens {"It", "is", "[MASK]"} ⊂ V can be used to classify a movie review. For example, given the input text x = "Amazing movie!", the input embedding sequence is formulated as [e(x), e("It"), e("is"), e("[MASK]")].
Lester et al. (2021) and Liu et al. (2021) introduce trainable continuous prompts as a substitution to natural language prompts for NLU, with the parameters of pretrained language models frozen. Given the trainable continuous embeddings [h0, ..., hi], the input embedding sequence is written as [e(x), h0, ..., hi, e("[MASK]")], as illustrated in Figure 2. Prompt tuning has been proved to be comparable to fine-tuning on 10-billion-parameter models on simple classification tasks (Lester et al., 2021; Kim et al., 2021; Liu et al., 2021).
Figure 2: From Lester et al. (2021) & P-tuning to P-tuning v2. Orange blocks (i.e., h0, ..., hi) refer to trainable prompt embeddings; blue blocks are embeddings stored or computed by frozen pre-trained language models.
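As a concrete illustration of the formulation above, the following PyTorch sketch prepends trainable prompt vectors to the frozen model's input word embeddings and trains only those vectors plus a tiny linear read-out; the class name, the prepend-rather-than-interleave placement, and the linear head are our own simplifications, not the paper's code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ShallowPromptTuning(nn.Module):
    """Input-level prompt tuning: trainable prompt vectors prepended to frozen word embeddings."""
    def __init__(self, backbone="bert-base-uncased", prompt_len=20, num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone)
        for p in self.backbone.parameters():
            p.requires_grad = False                      # the pretrained LM stays frozen
        hidden = self.backbone.config.hidden_size
        # [h_0, ..., h_i]: the only prompt parameters that receive gradients
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)
        self.classifier = nn.Linear(hidden, num_labels)  # small trainable read-out head

    def forward(self, input_ids, attention_mask):
        batch = input_ids.size(0)
        word_emb = self.backbone.get_input_embeddings()(input_ids)   # e(x)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)      # h_0 .. h_i
        inputs_embeds = torch.cat([prompt, word_emb], dim=1)
        prompt_mask = torch.ones(batch, prompt.size(1),
                                 dtype=attention_mask.dtype, device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        out = self.backbone(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
        # Read out at the first real token (BERT's [CLS]), now shifted by prompt_len.
        return self.classifier(out.last_hidden_state[:, prompt.size(1), :])
```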
3 P-Tuning v2
3.1 Lack of Universality: existing prompt tuning methods such as P-tuning and Prompt Tuning are effective, yet still cannot replace fine-tuning
The prompt tuning methods of Lester et al. (2021) and Liu et al. (2021) have proved quite effective in many NLP applications (Wang et al., 2021a,b; Chen et al., 2021; Zheng et al., 2021; Min et al., 2021), but still fall short of replacing fine-tuning due to a lack of universality, as discussed below.
Lack of universality across scales: on medium-sized models, prompt tuning performs much worse than fine-tuning
Lack of universality across scales. Lester et al. (2021) shows that prompt tuning can be comparable to fine-tuning when the model scales to over 10 billion parameters. However, for medium-sized models (from 100M to 1B) that are widely used, prompt tuning performs much worse than fine-tuning.
Lack of universality across tasks: P-tuning and Prompt Tuning do well on NLU benchmarks, but their effectiveness on hard sequence labeling tasks is unverified
Lack of universality across tasks. Though Lester et al. (2021); Liu et al. (2021) have shown superiority on some of the NLU benchmarks, the effectiveness of prompt tuning on hard sequence tagging tasks is not verified. Sequence tagging predicts a sequence of labels for each input token, which can be harder and incompatible with verbalizers (Schick and Schütze, 2020). In our experiments (Cf. Section 4.2 and Table 3), we show that Lester et al. (2021); Liu et al. (2021) perform poorly on typical sequence tagging tasks compared to fine-tuning.
This paper proposes P-tuning v2: a universal solution across scales and NLU tasks
Considering these challenges, we propose P-tuning v2, which adapts deep prompt tuning (Li and Liang, 2021; Qin and Eisner, 2021) as a universal solution across scales and NLU tasks.
3.2 Deep Prompt Tuning
Two drawbacks of earlier prompt tuning (e.g., P-tuning, Prompt Tuning): limited tunable parameters, and input embeddings have only an indirect impact on model predictions
In (Lester et al., 2021) and (Liu et al., 2021), continuous prompts are only inserted into the input embedding sequence (Cf. Figure 2 (a)). This leads to two challenges. First, the number of tunable parameters is limited due to the constraints of sequence length. Second, the input embeddings have relatively indirect impact on model predictions.
The idea behind P-tuning v2: deep prompt tuning, with two improvements (more tunable task-specific parameters; prompts in deeper layers have a more direct impact on predictions)
To address these challenges, P-tuning v2 employs the idea of deep prompt tuning (Li and Liang, 2021; Qin and Eisner, 2021). As illustrated in Figure 2, prompts in different layers are added as prefix tokens. On one hand, P-tuning v2 has more tunable task-specific parameters (from 0.01% to 0.1%-3%) to allow more per-task capacity while being parameter-efficient; on the other hand, prompts added to deeper layers have a more direct impact on model predictions (see the analysis in Appendix B).
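The sketch below shows one common way to realize such layer-wise prompts for a BERT-style encoder: trainable prefix key/value vectors are injected into every layer through the `past_key_values` interface of Hugging Face transformers, and the attention mask is extended to cover the prefix. This is a simplified approximation of the idea; the official implementation's hyperparameters, dropout, and exact tensor bookkeeping may differ, and the `past_key_values` route depends on the library version.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DeepPromptModel(nn.Module):
    """Deep prompts: trainable prefix key/value vectors injected into every transformer layer."""
    def __init__(self, backbone="bert-base-uncased", prefix_len=64, num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone)
        for p in self.backbone.parameters():
            p.requires_grad = False                              # frozen LM backbone
        cfg = self.backbone.config
        self.prefix_len = prefix_len
        self.n_layer = cfg.num_hidden_layers
        self.n_head = cfg.num_attention_heads
        self.head_dim = cfg.hidden_size // cfg.num_attention_heads
        # One trainable vector per (layer, key/value, prefix position).
        self.prefix = nn.Embedding(prefix_len, self.n_layer * 2 * cfg.hidden_size)
        self.classifier = nn.Linear(cfg.hidden_size, num_labels)

    def _past_key_values(self, batch):
        ids = torch.arange(self.prefix_len, device=self.prefix.weight.device)
        ids = ids.unsqueeze(0).expand(batch, -1)
        kv = self.prefix(ids)                                    # (B, P, L*2*H)
        kv = kv.view(batch, self.prefix_len, self.n_layer * 2, self.n_head, self.head_dim)
        kv = kv.permute(2, 0, 3, 1, 4)                           # (L*2, B, heads, P, d)
        return kv.split(2)                # per layer: a (2, B, heads, P, d) key/value pair

    def forward(self, input_ids, attention_mask):
        batch = input_ids.size(0)
        prefix_mask = torch.ones(batch, self.prefix_len,
                                 dtype=attention_mask.dtype, device=attention_mask.device)
        out = self.backbone(input_ids=input_ids,
                            attention_mask=torch.cat([prefix_mask, attention_mask], dim=1),
                            past_key_values=self._past_key_values(batch))
        return self.classifier(out.last_hidden_state[:, 0, :])  # [CLS]-style read-out
```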
3.3 Optimization and Implementation
There are a few useful details of optimization and implementation for achieving the best performance.
Reparameterization: earlier reparameterization encoders, e.g., Prefix Tuning uses an MLP to transform the trainable embeddings
Reparameterization. Prior works usually leverage a reparameterization encoder such as an MLP (Li and Liang, 2021; Liu et al., 2021) to transform trainable embeddings. However, for NLU, we discover that its usefulness depends on tasks and datasets. For some datasets (e.g., RTE and CoNLL04), MLP brings a consistent improvement; for the others, MLP leads to minimal or even negative effects on the results (e.g., BoolQ and CoNLL12). See Appendix B for more analysis.
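Here is a sketch of how such an optional reparameterization encoder might be wired in; the two-layer MLP mirrors the prefix-tuning recipe, while the flag name `reparam` and the MLP width are illustrative rather than values from the paper.

```python
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Maps prefix positions to per-layer key/value vectors, with optional MLP reparameterization."""
    def __init__(self, prefix_len, hidden_size, n_layer, reparam=True, mlp_hidden=512):
        super().__init__()
        self.reparam = reparam
        out_dim = n_layer * 2 * hidden_size
        if reparam:
            # Prefix-tuning style: a small embedding passed through a two-layer MLP.
            self.embedding = nn.Embedding(prefix_len, hidden_size)
            self.mlp = nn.Sequential(
                nn.Linear(hidden_size, mlp_hidden),
                nn.Tanh(),
                nn.Linear(mlp_hidden, out_dim),
            )
        else:
            # Direct embedding of the prefix, with no reparameterization.
            self.embedding = nn.Embedding(prefix_len, out_dim)

    def forward(self, prefix_ids):            # prefix_ids: (batch, prefix_len) long tensor
        x = self.embedding(prefix_ids)
        return self.mlp(x) if self.reparam else x
```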
Prompt length: different NLU tasks usually reach their best performance with different prompt lengths
Prompt Length. The prompt length plays a critical role in P-Tuning v2. We find that different NLU tasks usually achieve their best performance with different prompt lengths (Cf. Appendix B). Generally, simple classification tasks prefer shorter prompts (less than 20); hard sequence labeling tasks prefer longer ones (around 100).
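If one wanted to encode this guidance as defaults, it could look like the following hypothetical helper; the exact numbers should still be tuned per task as the paper suggests.

```python
# Hypothetical helper (not from the paper's code) encoding the reported tendency:
# simple classification works well with short prompts, hard sequence labeling with ~100.
PROMPT_LEN_DEFAULTS = {
    "simple_classification": 20,    # e.g., most SuperGLUE tasks
    "hard_sequence_labeling": 100,  # e.g., NER, extractive QA, SRL
}

def default_prompt_len(task_family: str) -> int:
    """Return a starting prompt length; the best value should still be searched per task."""
    return PROMPT_LEN_DEFAULTS.get(task_family, 20)
```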
Multi-task learning: jointly optimize multiple tasks through shared continuous prompts, then fine-tune for each task
Multi-task Learning. Multi-task learning jointly optimizes multiple tasks with shared continuous prompts before fine-tuning for individual tasks. Multi-task learning is optional for P-Tuning v2 but can be used to further boost performance by providing a better initialization (Gu et al., 2021).
Classification Head
Classification Head. Using a language modeling head to predict verbalizers (Schick and Schütze, 2020) has been central for prompt tuning (Liu et al., 2021), but we find it unnecessary in a full-data setting and incompatible with sequence labeling. P-tuning v2 instead applies a randomly-initialized classification head on top of the tokens as in BERT (Devlin et al., 2018) (Cf. Figure 2).
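In code, such a head is nothing more than a randomly initialized linear layer applied to the frozen encoder's per-token representations, as in this sketch (dimensions and dropout are placeholders):

```python
import torch.nn as nn

class TokenClassificationHead(nn.Module):
    """Randomly initialized linear head over per-token representations, replacing a verbalizer."""
    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_hidden_states):             # (batch, seq_len, hidden_size)
        return self.classifier(self.dropout(token_hidden_states))  # (batch, seq_len, num_labels)
```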
Comparison of prompt tuning methods: P-tuning, Prompt Tuning, Prefix Tuning, Soft Prompts, P-tuning v2
To clarify P-tuning v2's major contribution, we present a conceptual comparison to existing prompt tuning approaches in Table 1.
Table 1: Conceptual comparison between P-tuning v2 and existing prompt tuning approaches (KP: Knowledge Probe; SeqTag: Sequence Tagging; Re-param.: Reparameterization; No verb.: No verbalizer).
4 Experiments
We conduct extensive experiments over different commonly-used pre-trained models and NLU tasks to verify the effectiveness of P-tuning v2. In this work, all methods except for fine-tuning are conducted with frozen language model backbones, which accords with (Lester et al., 2021)'s setting but differs from (Liu et al., 2021)'s tuned setting. Ratios of task-specific parameters (e.g., 0.1%) are derived from comparing continuous prompts' parameters with transformers' parameters. Another thing to notice is that our experiments are all conducted in the fully-supervised setting rather than the few-shot setting.
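As a rough illustration of how such a ratio can be computed, the helper below counts trainable versus total parameters of a prompt-tuned model; note that the paper's reported ratios compare continuous-prompt parameters against the transformer's parameters, so this bookkeeping is close but not identical.

```python
import torch.nn as nn

def trainable_parameter_ratio(model: nn.Module) -> float:
    """Share of trainable parameters (prompts plus any small head) relative to all parameters."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Example: trainable_parameter_ratio(some_prompt_model) returns a small fraction
# when only the prompts and a linear head are left trainable.
```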
NLU tasks: datasets from SuperGLUE, plus sequence labeling tasks including named entity recognition, extractive question answering, and semantic role labeling
NLU Tasks. First, we include datasets from SuperGLUE (Wang et al., 2019) to test P-tuning v2's general NLU ability. Additionally, we introduce a suite of sequence labeling tasks, including named entity recognition (Sang and De Meulder, 2003; Weischedel et al., 2013; Carreras and Màrquez, 2004), extractive question answering (Rajpurkar et al., 2016), and semantic role labeling (Carreras and Màrquez, 2005; Pradhan et al., 2012).
Pre-trained models: BERT-large, RoBERTa-large, DeBERTa-xlarge, GLM-xlarge/xxlarge
Pre-trained Models. We include BERT-large (Devlin et al., 2018), RoBERTa-large (Liu et al., 2019), DeBERTa-xlarge (He et al., 2020), and GLM-xlarge/xxlarge (Du et al., 2021) for evaluation. They are all bidirectional models designed for NLU tasks, covering a wide range of sizes from about 300M to 10B.
Multi-task learning: a separate linear classifier per dataset while sharing the continuous prompts
Multitask Learning. For the multi-task setting, we combine the training sets of the datasets in each task type (e.g., combining all training sets of semantic role labeling). We use separate linear classifiers for each dataset while sharing the continuous prompts (Cf. Appendix A).
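A minimal sketch of this setup, assuming a prompt-equipped encoder that returns pooled features; the class, argument names, and the example label counts are our own illustration rather than the authors' code.

```python
import torch.nn as nn

class MultiTaskPromptModel(nn.Module):
    """One shared prompt-equipped encoder, one randomly initialized linear head per dataset."""
    def __init__(self, shared_encoder, hidden_size, datasets_num_labels):
        super().__init__()
        self.encoder = shared_encoder       # assumed to return features of size hidden_size
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_size, n_labels)
            for name, n_labels in datasets_num_labels.items()
        })

    def forward(self, dataset_name, **batch):
        features = self.encoder(**batch)    # the shared continuous prompts live inside the encoder
        return self.heads[dataset_name](features)

# e.g., MultiTaskPromptModel(encoder, hidden_size=1024,
#                            datasets_num_labels={"dataset_a": 5, "dataset_b": 9})  # placeholders
```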
Table 4: Comparison between [CLS] label with linear head and verbalizer with LM head on RoBERTa-large.
4.1 P-tuning v2: Across Scales
P-tuning and Prompt Tuning can perform quite poorly at smaller scales
Table 2 presents P-tuning v2's performance across model scales. In SuperGLUE, the performance of Lester et al. (2021) and P-tuning at smaller scales can be quite poor. On the contrary, P-tuning v2 matches the fine-tuning performance in all the tasks at a smaller scale. P-tuning v2 even significantly outperforms fine-tuning on RTE. In terms of larger scales (2B to 10B) with GLM (Du et al., 2021), the gap between Lester et al. (2021); Liu et al. (2021) and fine-tuning is gradually narrowed. At the 10B scale, we have a similar observation to what Lester et al. (2021) report, namely that prompt tuning becomes competitive with fine-tuning. That said, P-tuning v2 is always comparable to fine-tuning at all scales, while needing only 0.1% task-specific parameters compared to fine-tuning.
4.2 P-tuning v2: Across Tasks
From Table 3, we observe that P-tuning v2 can be generally comparable to fine-tuning on all tasks. P-tuning and Lester et al. (2021) show much poorer performance, especially on QA, which might be the most challenging of the three tasks. We also notice that there are some abnormal results of Lester et al. (2021) and P-tuning on SQuAD 2.0. This is probably because SQuAD 2.0 contains unanswerable questions, which causes optimization challenges for single-layer prompt tuning. Multi-task learning generally brings significant improvements to P-Tuning v2 over most tasks except for QA.
4.3 Ablation Study
Classification head vs. language modeling head: no significant difference
Verbalizer with LM head vs. [CLS] label with linear head. The verbalizer with an LM head has been a central component in previous prompt tuning approaches. However, for P-tuning v2 in a supervised setting, it is affordable to tune a linear head with about several thousand parameters. We present our comparison in Table 4, where we keep other hyper-parameters fixed and only change the [CLS] label with a linear head to the verbalizer with an LM head. Here, for simplicity, we use "true" and "false" for SST-2, RTE and BoolQ, and "true", "false" and "neutral" for CB. Results indicate that there is no significant difference between the performance of the verbalizer and that of [CLS].
Prompt depth: P-Tuning v2's main difference is its multi-layer continuous prompts
Prompt depth. The main difference between Lester et al. (2021); Liu et al. (2021) and P-tuning v2 is the multi-layer continuous prompts. To verify its exact influence, given a certain number k of layers to add prompts to, we select them in both ascending and descending order; the rest of the layers are left untouched. As shown in Figure 3, with the same amount of parameters (i.e., the number of transformer layers to add prompts to), adding them in descending order is always better than in ascending order. In the RTE case, adding prompts only to layers 17-24 can yield a performance very close to adding them to all layers.
Figure 3: Ablation study on prompt depth using BERT-large. "[x-y]" refers to the layer interval to which we add continuous prompts (e.g., "21-24" means we add prompts to transformer layers 21 to 24). The same amount of continuous prompts added to deeper transformer layers (i.e., closer to the output layer) can yield better performance than when added to the beginning layers.
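The layer-selection logic of this ablation can be sketched with a small hypothetical helper; the official experiment scripts may organize it differently.

```python
def layers_to_prompt(num_layers: int, k: int, descending: bool = True) -> list:
    """Pick which transformer layers receive prefix prompts in the depth ablation.

    descending=True selects the k layers closest to the output; False selects the k
    layers closest to the input (0-indexed; layer 0 sits right above the embeddings).
    """
    layers = list(range(num_layers))
    return layers[-k:] if descending else layers[:k]

# With BERT-large (24 layers) and k=8:
# descending -> [16, ..., 23], i.e. layers "17-24" in the paper's 1-based notation
# ascending  -> [0, ..., 7]
```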
5 Conclusions
We present P-tuning v2, a prompt tuning method. Despite its relatively limited technical novelty, it contributes to a novel finding that prompt tuning can be comparable to fine-tuning universally across scales (from 330M to 10B parameters) and tasks. With high accuracy and parameter efficiency, P-Tuning v2 can be a potential alternative for fine-tuning and a strong baseline for future work.
ACKNOWLEDGEMENT
We would like to thank the anonymous reviewers for their suggestions and comments. Jie Tang is supported by the NSFC for Distinguished Young Scholar (61825602) and NSFC (61836013). Kaixuan Ji is supported by the Tsinghua University Initiative Scientific Research Program and the DCST Student Academic Training Program.