
Prefix-Tuning: Optimizing Continuous Prompts for Generation


Transcription of Prefix-Tuning: Optimizing Continuous Prompts for Generation

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4582-4597, August 1-6, 2021. © 2021 Association for Computational Linguistics.

Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li (Stanford University) and Percy Liang (Stanford University)

Abstract: Fine-tuning is the de facto way of leveraging large pretrained language models for downstream tasks. However, fine-tuning modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, which we call the prefix. Prefix-tuning draws inspiration from prompting for language models, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens".
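As a minimal PyTorch sketch of the virtual-token idea (an embedding-level simplification with illustrative names, not the authors' released code; the paper's full method parameterizes prefix activations at the Transformer layers, but the intuition is the same), a learnable prefix matrix can simply be concatenated in front of the embedded input:

```python
import torch
import torch.nn as nn

class PrefixAsVirtualTokens(nn.Module):
    """Prepend a learnable continuous prefix to the embedded input.

    `frozen_lm` is assumed to be any decoder module that maps a sequence of
    input embeddings of shape (batch, seq_len, d_model) to output scores;
    its parameters are never updated.
    """
    def __init__(self, frozen_lm: nn.Module, d_model: int, prefix_len: int = 10):
        super().__init__()
        self.frozen_lm = frozen_lm
        for p in self.frozen_lm.parameters():
            p.requires_grad = False              # keep the pretrained LM intact
        # The prefix is a sequence of free vectors, not tied to any real tokens.
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1)
        # Later positions attend to the prefix positions as if they were tokens.
        return self.frozen_lm(torch.cat([prefix, input_embeds], dim=1))
```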

We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We show that by modifying only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics that are unseen during training.

1 Introduction

Fine-tuning is the prevalent paradigm for using large pretrained language models (LMs) (Radford et al., 2019; Devlin et al., 2019) to perform downstream tasks (e.g., summarization), but it requires updating and storing all the parameters of the LM. Consequently, to build and deploy NLP systems that rely on large pretrained LMs, one currently needs to store a modified copy of all the LM parameters for each task. This can be prohibitively expensive given the size of current LMs; for example, GPT-2 has 774M parameters (Radford et al., 2019) and GPT-3 has 175B parameters (Brown et al., 2020).

A natural approach to this problem is lightweight fine-tuning, which freezes most of the pretrained parameters and only tunes a smaller set of parameters. For example, adapter-tuning (Rebuffi et al., 2017; Houlsby et al., 2019) inserts additional task-specific layers between the layers of pretrained language models. Adapter-tuning has promising performance on natural language understanding and generation benchmarks, attaining comparable performance with fine-tuning while adding only around 2-4% task-specific parameters (Houlsby et al., 2019; Lin et al., 2020).

Figure 1: Fine-tuning (top) updates all LM parameters (the red Transformer box) and requires storing a full model copy for each task. We propose prefix-tuning (bottom), which freezes the LM parameters and only optimizes the prefix (the red prefix blocks). Consequently, we only need to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denotes transformer activations at one time step.
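For concreteness, the adapter-tuning approach mentioned above can be sketched as a bottleneck module in the spirit of Houlsby et al. (2019); the dimensions and the exact insertion point are illustrative assumptions rather than details taken from the cited papers:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual.

    With d_model = 1024 and bottleneck = 64, each inserted adapter adds roughly
    2 * d_model * bottleneck weights, which is how adapter-tuning keeps its
    task-specific parameters in the low single-digit percent range.
    """
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Only these small projections are trained; the surrounding LM layer stays frozen.
        return hidden + self.up(self.act(self.down(hidden)))
```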

At the limit, GPT-3 (Brown et al., 2020) can be deployed using in-context learning, which is a form of prompting, without modifying any LM parameters. In in-context learning, Brown et al. (2020) prepend a natural language task instruction (e.g., TL;DR for summarization) and a few examples to the task input, and then generate the task output from the LM. However, since Transformers can only condition on a bounded-length context (e.g., 2048 tokens for GPT-3), in-context learning is restricted to very small training sets.

In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation (NLG) tasks, inspired by prompting. Consider the task of generating a textual description of a data table, as shown in Figure 1, where the task input is a linearized table (e.g., "name: Starbucks | type: coffee shop") and the output is a textual description (e.g., "Starbucks serves coffee.").
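To make the contrast with discrete prompting concrete, the following toy helper (illustrative only; the instruction text, the separators, and the "Blue Spice" record are made up) builds an in-context prompt from linearized tables. Every demonstration consumes real context tokens, which is exactly the budget that prefix-tuning avoids spending:

```python
def linearize_table(record: dict) -> str:
    """Flatten a key-value record into the 'name: Starbucks | type: coffee shop' form."""
    return " | ".join(f"{key}: {value}" for key, value in record.items())

def build_in_context_prompt(instruction: str, demonstrations: list, test_record: dict) -> str:
    """Prepend an instruction and a few worked examples to the test input."""
    lines = [instruction]
    for record, description in demonstrations:
        lines.append(f"{linearize_table(record)} => {description}")
    lines.append(f"{linearize_table(test_record)} =>")   # the LM continues from here
    return "\n".join(lines)

prompt = build_in_context_prompt(
    "Describe the table in one sentence.",
    [({"name": "Starbucks", "type": "coffee shop"}, "Starbucks serves coffee.")],
    {"name": "Blue Spice", "type": "restaurant"},
)
```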

Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by the red blocks in Figure 1 (bottom). To generate each token, the LM can attend to the prefix as if it were a sequence of "virtual tokens", but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens. In contrast to fine-tuning in Figure 1 (top), which updates all LM parameters and thus requires storing a tuned copy of the model for each task, prefix-tuning only optimizes the prefix. Consequently, we only need to store one copy of the large LM and a learned task-specific prefix, yielding a very small overhead for each additional task (e.g., 250K parameters for table-to-text).

In contrast to full fine-tuning, prefix-tuning is also modular: we train an upstream prefix which steers an unmodified LM, and therefore, a single LM can support many tasks at once.
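The training and storage pattern described above can be sketched end to end with toy sizes (the tiny stand-in LM, the placeholder objective, and the file name below are all illustrative assumptions, not the paper's setup):

```python
import torch
import torch.nn as nn

d_model, vocab_size, prefix_len = 16, 100, 4
lm = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(), nn.Linear(d_model, vocab_size))
for p in lm.parameters():
    p.requires_grad = False                       # the shared pretrained LM stays frozen

prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
optimizer = torch.optim.AdamW([prefix], lr=5e-5)  # the optimizer only ever sees the prefix

embeds = torch.randn(2, 7, d_model)               # stand-in for an embedded task input
full_input = torch.cat([prefix.unsqueeze(0).expand(2, -1, -1), embeds], dim=1)
loss = lm(full_input).pow(2).mean()               # placeholder objective, just to show the update
loss.backward()
optimizer.step()

# Per task we persist only the small prefix tensor, never another copy of the LM.
torch.save({"prefix": prefix.detach().cpu()}, "table_to_text.prefix.pt")
```

At inference time, the same frozen LM is loaded once and the saved prefix for whichever task is being served is prepended to its input.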

In the context of personalization, where the tasks correspond to users (Shokri and Shmatikov, 2015; McMahan et al., 2016), we would have a separate prefix for each user trained only on that user's data, thereby avoiding data cross-contamination. Moreover, the prefix-based architecture enables us to even process examples from multiple users/tasks in a single batch, something that is not possible with other lightweight fine-tuning approaches like adapter-tuning.

We evaluate prefix-tuning on table-to-text generation using GPT-2 and abstractive summarization using BART. In terms of storage, prefix-tuning stores 1000x fewer parameters than full fine-tuning. In terms of performance when trained on full datasets, prefix-tuning and fine-tuning are comparable for table-to-text, while prefix-tuning suffers a small degradation for summarization. In low-data settings, prefix-tuning outperforms fine-tuning on both tasks.
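The mixed-batch property mentioned above can be illustrated with a small sketch (names and sizes are assumptions for illustration): because the LM itself is shared and unmodified, each example simply looks up its own prefix before the batch is fed through the model, which is not possible when task-specific layers are baked into the network.

```python
import torch
import torch.nn as nn

n_tasks, prefix_len, d_model = 3, 4, 16
# One learned prefix per task (or per user, in the personalization setting).
prefix_table = nn.Parameter(torch.randn(n_tasks, prefix_len, d_model) * 0.02)

def prepend_task_prefixes(input_embeds: torch.Tensor, task_ids: torch.Tensor) -> torch.Tensor:
    """Prepend each example's own prefix, so one batch can mix several tasks/users."""
    per_example_prefix = prefix_table[task_ids]            # (batch, prefix_len, d_model)
    return torch.cat([per_example_prefix, input_embeds], dim=1)

batch_embeds = torch.randn(5, 7, d_model)                  # 5 examples, 7 tokens each
task_ids = torch.tensor([0, 2, 1, 0, 2])                   # different users/tasks in one batch
lm_input = prepend_task_prefixes(batch_embeds, task_ids)   # shape: (5, prefix_len + 7, d_model)
```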

Prefix-tuning also extrapolates better to tables (for table-to-text) and articles (for summarization) with unseen topics.

2 Related Work

Fine-tuning for natural language generation. Current state-of-the-art systems for natural language generation (NLG) are based on fine-tuning pretrained LMs. For table-to-text generation, Kale (2020) fine-tunes a sequence-to-sequence model (T5; Raffel et al., 2020). For extractive and abstractive summarization, researchers fine-tune masked language models (e.g., BERT; Devlin et al., 2019) and encoder-decoder models (e.g., BART; Lewis et al., 2020), respectively (Zhong et al., 2020; Liu and Lapata, 2019; Raffel et al., 2020). For other conditional NLG tasks such as machine translation and dialogue generation, fine-tuning is also the prevalent paradigm (Zhang et al., 2020c; Stickland et al., 2020; Zhu et al., 2020; Liu et al., 2020). In this paper, we focus on table-to-text using GPT-2 and summarization using BART, but prefix-tuning can in principle be applied to other generation tasks and pretrained models, such as masked LMs.

Lightweight fine-tuning. Prefix-tuning falls under the broad class of lightweight fine-tuning methods, which freeze most of the pretrained parameters and only tune a smaller set of parameters.

The key question is how to augment the LM architecture and decide which subset of pretrained parameters to tune. One line of research learns a task-specific parameter mask (Zhao et al., 2020; Radiya-Dixit and Wang, 2020). Another line of research inserts new modules with trainable parameters. For example, Zhang et al. (2020a) train a side network that is fused with the pretrained model via summation; adapter-tuning inserts task-specific layers (adapters) between each layer of the pretrained LM (Houlsby et al., 2019; Lin et al., 2020; Rebuffi et al., 2017; Pfeiffer et al., 2020). Compared to this line of work, which tunes a larger fraction of the LM parameters, our method obtains a further 30x reduction in task-specific parameters, tuning only 0.1% while maintaining comparable performance on table-to-text tasks.

Prompting. Prompting is a way of leveraging a pretrained LM by prepending instructions and a few examples to the task input and generating the task output from the LM.
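Such parameter-budget comparisons amount to counting the trainable fraction of the model; a toy sketch of the bookkeeping (the layer sizes and prefix length here are made up, not the paper's measurements):

```python
import torch
import torch.nn as nn

def trainable_fraction(params) -> float:
    """Fraction of parameters an optimizer would actually update."""
    params = list(params)
    trainable = sum(p.numel() for p in params if p.requires_grad)
    return trainable / sum(p.numel() for p in params)

# Toy stand-in: a "large" frozen LM next to a small trainable prefix.
lm = nn.Linear(1024, 1024)
for p in lm.parameters():
    p.requires_grad = False
prefix = nn.Parameter(torch.zeros(10, 1024))

fraction = trainable_fraction(list(lm.parameters()) + [prefix])
print(f"{fraction:.4%} of all parameters are task-specific")
```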

For autoregressive LMs, the most successful form of prompting is GPT-3's in-context learning (Brown et al., 2020), which uses manually designed prompts to adapt its generation to different tasks in few-shot settings. For masked LMs like BERT and RoBERTa (Liu et al., 2019), prompt engineering has been explored for natural language understanding tasks (Jiang et al., 2020; Schick and Schütze, 2020). For example, AutoPrompt (Shin et al., 2020) searches for a sequence of discrete trigger words and concatenates it with each input to elicit sentiment or factual knowledge from BERT and RoBERTa. In contrast with AutoPrompt, our method optimizes continuous prefixes, which are more expressive; moreover, we focus on language generation tasks.

Continuous vectors have been used to steer LMs; for example, Subramani et al. (2020) showed that a pretrained LSTM language model can reconstruct arbitrary sentences by optimizing a continuous vector for each sentence, making the vector input-specific.

In contrast, prefix-tuning optimizes a task-specific prefix that applies to all instances of that task. As a result, unlike the previous work whose application is limited to sentence reconstruction, prefix-tuning can be applied to NLG tasks.

Controllable generation. Controllable generation aims to steer a pretrained language model to match a sentence-level attribute (e.g., positive sentiment or sports). Such control can happen at training time: Keskar et al. (2019) pretrain the language model (CTRL) to condition on metadata such as keywords or URLs. The control can also happen at decoding time, by weighted decoding (GeDi; Krause et al., 2020) or by iteratively updating the past activations (PPLM; Dathathri et al., 2020). However, there is no straightforward way to apply these controllable generation techniques to enforce fine-grained control over generated contents, as demanded by tasks like table-to-text and summarization.

P*-tuning. Prefix-tuning is an instance of a new class of methods that has emerged, which we call p*-tuning (since the other prominent instances, p-tuning and prompt-tuning, also start with p), all based on the idea of optimizing a continuous prefix or prompt.

