Instruction Position Matters in Sequence Generation With Large Language Models

Yijin Liu, Xianfeng Zeng, Fandong Meng∗ and Jie Zhou


Pattern Recognition Center, WeChat AI, Tencent Inc, China
{yijinliu, xianfzeng, fandongmeng, withtomzhou}@tencent.com

∗ Corresponding author.

Abstract

Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization, through instruction fine-tuning. The fine-tuning data is generally sequentially concatenated from a specific task instruction, an input sentence, and the corresponding response. Considering the locality modeled by the self-attention mechanism of LLMs, these models face the risk of instruction forgetting when generating responses for long input sentences. To mitigate this issue, we propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences. Theoretical analysis suggests that our straightforward method can alter the model's learning focus, thereby emphasizing the training of instruction-following capabilities. Concurrently, experimental results demonstrate that our approach consistently outperforms traditional settings across various model scales (1B / 7B / 13B) and different sequence generation tasks (translation and summarization), without any additional data or annotation costs. Notably, our method significantly improves the zero-shot performance on conditional sequence generation, e.g., up to 9.7 BLEU points on WMT zero-shot translation tasks.

(a) Pre-Instruction Mode: Instruction + Input sentence + Response
(b) Post-Instruction Mode: Input sentence + Instruction + Response

Figure 1: Example data in Pre-Instruction and Post-Instruction format. Different blocks represent textual data from different fields, while the '+' symbol signifies the concatenation operation for the textual data. For sequence generation tasks, the length of the input sentence is generally much larger than the length of the task instruction.

1 Introduction

In recent years, there has been a rapid emergence of large language models (LLMs) like GPT-4 and ChatGPT (https://chat.openai.com/chat), which have demonstrated excellent zero-shot capabilities without the need for supervised fine-tuning. These models have shown promising performance in various traditional natural language processing tasks (Wang et al., 2023; Jiao et al., 2023b; Kasneci et al., 2023; Hill-Yardin et al., 2023; Šlapeta, 2023; Aydın and Karaarslan, 2023). However, there is also a growing interest in open-source medium-sized language models, such as the LLaMA model with 13 billion parameters (Touvron et al., 2023) and the BLOOMZ language model with 7.1 billion parameters (Muennighoff et al., 2022), to meet the research and hardware deployment requirements.

To align the outputs of language models with human intentions and unlock their full potential, InstructGPT (Ouyang et al., 2022) uses a small amount of supervised data to construct instruction-following data for fine-tuning LLMs and conducts reinforcement learning to train the model based on human preferences. This approach of instruction fine-tuning has gained widespread adoption and following from both the academic and industrial communities (Brooks et al., 2023; Chung et al., 2022; Wei et al., 2022; Ahn et al., 2022; Wei et al., 2021; Aher et al., 2023).

Generally, the instruction-following data consists of three parts. Taking the machine translation task as an example, these parts include a specific task instruction (e.g., "Please translate the following paragraph from English to French"), an input sentence (the English sentence to be translated), and the final response (the corresponding French translation).
Since most large language models are based on the decoder-only structure of the Transformer (Vaswani et al., 2017; Radford et al., 2019; Brown et al., 2020) and are trained with a next-token prediction objective, these three parts of the instruction-following data are generally concatenated sequentially into a long natural sentence as the input for language models. Given that the self-attention mechanism in Transformer decoders tends to focus more on nearby words, i.e., the locality of self-attention modeling (Beltagy et al., 2020; Kitaev et al., 2019; Voita et al., 2019), there is a considerable risk of instruction forgetting when predicting responses for long input sentences. For example, when performing long text summarization tasks, the input sentence may contain thousands of tokens. Consequently, the model may be at risk of forgetting the initial task instruction when predicting responses, leading to the generation of responses that do not fully comply with the user's intent. In this paper, we refer to this issue as the instruction forgetting issue.

To alleviate the above issue for LLMs during instruction fine-tuning, we first observe that the relative position of the input sentence and the task instruction is crucial. Therefore, we propose a simple and straightforward solution, namely, placing the task instruction at the end of the input sentence (referred to as 'Post-Ins'). In this way, when the model predicts the final response, it effectively models the nearest preceding sequence, which is just the task instruction indicating what content should be generated next. For comparison, we refer to the data format in existing studies, where the task instruction is concatenated to the front of the input sentence, as Pre-Instruction (abbreviated as 'Pre-Ins').

To verify whether Post-Ins improves the instruction-following ability of language models and alleviates the instruction forgetting issue on long sentences compared to Pre-Ins, we first analyze the conditional probability characteristics of the models under both data formats with the trinomial Bayes formula. Through appropriate assumptions and formula derivations, we draw the following conclusions: (1) Pre-Ins tends to model a reverse conditional probability (e.g., the reverse translation probability), emphasizing the coverage of the input sentence while insufficiently modeling the task instruction. (2) Post-Ins is more inclined to model a conditional probability about the task instruction (e.g., predicting the task instruction given inputs and outputs), emphasizing the modeling of task instruction-following ability.

In addition to the theoretical analysis, we conduct extensive experiments based on two widely used large language models, LLaMA and BLOOMZ, with various parameter sizes ranging from 1.7 billion to 13 billion. We select two common sequence generation tasks as specific downstream tasks, namely, machine translation and long text summarization. The experimental results show that Post-Ins consistently outperforms Pre-Ins across various settings without using any additional supervised data. Furthermore, due to the superior modeling ability of task instructions, Post-Ins exhibits stronger task instruction generalization capabilities, resulting in significant performance gains in zero-shot translation tasks (e.g., up to a 9.7 BLEU score improvement). Finally, human analysis results confirm that Post-Ins generates responses more faithful to user instructions, with a noticeable improvement on the issue of hallucinated responses.

Our contributions can be summarized as follows (codes and data are available at https://github.com/Adaxry/Post-Instruction):

• We show that the position of the task instruction is a key factor in instruction fine-tuning with LLMs, and propose that relocating the task instruction after the input sequence (i.e., Post-Ins) can enhance the instruction-following ability of LLMs.

• Both our theoretical and experimental analyses demonstrate that Post-Ins pays greater attention to the model's instruction-following capabilities, yielding consistent performance improvements across two common sequence generation tasks.

2 Background

Instruction Fine-Tuning. InstructGPT (Ouyang et al., 2022) is the first to unveil the immense potential of instruction learning, namely, an InstructGPT model with 1.3 billion parameters can outperform the 175B GPT-3, despite having 100x fewer parameters. Stanford then released the Alpaca instruction-following dataset (Taori et al., 2023), which is constructed by the self-instruct data generation pipeline (Wang et al., 2022). In the field of machine translation, Parrot (Jiao et al., 2023a) builds contrastive and error-guided instructions to align the translation results of LLMs with human preferences. Subsequently, Zeng et al. (2023) further extend the error-guided instructions with token-level Direct Preference Optimization (Rafailov et al., 2023). To better transfer the sequence generation capabilities of LLMs, BayLing (Zhang et al., 2023b) proposes to conduct an interactive translation task for instruction fine-tuning. Although the aforementioned methods have made considerable progress, we argue that the Pre-Ins data format utilized in existing studies faces the potential risk of the instruction forgetting issue, which is exactly what we aim to address in this paper.
3 Approach

3.1 Definition

The standard instruction-following data format consists of three components: a specific task instruction inst, an input sentence inp, and the corresponding response res. Taking the machine translation task as an example, inst is a specific task instruction that directs the model to translate from the source language into the target language, while inp and res are respectively the source input sentence and the target translation. The inst, inp and res are then sequentially concatenated into a long sequence, which is fed into the LLMs for training in a teacher-forcing mode. Following existing studies (Taori et al., 2023; Jiao et al., 2023a), the cross-entropy loss is calculated merely on res, while inp and inst only participate in the forward encoding process. We provide a specific training example in the Pre-Ins format, as shown in Figure 2a.

(a) Pre-Ins

Instruction:
Translate the following sentences from [SRC] to [TGT].

Input:
At the start of this week the WADA had already raised the issue of tampering and demanded that Russia reply to the accusations within three weeks. Russia is now faced with the RUSADA again being excluded and in the worst case being banned from the 2020 Olympic Games in Tokyo.

Response:
Zu Beginn dieser Woche hatte bereits die Wada Manipulationen in den Raum gestellt und Russland aufgefordert, binnen drei Wochen auf die Vorwürfe zu antworten. Russland droht nun die erneute Sperre der Rusada und im schlimmsten Fall auch der Ausschluss von den Olympischen Spielen 2020 in Tokio.

(b) Post-Ins

Input:
At the start of this week the WADA had already raised the issue of tampering and demanded that Russia reply to the accusations within three weeks. Russia is now faced with the RUSADA again being excluded and in the worst case being banned from the 2020 Olympic Games in Tokyo.

Instruction:
Translate the above sentences from [SRC] to [TGT].

Response:
Zu Beginn dieser Woche hatte bereits die Wada Manipulationen in den Raum gestellt und Russland aufgefordert, binnen drei Wochen auf die Vorwürfe zu antworten. Russland droht nun die erneute Sperre der Rusada und im schlimmsten Fall auch der Ausschluss von den Olympischen Spielen 2020 in Tokio.

Figure 2: An example of Pre-Ins (a) and Post-Ins (b) formatted data for the machine translation task. '[SRC]' and '[TGT]' refer to the source and target language, which are respectively English and German in this example.

Considering the nature of sequence generation tasks, the input part inp often tends to be lengthy, such as translating an entire article or generating a summary of a paragraph. After applying fine-grained tokenization, this can result in a long sequence of tokens for training. Generally, mainstream LLMs are based on the decoder-only architecture of the Transformer, where self-attention tends to pay larger attention to nearby tokens. Therefore, in the case of long sequences as mentioned above, there is a significant risk that the model may forget the frontmost task instruction in the Pre-Ins data format, yielding responses that do not follow the task instruction.
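Concretely, the two data formats differ only in where the task instruction is concatenated. The following is a minimal Python sketch of building Pre-Ins and Post-Ins training sequences. The field markers follow the templates in Figure 2, while the tokenizer interface and the label-masking convention (ignoring prompt tokens with index -100 so that only res contributes to the cross-entropy loss, as stated above) are assumptions about one possible implementation, not the authors' released code.

```python
# Sketch of Pre-Ins vs. Post-Ins sequence construction and response-only loss masking.
IGNORE_INDEX = -100  # assumed ignore index for the cross-entropy loss


def build_example(inst, inp, res, mode="post"):
    """Concatenate instruction, input, and response in the chosen order."""
    if mode == "pre":    # Pre-Ins: Instruction + Input + Response
        prompt = f"Instruction:\n{inst}\n\nInput:\n{inp}\n\nResponse:\n"
    else:                # Post-Ins: Input + Instruction + Response
        prompt = f"Input:\n{inp}\n\nInstruction:\n{inst}\n\nResponse:\n"
    return prompt, prompt + res


def build_labels(tokenizer, inst, inp, res, mode="post"):
    """Token ids plus labels in which only the response tokens are supervised.

    Assumes the tokenized prompt is a prefix of the tokenized full sequence,
    which holds for typical subword tokenizers when the prompt ends with a newline.
    """
    prompt, full = build_example(inst, inp, res, mode)
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(full, add_special_tokens=False)["input_ids"]
    labels = [IGNORE_INDEX] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return full_ids, labels
```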
3.2 Preliminary Observations on Pre-Ins

To verify whether the Pre-Ins data format suffers from the above issue of instruction forgetting on long input sentences, we conduct preliminary experiments on the machine translation task (detailed experimental setups are in Section 4.1). We divide the training data into multiple groups based on the length range of the source text, ensuring that the total number of tokens in each group of training samples is approximately the same. Similarly, we also select corresponding test sets for different length ranges. Results on BLOOMZ are plotted in Figure 3: we observe that the model tends to perform better on test sets with lengths similar to the training set, while it often struggles on test sets with different lengths (the diagonal line in Figure 3). Specifically, a model trained on short sentences may face difficulties in translating longer sentences due to limitations in the distribution of training data. Interestingly, a model trained on longer sentences also performs poorly on shorter sentence datasets (as shown in the top-left corner of Figure 3). Based on human analysis, we discover a noticeable translation hallucination issue, where the model generates content that does not exist in the source text. These observations indicate that the existing Pre-Ins data format has limited ability to follow the instructions, especially when the input sentence inp is long. Pre-Ins exhibits a risk of instruction forgetting, resulting in outputs that are not faithful to the user's intent.

Figure 3: Performance of Pre-Ins over different length intervals in our preliminary experiments.
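The paper does not spell out the exact grouping procedure; the sketch below illustrates one way to split parallel data into source-length groups with roughly equal total token counts, as described above. The bucket boundaries, field names, and per-group token budget are illustrative assumptions.

```python
# Sketch of length-based grouping with approximately equal total source tokens per group.
def bucket_by_length(examples, tokenize, boundaries=(32, 64, 128, 256)):
    """Assign each example to a length interval based on its source length."""
    buckets = {b: [] for b in list(boundaries) + [float("inf")]}
    for ex in examples:
        n = len(tokenize(ex["source"]))
        for b in buckets:          # boundaries are ascending; inf catches the rest
            if n <= b:
                buckets[b].append(ex)
                break
    return buckets


def subsample_equal_tokens(buckets, tokenize, budget):
    """Keep examples in each bucket until it holds roughly `budget` source tokens."""
    kept = {}
    for b, exs in buckets.items():
        total, chosen = 0, []
        for ex in exs:
            n = len(tokenize(ex["source"]))
            if total + n > budget:
                break
            chosen.append(ex)
            total += n
        kept[b] = chosen
    return kept
```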
3.3 Post-Instruction

To address the above issue of instruction forgetting, we propose a simple and straightforward solution, namely, relocating the task instruction inst after the input sentence inp. As a result, the model can perceive the specific task instruction more closely when generating responses, regardless of the length of the input sentence. We refer to this data format as Post-Instruction (Post-Ins), and provide a Post-Ins formatted example in Figure 2b.

Formally, the Post-Ins format of data encourages the LLMs to model the following conditional probability p(res|inp, inst). Here, we can decompose the above formula using the trinomial Bayes' theorem as follows:

  p(res|inp, inst) = [p(res) · p(inp|res) · p(inst|res, inp)] / [p(inp) · p(inst|inp)]    (1)

where p(inp|res) represents the probability of the input given the response. Considering that LLMs primarily encounter natural language during the pre-training phase, we can make the basic assumption that the input inp and response res are independent when no specific task instruction is given, namely, p(inp|res) ≈ p(inp). Therefore, we can further simplify the above formula as follows:

  p(res|inp, inst) = [p(res) · p(inp|res) · p(inst|res, inp)] / [p(inp) · p(inst|inp)]
                   ≈ [p(res) · p(inst|res, inp)] / p(inst|inp)    (2)

Given that the task instruction inst is not involved in the training loss, we can simply treat its predicted probability p(inst|inp) as a constant, and get the following form:

  p(res|inp, inst) = p(res) · p(inst|res, inp) · const    (3)

where p(res) denotes the modeling probability of the target response, which guarantees the fluency of the model in predicting the response. On the other hand, p(inst|res, inp) represents the probability of the model determining what task instruction is currently being executed given the input inp and response res. This can ensure that the model has a strong perception of the requirements of the task instruction.

3.4 Post-Instruction versus Pre-Instruction

As a comparison, we have also conducted a similar theoretical analysis for Pre-Ins, and ultimately obtain the following formula:

  p(res|inst, inp) = [p(res) · p(inst|res) · p(inp|res, inst)] / [p(inst) · p(inp|inst)]
                   ≈ [p(res) · p(inp|res, inst)] / p(inp|inst)
                   = p(res) · p(inp|res, inst) · const    (4)

Similar to Post-Ins, Pre-Ins also includes a component responsible for modeling the fluency of the response, denoted as p(res). However, the key difference lies in the Pre-Ins emphasis on modeling the probability of the input given the instruction and response, namely, p(inp|res, inst), which is similar to the modeling of coverage in translation tasks (Tu et al., 2016). Such a modeling approach may be suitable for a single task or a small number of training tasks, as the model can memorize these few task instructions through supervised fine-tuning.
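For a quick side-by-side view, the two simplified decompositions above can be summarized as follows, reusing the independence assumption p(inp|res) ≈ p(inp) and dropping the constant factors; the factor labels correspond to the fluency, instruction, and coverage terms discussed above.

```latex
% Post-Ins versus Pre-Ins: which factor each data format emphasizes.
\begin{aligned}
\text{Post-Ins:}\quad p(res \mid inp, inst) &\propto \underbrace{p(res)}_{\text{fluency}} \cdot \underbrace{p(inst \mid res, inp)}_{\text{instruction following}} \\
\text{Pre-Ins:}\quad  p(res \mid inst, inp) &\propto \underbrace{p(res)}_{\text{fluency}} \cdot \underbrace{p(inp \mid res, inst)}_{\text{coverage}}
\end{aligned}
```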
However, considering that LLMs inherently have strong fundamental capabilities that can naturally be applied to various sequence generation tasks, when modeling multiple sequence generation tasks simultaneously (such as multiple translation directions), Pre-Ins may suffer from instruction forgetting and produce low-quality responses that do not follow instructions due to the lack of task instruction modeling. In contrast, Post-Ins, with its preference for directly modeling task instructions as shown in Equation (3), can easily handle various sequence generation tasks and has good transferability for task instructions. We experimentally verify the stronger instruction transferability of Post-Ins compared to Pre-Ins in zero-shot translation tasks in Section 5.1. Furthermore, we analyze the modeling preferences of the two data formats from the perspective of attention distribution and observe that the conclusions are consistent with our previous theoretical analysis in Section 5.3.

4 Experiments and Evaluations

4.1 Datasets

Alpaca. The Alpaca dataset, released by Stanford (Taori et al., 2023), is widely used for instruction-following tasks. It is constructed by the self-instruct data generation pipeline (Wang et al., 2022), utilizing the text-davinci-003 model to generate high-quality instruction-following data. The data format follows the aforementioned Pre-Ins format, consisting of three parts: instruction, input, and output. We adjust the positions of the instruction and input, yielding a Post-Ins formatted Alpaca. We apply this Post-Ins formatted Alpaca dataset to experiments that are conducted in the Post-Ins format, while the other experiments are still conducted on the original Alpaca dataset.
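Since the public Alpaca release stores each example as a JSON object with 'instruction', 'input', and 'output' fields, converting it to the Post-Ins layout only requires re-ordering the concatenation. The sketch below is a minimal illustration; the concrete prompt template and output field names are assumptions rather than the authors' exact preprocessing script.

```python
import json


def to_post_ins(example):
    """Rebuild an Alpaca example so the instruction follows the input (Post-Ins)."""
    inst = example["instruction"]
    inp = example.get("input", "")
    out = example["output"]
    if inp:
        prompt = f"{inp}\n\n{inst}\n\nResponse:\n"
    else:
        # Examples without an input field keep the instruction alone.
        prompt = f"{inst}\n\nResponse:\n"
    return {"prompt": prompt, "response": out}


with open("alpaca_data.json") as f:  # file name from the public Alpaca repository
    post_ins_data = [to_post_ins(ex) for ex in json.load(f)]
```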
WMT Datasets. The annual Conference on Machine Translation (Kocmi et al., 2022; Akhbardeh et al., 2021) provides high-quality human translations for evaluating cutting-edge machine translation systems. In the experiments of this paper, we utilize the development sets from 2017 to 2020 as high-quality translation training data, following existing settings (Jiao et al., 2023a; Zeng et al., 2023). For the translation directions with multiple references, we duplicate the source side and then match it with the corresponding translations to form multiple translation sentence pairs. Finally, we obtain a collection of 51k sentence pairs for instruction fine-tuning. To facilitate comparison, we follow the settings of existing methods (Jiao et al., 2023a; Zeng et al., 2023) and fine-tune LLMs on data for three languages and four translation directions: Chinese-to-English, English-to-Chinese, German-to-English, and English-to-German. The test sets for these four directions in WMT-2022 are used to evaluate translation performance, while the remaining directions, such as French-to-German or Russian-to-English, are used to evaluate the zero-shot performance of the models. Furthermore, considering the similarity in data distribution across the years in the WMT dataset (Barrault et al., 2020; Zeng et al., 2021), we also conduct evaluation and validation on another test set, namely, the FLORES-200 benchmark.

Multidimensional Quality Metrics (MQM). The MQM dataset is based on the outputs of top systems from the WMT 2020 shared task; it provides error analyses of these translations annotated by professional translators. We follow the preprocessing scripts of existing studies and finally obtain a training set of the same size with 99k examples (Jiao et al., 2023a; Zeng et al., 2023). In this paper, MQM is only used for the translation task.

CNN/DailyMail. The popular CNN/DailyMail dataset (See et al., 2017) is a collection of English-language news articles, comprising slightly over 300k unique articles authored by journalists from CNN and the Daily Mail. The average length of the source text of these data is about 665 words, or about one thousand tokens, which serves as a widely used benchmark for long text summarization (Tang et al., 2023; Zhang et al., 2023a; Lin et al., 2023). We follow the pre-processing and post-processing scripts of existing studies (Qi et al., 2020). We use the CNN/DailyMail dataset only for the text summarization task and conduct the evaluation on the standard test set with 11,490 samples.

4.2 Evaluation

Inference Settings. For all tasks, we set the batch size to 1 during inference to avoid the effect of the padding side (e.g., BLOOMZ applies left-padding mode, while LLaMA uses right-padding mode when batching the input data). We set the temperature coefficient to 0.1 in order to encourage the model to output more accurate rather than diverse responses for these conditional sequence generation tasks. As for the decoding strategies, we apply beam search for all tasks, and set the beam size to 4 for machine translation. For the text summarization task, we decrease the beam size to 2, as encoding the long input sentences consumes a large portion of GPU memory.
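As a rough illustration of these inference settings, the snippet below wires them into the Hugging Face transformers generate API. The checkpoint name, maximum generation length, and the use of this particular API are assumptions; only the beam sizes, the batch size of 1, and the temperature come from the text above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # illustrative checkpoint, not the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)


def generate_response(prompt, task="translation"):
    """Decode a single prompt (batch size 1) with the reported beam settings."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        num_beams=4 if task == "translation" else 2,  # beam 2 for long summarization inputs
        temperature=0.1,       # reported coefficient; only affects sampling-based decoding
        max_new_tokens=512,    # assumed generation budget
    )
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```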
Metrics. For the machine translation task, we use SacreBLEU (https://github.com/mjpost/sacrebleu) to calculate BLEU scores. Given the limitations of N-gram based metrics in measuring semantic similarity, we also calculate the popular neural-based metric COMET22 (https://github.com/Unbabel/COMET). For the text summarization task, we report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L following existing studies (Tang et al., 2023; Qi et al., 2020).
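A sketch of computing these metrics with commonly used packages is shown below; sacrebleu and rouge-score are assumptions about tooling (the paper only names the metrics), and the neural COMET22 score would additionally require the separate Unbabel COMET package, which is omitted here.

```python
import sacrebleu
from rouge_score import rouge_scorer


def bleu_score(hypotheses, references):
    """Corpus-level SacreBLEU; `references` holds one reference per hypothesis."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


def rouge_f1(hypotheses, references):
    """Average F1 of ROUGE-1 / ROUGE-2 / ROUGE-L, as reported for summarization."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for hyp, ref in zip(hypotheses, references):
        scores = scorer.score(ref, hyp)
        for key in totals:
            totals[key] += scores[key].fmeasure
    n = len(hypotheses)
    return {key: value / n for key, value in totals.items()}
```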

5 Main Results and Analysis

In this section, we first list the detailed experimental results of both the machine translation and text summarization tasks in Section 5.1 and Section 5.2. Subsequently, we show the analysis of the distributions of self-attention in Section 5.3, and human evaluation results in Section 5.4.

5.1 Results of Machine Translation

Supervised Translation. Table 1 presents a summary of experimental results on WMT22. Our proposed method, Post-Ins, consistently outperforms Pre-Ins in most settings of BLOOMZ. The maximum improvement in the BLEU score is observed to be +1.22, specifically in the En⇒De direction of BLOOMZ-3B. Additionally, COMET22 scores reach up to a +2.01 improvement in En⇒De translation using the 1.7B model. Our method proves to be even more effective on the LLaMA model, outperforming Pre-Ins in all settings. Particularly, LLaMA-7B achieves a remarkable increase of +6.67 BLEU and +4.96 COMET22 in En⇒Zh translation. It is worth noting that Post-Ins also surpasses existing translation approaches that are based on BLOOMZ and LLaMA. Table 2 showcases the performance of our method on the FLORES-200 test set. Our approach outperforms Pre-Ins in 13 out of 16 settings, with maximum improvements reaching +7.20 BLEU and +6.16 COMET22 score in En⇒Zh.

Systems | #Params | Instruction | SacreBLEU (De⇒En / En⇒De / Zh⇒En / En⇒Zh) | COMET22 (De⇒En / En⇒De / Zh⇒En / En⇒Zh)
WMT22 Winners | N/A | N/A | 33.70 / 38.40 / 33.50 / 54.30 | 85.46 / 88.09 / 81.12 / 87.84
— BLOOMZ-based —
Parrot (Jiao et al., 2023a) | 7.1B | Pre-Ins | 24.96 / 20.56 / 22.72 / 34.58 | 78.09 / 73.62 / 79.00 / 83.54
TIM (Zeng et al., 2023) | 7.1B | Pre-Ins | 24.31 / 20.63 / 23.42 / 37.20 | 77.65 / 74.16 / 79.50 / 84.89
BLOOMZ | 1.7B | Pre-Ins | 21.01 / 15.51 / 20.31 / 33.35 | 72.63 / 61.63 / 77.44 / 82.56
BLOOMZ | 1.7B | Post-Ins | 20.99 / 16.68 / 20.15 / 34.02 | 73.76 / 63.64 / 77.38 / 82.97
BLOOMZ | 3.0B | Pre-Ins | 23.29 / 17.02 / 22.20 / 35.02 | 75.42 / 66.96 / 78.85 / 83.33
BLOOMZ | 3.0B | Post-Ins | 23.70 / 18.24 / 22.21 / 35.62 | 76.12 / 68.64 / 78.70 / 83.77
BLOOMZ | 7.1B | Pre-Ins | 24.37 / 19.77 / 22.98 / 36.64 | 78.45 / 73.77 / 79.54 / 84.72
BLOOMZ | 7.1B | Post-Ins | 25.46 / 19.79 / 23.65 / 37.60 | 77.79 / 72.79 / 79.21 / 84.69
— LLaMA-based —
Parrot (Jiao et al., 2023a) | 7.0B | Pre-Ins | 27.38 / 26.14 / 20.23 / 30.33 | 82.47 / 81.67 / 75.90 / 80.34
BayLing (2023b) * | 7.0B | Pre-Ins | 28.16 / 25.66 / 20.31 / 38.19 | 83.19 / 82.18 / 77.48 / 84.43
TIM (Zeng et al., 2023) | 7.0B | Pre-Ins | 27.91 / 25.02 / 19.33 / 30.07 | 82.80 / 82.56 / 75.46 / 80.03
BayLing (2023b) * | 13.0B | Pre-Ins | 27.34 / 25.62 / 20.12 / 37.92 | 83.02 / 82.69 / 77.72 / 84.62
TIM (Zeng et al., 2023) | 13.0B | Pre-Ins | 29.03 / 26.71 / 20.27 / 32.14 | 83.48 / 83.31 / 76.64 / 81.30
LLaMA | 7.0B | Pre-Ins | 29.98 / 25.23 / 17.68 / 23.83 | 82.63 / 81.27 / 72.90 / 75.70
LLaMA | 7.0B | Post-Ins | 30.41 / 26.50 / 21.69 / 30.50 | 83.62 / 82.32 / 76.60 / 80.66
LLaMA | 13.0B | Pre-Ins | 30.92 / 28.51 / 21.95 / 32.55 | 84.03 / 83.14 / 77.02 / 81.16
LLaMA | 13.0B | Post-Ins | 31.25 / 28.70 / 22.37 / 33.04 | 84.19 / 83.65 / 77.33 / 82.16

Table 1: SacreBLEU and COMET22 scores (%) of different models with varying instruction modes on the WMT-2022 test sets. 'De', 'En' and 'Zh' are the language codes for German, English and Chinese, respectively. The bolded scores correspond to the best performance under the same or comparable settings for models with more than 7B parameters. Results marked with '*' indicate that they are not directly comparable with other results due to the use of additional supervised data.

Systems | #Params | Instruction | SacreBLEU (De⇒En / En⇒De / Zh⇒En / En⇒Zh) | COMET22 (De⇒En / En⇒De / Zh⇒En / En⇒Zh)
TIM (Zeng et al., 2023) | 7.0B | Pre-Ins | 39.15 / 29.31 / 22.30 / 28.43 | 88.19 / 85.05 / 83.32 / 80.55
LLaMA | 7.0B | Pre-Ins | 38.86 / 29.51 / 18.10 / 21.69 | 88.05 / 84.57 / 80.69 / 75.07
LLaMA | 7.0B | Post-Ins | 41.12 / 31.27 / 21.80 / 28.89 | 88.63 / 85.53 / 83.57 / 81.23
LLaMA | 13.0B | Pre-Ins | 41.78 / 33.62 / 22.21 / 30.74 | 88.91 / 86.36 / 84.26 / 82.50
LLaMA | 13.0B | Post-Ins | 41.46 / 33.12 / 22.62 / 31.37 | 88.91 / 86.23 / 84.38 / 82.83

Table 2: SacreBLEU and COMET22 scores (%) of different models with varying instruction modes on the FLORES-200 test sets. The bolded scores correspond to the best performance under the same or comparable settings.
Especially for LLaMA, the improvement in En⇔Zh translation is significantly larger than that in En⇔De translation. We hypothesize that this is due to the fact that the LLaMA vocabulary splits Chinese into individual characters, resulting in longer sentence lengths and an increased number of tokens, which aligns with the discussion in Section 3.4, namely, Post-Ins is naturally better equipped to handle generation for long sentences.

Zero-Shot Translation. Furthermore, we observe significant improvements in zero-shot translation under the Post-Ins mode. Table 3 reports the results of different instruction modes on the WMT22 zero-shot test sets. In terms of BLOOMZ, there is an impressive increase of +8.8 in De⇒Fr translation, with an average improvement of +0.7 BLEU. For LLaMA-7B, an average improvement of +1.9 BLEU is achieved, while LLaMA-13B exhibits an average improvement of +1.1 BLEU. Notably, LLaMA-13B showcases the highest improvement of +9.7 BLEU in De⇒Fr translation. Overall, the consistent improvements of Post-Ins over Pre-Ins indicate that Post-Ins exhibits stronger instruction generalization capabilities, being able to generate responses effectively even for task instructions it has never encountered during fine-tuning.

Systems | #Para. | Ins. | Cs⇔En | De⇔Fr | Ja⇔En | Uk⇔En | Ru⇔En | Liv⇔En | Average
BLOOMZ | 7.0B | Pre-Ins | 6.0 / 4.3 | 15.6 / 23.0 | 11.0 / 2.5 | 11.0 / 1.9 | 21.7 / 5.8 | 3.0 / 3.5 | 9.1
BLOOMZ | 7.0B | Post-Ins | 5.6 / 4.5 | 24.4 / 22.7 | 11.0 / 2.6 | 10.2 / 2.0 | 21.3 / 6.1 | 2.9 / 4.2 | 9.8
LLaMA | 7.0B | Pre-Ins | 36.8 / 13.7 | 3.0 / 3.4 | 12.2 / 4.8 | 33.9 / 4.6 | 34.8 / 16.8 | 5.9 / 2.6 | 14.3
LLaMA | 7.0B | Post-Ins | 36.8 / 17.4 | 3.2 / 8.8 | 12.1 / 7.3 | 34.6 / 11.7 | 34.7 / 18.9 | 5.0 / 3.3 | 16.2
LLaMA | 13.0B | Pre-Ins | 39.5 / 19.7 | 4.9 / 27.5 | 13.9 / 3.4 | 36.8 / 17.2 | 37.6 / 21.1 | 5.5 / 2.9 | 19.1
LLaMA | 13.0B | Post-Ins | 36.8 / 19.7 | 14.6 / 33.3 | 13.6 / 6.0 | 35.6 / 17.6 | 36.2 / 20.8 | 5.6 / 3.0 | 20.2

Table 3: SacreBLEU scores (%) of different models with varying instruction modes on the WMT-2022 zero-shot test sets. Each language-pair column lists the two translation directions. The bolded scores correspond to the best performance under the same or comparable settings. 'Para.' is short for 'Parameters' and 'Ins.' stands for the data format of the instruction-following data. 'Cs', 'Uk', 'Ja', 'Ru' and 'Liv' are the language codes for Czech, Ukrainian, Japanese, Russian and Livonian, respectively.

5.2 Results of Text Summarization

To further validate whether Post-Ins can effectively alleviate the issue of instruction forgetting, we conduct experiments on the long text summarization task, where the average length of input tokens is over 1,000. Table 4 presents the experimental results for the text summarization task on CNN/DailyMail, where we report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L. It is evident that all models achieve significant performance improvements, up to +3.49 for BLOOMZ-3B, when utilizing the Post-Ins approach. The effectiveness of Post-Ins on long sequence generation is further demonstrated by the superiority of its effect on the text summarization task.

#Params | Instruction | RG-1 | RG-2 | RG-L
— BLOOMZ-based —
3.0B | Pre-Ins | 35.41 | 16.33 | 25.81
3.0B | Post-Ins | 38.90 | 17.84 | 27.67
7.0B | Pre-Ins | 37.54 | 17.04 | 26.90
7.0B | Post-Ins | 38.61 | 17.64 | 27.49
— LLaMA-based —
7.0B | Pre-Ins | 37.55 | 17.17 | 26.30
7.0B | Post-Ins | 38.11 | 17.66 | 26.88

Table 4: F1 scores of ROUGE-1 / ROUGE-2 / ROUGE-L on the test set of CNN/DailyMail. 'RG' is an abbreviation for 'ROUGE'. The bolded scores correspond to the best performance under the same or comparable settings.
5.3 Distributions of Self-attention

[Figure 4 shows the attention heatmaps of (a) Pre-Ins and (b) Post-Ins. In (a), the axes span Instruction, Source input, and Target Response; attention is focused on the first token and on the source input, with word alignment via unsupervised learning and relatively weak attention on the instruction. In (b), the axes span Source input, Instruction, and Target Response; attention is focused on the instruction, with word alignment via unsupervised learning.]

Figure 4: The visualization analysis of self-attention for the instruction fine-tuned BLOOMZ-7.1B model, where thicker lines indicate higher attention for the corresponding positions, while thinner lines indicate lower attention.

Given that the distribution of self-attention can explain the behavior of the Transformer model to some extent (Hao et al., 2021; Mahmood et al., 2021; Dai et al., 2021; Braşoveanu and Andonie, 2020), we plot the heatmaps of self-attention for models trained with Pre-Ins and Post-Ins in Figure 4. We take BLOOMZ-7.1B as the base model and conduct forward propagation on the training samples of machine translation to obtain the attention scores. To mitigate the impact of fluctuations across multi-head attention and various layers, we average the scores of all heads over different layers to obtain the final score (a sketch of this procedure is given after the list of observations below). We reach the following observations:

• A higher distribution of attention scores is observed at the beginning of sentences and along the diagonal positions of the attention matrix, which is consistent with existing conclusions (Liu et al., 2023).

• As shown in the lower right part of Figure 4b, Post-Ins pays more attention to the specific task instruction when generating the response, while Pre-Ins mainly focuses on the source input and pays weak attention to the instruction, as shown in the upper left part of Figure 4a.

• After instruction fine-tuning on the machine translation data, models learn latent word alignment information under both data formats. Namely, models tend to allocate more attention to the aligned parts of the source when generating responses word by word, which is similar to the conclusions drawn for the traditional encoder-decoder structure in the field of machine translation (Bahdanau et al., 2014; Lample et al., 2017).
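As referenced before the list of observations, a minimal sketch of extracting such an averaged attention map is given below. The use of the output_attentions flag from Hugging Face transformers is an assumption about the implementation; the paper only states that scores are averaged over all heads and layers.

```python
import torch


@torch.no_grad()
def averaged_attention(model, tokenizer, text):
    """Forward one training sample and average self-attention over all heads and layers.

    Returns a (seq_len, seq_len) matrix; plotting it yields a heatmap like Figure 4.
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_attentions=True)
    # outputs.attentions: one (batch, heads, seq, seq) tensor per layer
    stacked = torch.stack(outputs.attentions)          # (layers, batch, heads, seq, seq)
    return stacked.mean(dim=(0, 2)).squeeze(0).cpu()   # average over layers and heads
```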

In summary, through the visualization analysis of the self-attention heatmaps, we observe that Post-Ins naturally tends to model task instructions, while Pre-Ins emphasizes task instructions to a relatively weaker extent. This finding is consistent with the theoretical analysis and conclusions presented earlier in Section 3.4.

5.4 Human Analysis

We employ two linguistics professionals to evaluate the English-to-Chinese translation task. Specifically, the annotators are requested to judge whether the model's output faithfully follows the translation instruction and the source input. If translation hallucinations occur, i.e., the response contains content that is not present in the source sentence, or if the translation task is not effectively executed, the label is marked as '0'; otherwise, it is marked as '1'. The manual annotation results on 1,000 samples show that the hallucination rate for Pre-Ins is 4.8%, while it is 1.7% for Post-Ins. We also provide several examples in Figure 5. In summary, Post-Ins can enhance the model's instruction-following capability and effectively reduce the proportion of prediction hallucinations.

[Figure 5 presents three case studies, each listing the English source input, the golden reference, the response of Post-Ins, and the response of Pre-Ins.]

Figure 5: Case studies on English-to-Chinese translation tasks, where the highlighted red texts indicate the model deviates from the translation instruction, generating content not present in the source text.

6 Conclusion

In conclusion, this paper highlights the importance of task instruction positioning in the instruction fine-tuning process of large language models (LLMs) for conditional sequence generation tasks. We propose a simple yet effective method, Post-Ins, which relocates the task instruction after the input sequence to enhance the instruction-following ability of LLMs. Our theoretical analysis and experimental results demonstrate that Post-Ins effectively shifts the model's learning focus, leading to improved performance across various model scales and different sequence generation tasks, such as machine translation and long text summarization. Notably, our method significantly boosts zero-shot performance without incurring additional data or annotation costs. These findings suggest that the proposed Post-Ins approach is a promising direction for further research and practical applications in the field of natural language processing and large-scale language models.

References

Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. 2023. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR.

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Ömer Aydın and Enis Karaarslan. 2023. Is ChatGPT leading generative AI? What is beyond expectations?

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.
Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Adrian M. P. Braşoveanu and Răzvan Andonie. 2020. Visualizing transformers for NLP: A brief survey. In 2020 24th International Conference Information Visualisation (IV), pages 270–279.

Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS 2020.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2021. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12963–12971.

Elisa L Hill-Yardin, Mark R Hutchinson, Robin Laycock, and Sarah J Spencer. 2023. A chat (gpt) about the future of scientific publishing. Brain Behav Immun, 110:152–154.

Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023a. Parrot: Translating during chat using large language models. In ArXiv.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023b. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.

Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2019. Reformer: The efficient transformer. In International Conference on Learning Representations.

Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, and Weizhu Chen. 2023. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. In International Conference on Machine Learning, pages 21051–21064. PMLR.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. 2021. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7838–7847.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2401–2410, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Jan Šlapeta. 2023. Are ChatGPT and other pretrained language models good parasitologists? Trends in Parasitology.

Moming Tang, Chengyu Wang, Jianing Wang, Cen Chen, Ming Gao, and Weining Qian. 2023. Parasum: Contrastive paraphrasing for low-resource extractive text summarization. In International Conference on Knowledge Science, Engineering and Management, pages 106–119. Springer.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 2017, pages 5998–6008.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jiali Zeng, Fandong Meng, Yongjing Yin, and Jie Zhou. 2023. TIM: Teaching LM to translate with comparison. In ArXiv.

Xianfeng Zeng, Yijin Liu, Ernan Li, Qiu Ran, Fandong Meng, Peng Li, Jinan Xu, and Jie Zhou. 2021. WeChat neural machine translation systems for WMT21. In Proceedings of the Sixth Conference on Machine Translation, pages 243–254, Online. Association for Computational Linguistics.

Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2023a. Diffusum: Generation enhanced extractive summarization with diffusion. arXiv preprint arXiv:2305.01735.

Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, and Yang Feng. 2023b. Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.