Response:
Zu Beginn dieser Woche hatte bereits die Wada Manipulationen in den Raum gestellt und Russland aufgefordert, binnen drei Wochen auf die Vorwürfe zu antworten. Russland droht nun die erneute Sperre der Rusada und im schlimmsten Fall auch der Ausschluss von den Olympischen Spielen 2020 in Tokio.
Figure 2: An example of Pre-Ins formatted data for the machine translation tasks. ‘[SRC]’ and ‘[TGT]’ refer to the source and target language, which are respectively English and German in this example.
translation, ParroT (Jiao et al., 2023a) builds contrastive and error-guided instructions to align the translation results of LLMs with human preferences. Subsequently, Zeng et al. (2023) further extend the error-guided instructions with token-level Direct Preference Optimization (Rafailov et al., 2023). To better transfer the sequence generation capabilities of LLMs, BayLing (Zhang et al., 2023b) proposes to conduct an interactive translation task for instruction fine-tuning. Although the aforementioned methods have made considerable progress, we argue that the Pre-Ins data format utilized in existing studies faces a potential risk of instruction forgetting, which is exactly the issue we aim to address in this paper.

shown in Figure 2a. Considering the nature of sequence generation tasks, the input part (ins) often tends to be lengthy, such as translating an entire article or generating a summary of a paragraph. After applying fine-grained tokenization, this can result in a long sequence of tokens for training. Generally, mainstream LLMs are based on the decoder-only Transformer architecture, where self-attention tends to pay more attention to nearby tokens. Therefore, in the case of long sequences as mentioned above, there is a significant risk that the model may forget the frontmost task instruction in the Pre-Ins data format, yielding responses that do not follow the task instruction.
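For concreteness, the two orderings can be sketched as follows. This is only an illustrative sketch: the template wording and function names here are not the exact prompts used in the experiments; the point is solely where the task instruction is placed relative to a potentially long input.

```python
# Illustrative sketch of the Pre-Ins and Post-Ins orderings.
# The template wording is hypothetical; only the position of the
# task instruction relative to the (long) input is what matters.

def build_pre_ins(instruction: str, source: str) -> str:
    """Pre-Ins: instruction first, then the (possibly very long) input."""
    return (
        f"{instruction}\n"      # task instruction at the very front
        f"Input: {source}\n"    # long source text follows
        f"Response: "           # model continues from here
    )

def build_post_ins(instruction: str, source: str) -> str:
    """Post-Ins: input first, instruction placed right before the response."""
    return (
        f"Input: {source}\n"    # long source text comes first
        f"{instruction}\n"      # instruction sits next to the response
        f"Response: "           # model continues from here
    )

if __name__ == "__main__":
    instr = "Translate the text from English to German."
    src = "Earlier this week, WADA had already suggested manipulation ..."
    print(build_pre_ins(instr, src))
    print(build_post_ins(instr, src))
```

Under Pre-Ins, a long source pushes the instruction far away from the response tokens, whereas under Post-Ins the instruction always sits immediately before the response.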
Table 1: SacreBLEU and COMET22 scores (%) of different models with varying instruction modes on the WMT-2022 test sets. ‘De’, ‘En’ and ‘Zh’ are the language codes for ‘German’, ‘English’ and ‘Chinese’, respectively. The bolded scores correspond to the best performance under the same or comparable settings for models with more than 7B parameters. Results marked with ‘*’ indicate that they are not directly comparable with other results due to the use of additional supervised data.
the beam search for all tasks, and set the beam size to 4 for machine translation. For the text summarization task, we have to decrease the beam size to 2, as encoding the long input sentences consumes a large portion of GPU memory.

Metrics For the machine translation task, we use SacreBLEU (https://github.com/mjpost/sacrebleu) to calculate BLEU scores. Given the limitations of N-gram-based metrics in measuring semantic similarity, we also calculate the popular neural-based metric COMET22 (https://github.com/Unbabel/COMET). For the text summarization task, we report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L, following existing studies (Tang et al., 2023; Qi et al., 2020).
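As an illustration of this evaluation setup (not the exact scripts used in our experiments), the sketch below computes corpus-level SacreBLEU and ROUGE F1 with the publicly available sacrebleu and rouge-score packages; the commented COMET22 call follows the Unbabel/COMET README and may differ across library versions.

```python
# Minimal evaluation sketch: assumes `sacrebleu` and `rouge-score` are installed.
import sacrebleu
from rouge_score import rouge_scorer

hyps = ["Russia now faces a renewed suspension of RUSADA."]
refs = ["Russia is now threatened with a renewed suspension of RUSADA."]

# Corpus-level BLEU via SacreBLEU (https://github.com/mjpost/sacrebleu).
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"SacreBLEU: {bleu.score:.2f}")

# ROUGE-1/2/L F1 scores for the summarization task.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(target=refs[0], prediction=hyps[0])
print({name: round(score.fmeasure, 4) for name, score in rouge.items()})

# COMET22 (https://github.com/Unbabel/COMET); requires a model download,
# and the exact API may vary across COMET versions.
# from comet import download_model, load_from_checkpoint
# ckpt = download_model("Unbabel/wmt22-comet-da")
# comet = load_from_checkpoint(ckpt)
# out = comet.predict([{"src": "...", "mt": hyps[0], "ref": refs[0]}], gpus=0)
# print(out.system_score)
```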
5 Main Results and Analysis

In this section, we first list the detailed experimental results of both the machine translation and text summarization tasks in Section 5.1 and Section 5.2. Subsequently, we show the analysis of the distributions of self-attention in Section 5.3, and human evaluation results in Section 5.4.

5.1 Results of Machine Translation

Supervised Translation. Table 1 presents a summary of the experimental results on WMT22. Our proposed method, Post-Ins, consistently outperforms Pre-Ins in most settings of BLOOMZ. The maximum improvement in the BLEU score is observed to be +1.22, specifically in the En⇒De direction of BLOOMZ-3B. Additionally, COMET22 scores reach up to a +2.01 improvement in the En⇒De translation using the 1.7B model. Our method proves to be even more effective on the LLaMA model, outperforming Pre-Ins in all settings. Particularly, LLaMA-7B achieves a remarkable increase of +6.67 BLEU and +4.96 COMET22 in En⇒Zh translation. It is worth noting that Post-Ins also surpasses existing translation approaches that are based on BLOOMZ and LLaMA. Table 2 showcases the performance of our method on the FLORES-200 test set. Our approach outperforms Pre-Ins in 13 out of 16 settings, with maximum improvements reaching +7.20 BLEU and +6.16 COMET22 score in
Systems                   #Params   Instruction   SacreBLEU                       COMET22
                                                  De ⇐⇒ En        Zh ⇐⇒ En        De ⇐⇒ En        Zh ⇐⇒ En
TIM (Zeng et al., 2023)   7.0B      Pre-Ins       39.15   29.31   22.30   28.43   88.19   85.05   83.32   80.55
LLaMA                     7.0B      Pre-Ins       38.86   29.51   18.10   21.69   88.05   84.57   80.69   75.07
LLaMA                     7.0B      Post-Ins      41.12   31.27   21.80   28.89   88.63   85.53   83.57   81.23
LLaMA                     13.0B     Pre-Ins       41.78   33.62   22.21   30.74   88.91   86.36   84.26   82.50
LLaMA                     13.0B     Post-Ins      41.46   33.12   22.62   31.37   88.91   86.23   84.38   82.83
Table 2: SacreBLEU and COMET22 scores (%) of different models with varying instruction modes on the FLORES-200 test sets. The bolded scores correspond to the best performance under the same or comparable settings.
SacreBLEU
Systems   #Para.   Ins.       Cs ⇔ En      De ⇔ Fr      Ja ⇔ En      Uk ⇔ En      Ru ⇔ En      Liv ⇔ En    Average
BLOOMZ    7.0B     Pre-Ins    6.0   4.3    15.6  23.0   11.0  2.5    11.0  1.9    21.7  5.8    3.0   3.5   9.1
BLOOMZ    7.0B     Post-Ins   5.6   4.5    24.4  22.7   11.0  2.6    10.2  2.0    21.3  6.1    2.9   4.2   9.8
LLaMA     7.0B     Pre-Ins    36.8  13.7   3.0   3.4    12.2  4.8    33.9  4.6    34.8  16.8   5.9   2.6   14.3
LLaMA     7.0B     Post-Ins   36.8  17.4   3.2   8.8    12.1  7.3    34.6  11.7   34.7  18.9   5.0   3.3   16.2
LLaMA     13.0B    Pre-Ins    39.5  19.7   4.9   27.5   13.9  3.4    36.8  17.2   37.6  21.1   5.5   2.9   19.1
LLaMA     13.0B    Post-Ins   36.8  19.7   14.6  33.3   13.6  6.0    35.6  17.6   36.2  20.8   5.6   3.0   20.2
Table 3: SacreBLEU scores (%) of different models with varying instruction modes on the WMT-2022 zero-shot test sets. The bolded scores correspond to the best performance under the same or comparable settings. ‘Para.’ is short for ‘Parameters’ and ‘Ins.’ stands for the data format of the instruction-following data. ‘Cs’, ‘Uk’, ‘Ja’, ‘Ru’ and ‘Liv’ are the language codes for ‘Czech’, ‘Ukrainian’, ‘Japanese’, ‘Russian’ and ‘Livonian’, respectively.
[Figure omitted: attention-flow diagrams for the two data formats. (a) Pre-Ins (Instruction, Source input, Target Response): focused attention on the first token and on the source input, relatively weak attention on the instruction. (b) Post-Ins (Source input, Instruction, Target Response): focused attention on the source input and on the instruction.]
Figure 4: The visualization analysis of self-attention for the instruction fine-tuned BLOOMZ-7.1B model, where thicker lines indicate higher attention for the corresponding positions, while thinner lines indicate lower attention.
[Figure omitted: an English source segment (‘Connect your eReader to a power source by doing one of the following:’) together with candidate Chinese translations, one of which adds step-by-step content that does not appear in the source.]
Figure 5: Case studies on English-to-Chinese translation tasks, where the highlighted red texts indicate the model deviates from the translation instruction, generating content not present in the source text.
that all models achieved significant performance improvements, up to +3.49 in BLOOMZ-3B, when utilizing the Post-Ins approach. The effectiveness of Post-Ins on long sequence generation is further demonstrated by its superiority on the text summarization task.

5.3 Distributions of Self-attention

Given that the distribution of self-attention can explain the behavior of the Transformer model to some extent (Hao et al., 2021; Mahmood et al., 2021; Dai et al., 2021; Braşoveanu and Andonie, 2020), we plot the heatmap of self-attention for models trained with Pre-Ins and Post-Ins in Figure 4. We take BLOOMZ-7.1B as the base model and conduct forward propagation on the training samples of machine translation to obtain the attention scores. To mitigate the impact of fluctuations across attention heads and layers, we average the scores of all heads over different layers to obtain the final score. We reach the following observations:

• A higher distribution of attention scores is observed at the beginning of sentences and along the diagonal positions of the attention matrix, which is consistent with existing conclusions (Liu et al., 2023).

• As shown in the lower right part of Figure 4b, Post-Ins pays more attention to the specific task instruction when generating the response, while Pre-Ins mainly focuses on the source input and pays only weak attention to the instruction, as shown in the upper left part of Figure 4a.
• After instruction fine-tuning on the machine translation data, models learn latent word alignment information under both data formats. Namely, models tend to allocate more attention to the aligned parts of the source when generating responses word by word, which is similar to the conclusions drawn for the traditional encoder-decoder structure in the field of machine translation (Bahdanau et al., 2014; Lample et al., 2017).

In summary, through the visualization analysis of the self-attention heatmap, we observe that Post-Ins naturally tends to model task instructions, while Pre-Ins emphasizes task instructions to a relatively weaker extent. This finding is consistent with the theoretical analysis and conclusions presented earlier in Section 3.4.
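A minimal sketch of the head- and layer-averaging step described above is given below, using the Hugging Face transformers API. The checkpoint name and the prompt are placeholders (the fine-tuned Pre-Ins/Post-Ins BLOOMZ-7.1B checkpoints themselves are not reproduced here), and only the averaging of attention maps is shown, not the plotting.

```python
# Sketch of extracting and averaging self-attention maps.
# The checkpoint is a small public stand-in, not a fine-tuned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "bigscience/bloomz-560m"  # placeholder for a fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
model.eval()

# Illustrative Post-Ins style prompt.
text = ("Input: Connect your eReader to a power source.\n"
        "Translate the text above into Chinese.\n"
        "Response:")
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: tuple of per-layer tensors, each (batch, heads, seq, seq).
# Average over layers and heads to obtain a single (seq, seq) attention map.
attn = torch.stack(out.attentions)   # (layers, batch, heads, seq, seq)
avg_map = attn.mean(dim=(0, 2))[0]   # (seq, seq), ready to plot as a heatmap
print(avg_map.shape)
```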
5.4 Human Analysis

We employ two linguistics professionals to evaluate the English-to-Chinese translation task. Specifically, the annotators are requested to judge whether the model's output faithfully follows the translation instruction and the source input. If translation hallucinations occur, i.e., the response contains content that is not present in the source sentence, or if the translation task is not effectively executed, the label is marked as ‘0’; otherwise, it is marked as ‘1’. The manual annotation results on 1,000 samples show that the hallucination rate is 4.8% for Pre-Ins and 1.7% for Post-Ins. We also provide several examples in Figure 5. In summary, Post-Ins can enhance the model's instruction-following capability and effectively reduce the proportion of prediction hallucinations.

6 Conclusion

In conclusion, this paper highlights the importance of task instruction positioning in the instruction fine-tuning process of large language models (LLMs) for conditional sequence generation tasks. We propose a simple yet effective method, Post-Ins, which relocates the task instruction after the input sequence to enhance the instruction-following ability of LLMs. Our theoretical analysis and experimental results demonstrate that Post-Ins effectively shifts the model's learning focus, leading to improved performance across various model scales and different sequence generation tasks, such as machine translation and long text summarization. Notably, our method significantly boosts zero-shot performance without incurring additional data or annotation costs. These findings suggest that the proposed Post-Ins approach is a promising direction for further research and practical applications in the field of natural language processing and large-scale language models.

References

Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. 2023. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR.

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussà, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Ömer Aydın and Enis Karaarslan. 2023. Is ChatGPT leading generative AI? What is beyond expectations?

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.
Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Adrian M. P. Braşoveanu and Răzvan Andonie. 2020. Visualizing transformers for NLP: A brief survey. In 2020 24th International Conference Information Visualisation (IV), pages 270–279.

Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS 2020.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2021. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12963–12971.

Elisa L Hill-Yardin, Mark R Hutchinson, Robin Laycock, and Sarah J Spencer. 2023. A chat (GPT) about the future of scientific publishing. Brain Behav Immun, 110:152–154.

Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023a. ParroT: Translating during chat using large language models. arXiv preprint.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023b. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.

Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2019. Reformer: The efficient transformer. In International Conference on Learning Representations.

Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, and Weizhu Chen. 2023. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. In International Conference on Machine Learning, pages 21051–21064. PMLR.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. 2021. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7838–7847.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2401–2410, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Jan Šlapeta. 2023. Are ChatGPT and other pretrained language models good parasitologists? Trends in Parasitology.

Moming Tang, Chengyu Wang, Jianing Wang, Cen Chen, Ming Gao, and Weining Qian. 2023. ParaSum: Contrastive paraphrasing for low-resource extractive text summarization. In International Conference on Knowledge Science, Engineering and Management, pages 106–119. Springer.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jiali Zeng, Fandong Meng, Yongjing Yin, and Jie Zhou. 2023. TIM: Teaching LM to translate with comparison. arXiv preprint.

Xianfeng Zeng, Yijin Liu, Ernan Li, Qiu Ran, Fandong Meng, Peng Li, Jinan Xu, and Jie Zhou. 2021. WeChat neural machine translation systems for WMT21. In Proceedings of the Sixth Conference on Machine Translation, pages 243–254, Online. Association for Computational Linguistics.

Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2023a. DiffuSum: Generation enhanced extractive summarization with diffusion. arXiv preprint arXiv:2305.01735.

Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, and Yang Feng. 2023b. BayLing: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.