Fantastically Ordered Prompts and Where To Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu† Max Bartolo† Alastair Moore‡ Sebastian Riedel† Pontus Stenetorp†
† University College London ‡ Mishcon de Reya LLP
{yao.lu,m.bartolo,s.riedel,p.stenetorp}@cs.ucl.ac.uk
alastair.moore@mishcon.com
Abstract

When primed with only a handful of training samples, very large, pretrained language models [...]

Figure 6: Left: Predicted SST-2 label distribution under different prompts. Right: 2-shot calibrated performance (Zhao et al., 2021) of all possible permutations on GPT2-XL (1.5B).

[...] ordering that is performant across different models.

Degenerate behaviour of bad prompts We perform error analysis across performant and non-performant prompts and observe that the majority of failing prompts suffer from highly unbalanced predicted label distributions (Figure 6, left). An intuitive way to address this would be to calibrate the output distribution, along the lines of Zhao et al. (2021). However, we find that although calibration leads to much higher performance, the variance remains high (Figure 6, right).
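For concreteness, the calibration referenced here can be sketched as follows. This is our own minimal rendering of the contextual calibration of Zhao et al. (2021), not part of this paper's method: the label distribution the model assigns to a content-free input (e.g. "N/A") under the same prompt is divided out, and the result renormalised.

```python
import numpy as np

def calibrate(label_probs, content_free_probs):
    # Contextual calibration (after Zhao et al., 2021): divide out the
    # bias the model shows on a content-free input such as "N/A",
    # then renormalise over the label set.
    scores = label_probs / content_free_probs
    return scores / scores.sum(axis=-1, keepdims=True)
```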
3 Methodology

The previous section demonstrates that prompt order can have a substantial effect on performance, with some orderings of the same prompts for the same model providing random performance, and other “better” orderings providing performance competitive with supervised approaches. This suggests that there could be various ways of selecting prompt orders to achieve better performance, but the challenge is to do so automatically and without the need for additional labels (e.g., a development set).

Hence, in this section, we explore the question: “How can we automatically generate a ‘probing set’ to find performant prompt orderings?” We approach this by: (i) for a randomly-selected set of training samples, using every possible ordering permutation of this set as candidates; (ii) constructing a probing set by querying the language model using all candidate prompts as context; and (iii) using this probing set to identify the best ordering by ranking the candidates with a probing metric.

3.1 Sampling from the Language Model to Construct a Probing Set

We propose a simple methodology to automatically construct a “probing set” by directly sampling from the language model itself, without relying on additional held-out data. Concretely, given a set of training samples S = {(x_i, y_i)}, i = 1, ..., n, where x_i and y_i denote the sentence and label of the i-th training sample, we define a transformation T mapping each sample into natural language space, such that t_i = T(x_i, y_i). t_i is therefore a text sequence rendering of the i-th training sample using the template defined by T. In this work, we use a simple transformation function T such that T(x_i, y_i) = input: x_i type: y_i. This transforms each sample into a standard-format sentence, linearising each element of the set into natural language space, defined as S′ = {t_i}, i = 1, ..., n.

We then define the full permutation function group over the n training samples, F = {f_m}, m = 1, ..., n!, where each function f_m takes S′ as input and outputs c_m: the concatenation of a unique permutation. In our case, sampling four training samples at random gives up to 24 possible ordering permutations of the transformed samples.

For each prompt candidate c_m, we then sample from the language model to obtain the probing sequence g_m ∼ P(·|c_m; θ), where θ denotes the parameters of the pretrained language model. We stop decoding from the language model upon generating the special end-of-sentence token defined by the template, or upon reaching the generation length limit. Our probing set construction method is illustrated in Figure 7, where the objective is to generate a probing set that shares a similar distribution to the training samples.

We run this sampling process for all possible prompt ordering permutations and extract probing samples from them (T⁻¹(g)). We then gather the extracted samples together to form the probing set D = T⁻¹(g_1) ⊕ ... ⊕ T⁻¹(g_n!). Although the probing set contains a predicted label for each sentence, there is no guarantee of the validity of these labels. We therefore discard them from the probing set, as we are only interested in sampling probes from the language model that correspond to the input distribution.
Figure 7: Our probing set construction method, showing the various possible ordering permutations of the randomly selected training samples, the resulting generation for each permutation, and the concatenation of each into a probing set. Note that we discard the generated labels, as there is no guarantee that these generated labels are correct.
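To make the construction concrete, below is a minimal sketch of the sampling loop using GPT-2 through the HuggingFace transformers library. This is our illustration rather than the authors' released code: the toy 4-shot samples are invented, and the exact n-gram block size is an assumption (the paper only states that n-gram repetitions are blocked).

```python
from itertools import permutations
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def transform(x, y):
    # T(x_i, y_i): linearise a training sample with the paper's template.
    return f"input: {x} type: {y}"

# Toy 4-shot sentiment training set (invented for illustration).
train = [("a gorgeous film .", "positive"), ("utterly lifeless .", "negative"),
         ("a small triumph .", "positive"), ("simply unwatchable .", "negative")]
transformed = [transform(x, y) for x, y in train]

probing_sequences = []
for perm in permutations(transformed):      # all 4! = 24 candidate orderings c_m
    context = "\n".join(perm) + "\n"
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    output = model.generate(
        input_ids,
        do_sample=True,
        temperature=2.0,                    # Section 4: sampling temperature t = 2
        no_repeat_ngram_size=3,             # block n-gram repetition (n assumed)
        max_new_tokens=128,                 # Section 4: generation length limit
        pad_token_id=tokenizer.eos_token_id,
    )
    g_m = tokenizer.decode(output[0, input_ids.shape[1]:])
    probing_sequences.append(g_m)           # later: apply T^-1 and discard labels
```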
3.2 Probing Metrics

Once we have constructed a probing set for a given set of samples, we can use that probing set to identify the best possible prompt ordering for that particular sample set. Here, we explore two methods for selecting the best ordering: Global Entropy (GlobalE) and Local Entropy (LocalE).

Global Entropy (GlobalE) For each label v ∈ V (where V denotes the target label set), we compute the label probability over the probing set as:

p^v_m = (Σ_i 1{ŷ_{i,m} = v}) / |D|    (2)

We then use the predicted category label entropy as the GlobalE score for c_m, as follows:

GlobalE_m = Σ_{v∈V} −p^v_m log p^v_m    (3)

Local Entropy (LocalE) The motivation behind LocalE is that if a model is overly confident for all probing inputs, then it is likely that the model is not behaving as desired. At the very least, it is poorly calibrated, which could also indicate a poor capability to appropriately differentiate between classes. Similar to the GlobalE computation, we calculate the prediction probability of a data point (x′_i, y′_i) over the target labels v ∈ V under context c_m, as follows:

p^v_{i,m} = P_{(x′_i, y′_i)∼D}(v | c_m ⊕ T(x′_i); θ), v ∈ V    (4)

We then calculate the average prediction entropy per data point as the LocalE score:

LocalE_m = (Σ_i Σ_{v∈V} −p^v_{i,m} log p^v_{i,m}) / |D|    (5)
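Both metrics reduce to a few lines of code. The sketch below is our own NumPy illustration of Equations 2 to 5, not the authors' implementation; it assumes the probing set has already been scored under a candidate ordering c_m, giving hard label predictions for GlobalE and a per-example probability matrix for LocalE.

```python
import numpy as np

def global_entropy(pred_labels, num_labels):
    # GlobalE (Eqs. 2-3): entropy of the predicted-label distribution
    # over the whole probing set D under one candidate ordering c_m.
    p = np.bincount(pred_labels, minlength=num_labels) / len(pred_labels)
    p = p[p > 0]                                   # avoid log(0)
    return float(-(p * np.log(p)).sum())

def local_entropy(label_probs):
    # LocalE (Eqs. 4-5): mean per-example prediction entropy.
    # label_probs has shape (|D|, |V|); each row sums to 1.
    eps = 1e-12
    per_point = -(label_probs * np.log(label_probs + eps)).sum(axis=1)
    return float(per_point.mean())
```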
Method SST-2 SST-5 DBPedia MR CR MPQA Subj TREC AGNews RTE CB
Majority 50.9 23.1 9.4 50.0 50.0 50.0 50.0 18.8 25.0 52.7 51.8
Finetuning (Full) 95.0 58.7 99.3 90.8 89.4 87.8 97.0 97.4 94.7 80.9 90.5
GPT-2 0.1B 58.9±7.8 29.0±4.9 44.9±9.7 58.6±7.6 58.4±6.4 68.9±7.1 52.1±0.7 49.2±4.7 50.8±11.9 49.7±2.7 50.1±1.0
LocalE 65.2±3.9 34.4±3.4 53.3±4.9 66.0±6.3 65.0±3.4 72.5±6.0 52.9±1.3 48.0±3.9 61.0±5.9 53.0±3.3 49.9±1.6
GlobalE 63.8±5.8 35.8±2.0 56.1±4.3 66.4±5.8 64.8±2.7 73.5±4.5 53.0±1.3 46.1±3.7 62.1±5.7 53.0±3.0 50.3±1.6
Oracle 73.5±1.7 38.2±4.0 60.5±4.2 74.3±4.9 70.8±4.4 81.3±2.5 55.2±1.7 58.1±4.3 70.3±2.8 56.8±2.0 52.1±1.3
GPT-2 0.3B 61.0±13.2 25.9±5.9 51.7±7.0 54.2±7.8 56.7±9.4 54.5±8.8 54.4±7.9 52.6±4.9 47.7±10.6 48.8±2.6 50.2±5.3
LocalE 75.3±4.6 31.0±3.4 47.1±3.7 65.2±6.6 70.9±6.3 67.6±7.2 66.7±9.3 53.0±3.9 51.2±7.3 51.8±1.0 47.1±4.2
GlobalE 78.7±5.2 31.7±5.2 58.3±5.4 67.0±5.9 70.7±6.7 68.3±6.9 65.8±10.1 53.3±4.6 59.6±7.2 51.1±1.9 50.3±3.7
Oracle 85.5±4.3 40.5±6.3 65.2±7.6 74.7±6.1 80.4±5.4 77.3±2.3 79.4±2.4 63.3±2.9 68.4±8.0 53.9±1.3 62.5±7.4
GPT-2 0.8B 74.5±10.3 34.7±8.2 55.0±12.5 64.6±13.1 70.9±12.7 65.5±8.7 56.4±9.1 56.5±2.7 62.2±11.6 53.2±2.0 38.8±8.5
LocalE 81.1±5.5 40.3±4.7 56.7±7.5 82.6±4.2 85.4±3.8 73.6±4.8 70.4±4.2 56.2±1.7 62.7±8.1 53.3±1.6 38.4±5.2
GlobalE 84.8±4.1 46.9±1.1 67.7±3.6 84.3±2.9 86.7±2.5 75.8±3.1 68.6±6.5 57.2±2.3 70.7±3.6 53.5±1.5 41.2±4.5
Oracle 88.9±1.8 48.4±0.7 72.3±3.3 87.5±1.1 89.9±0.9 80.3±4.9 76.6±4.1 62.1±1.5 78.1±1.3 57.3±1.0 53.2±5.3
GPT-2 1.5B 66.8±10.8 41.7±6.7 82.6±2.5 59.1±11.9 56.9±9.0 73.9±8.6 59.7±10.4 53.1±3.3 77.6±7.3 55.0±1.4 53.8±4.7
LocalE 76.7±8.2 45.1±3.1 83.8±1.7 78.1±5.6 71.8±8.0 78.5±3.6 69.7±5.8 53.6±3.1 79.3±3.7 56.8±1.1 52.6±3.9
GlobalE 81.8±3.9 43.5±4.5 83.9±1.8 77.9±5.7 73.4±6.0 81.4±2.1 70.9±6.0 55.5±3.0 83.9±1.2 56.3±1.2 55.1±4.6
Oracle 86.1±1.5 50.9±1.0 87.3±1.5 84.0±2.7 80.3±3.3 85.1±1.4 79.9±5.7 59.0±2.3 86.1±0.7 58.2±0.6 63.9±4.3
GPT-3 2.7B 78.0±10.7 35.3±6.9 81.1±1.8 68.0±12.9 76.8±11.7 66.5±10.3 49.1±2.9 55.3±4.4 72.9±4.8 48.6±1.9 50.4±0.7
LocalE 81.0±6.0 42.3±4.7 80.3±1.7 75.6±4.1 79.0±5.5 72.5±5.8 54.2±4.2 54.0±2.6 72.3±4.6 50.4±1.9 50.5±0.8
GlobalE 80.2±4.2 43.2±4.3 81.2±0.9 76.1±3.8 80.3±3.4 73.0±4.3 54.3±4.0 56.7±2.0 78.1±1.9 51.3±1.8 51.2±0.8
Oracle 89.8±0.7 48.0±1.1 85.4±1.6 87.4±0.9 90.1±0.7 80.9±1.4 60.3±10.3 62.8±4.2 81.3±2.9 53.4±3.1 52.5±1.4
GPT-3 175B 93.9±0.6 54.4±2.5 95.4±0.9 94.6±0.7 91.0±1.0 83.2±1.5 71.2±7.3 72.1±2.7 85.1±1.7 70.8±2.8 75.1±5.1
LocalE 93.8±0.5 56.0±1.7 95.5±0.9 94.5±0.7 91.3±0.5 83.3±1.7 75.0±4.6 71.8±3.2 85.9±0.7 71.9±1.4 74.6±4.2
GlobalE 93.9±0.6 53.2±2.1 95.7±0.7 94.6±0.2 91.7±0.4 82.0±0.8 76.3±3.5 73.6±2.5 85.7±1.0 71.8±1.9 79.9±3.3
Oracle 94.7±0.2 58.2 96.7±0.2 95.5±0.2 92.6±0.4 85.5±0.8 81.1±4.9 77.0±1.2 87.7±0.6 74.7±0.4 83.0±0.9
Table 2: Our main results on a subset of the validation set, reported as mean ± standard deviation. To fit the data within the GPT-2 model context window size, we use 1-shot for DBPedia, 2-shot for AGNews, and 4-shot for the other datasets. All baseline results are calculated over 5 different random seeds and 24 training-context permutations. LocalE and GlobalE results are calculated based on the top 4 context permutations selected by our proposed approach. For GPT-3 175B, we only use 2 seeds with 12 different permutations due to a limited computation budget.
4 Experimental Setup

We use four different sizes of GPT-2 (Radford et al., 2019) (with 0.1B, 0.3B, 0.8B, and 1.5B parameters) and two sizes of GPT-3 (Brown et al., 2020) (with 2.7B and 175B parameters). Due to the limited context window size (up to 1024 word-pieces for the GPT-2 series of models), we use a 4-shot setting for all datasets except AGNews and DBPedia. Our experiments are based on the open-source checkpoints of the GPT-2 models and access to the OpenAI GPT-3 API (https://openai.com/api/). For probing set generation, we restrict the maximum generation length to 128. We also use sampling with a temperature, t, of 2, and we block n-gram repetitions (Paulus et al., 2018) to encourage diverse generation.

We use 24 different permutations for each set of randomly selected training samples and 5 different sets for each experiment (except for GPT-3 with 175B parameters, where we only use two sets with 12 different permutations due to the high monetary cost), giving a total of 120 runs. We report the mean and standard deviation of the corresponding evaluation metric over the 5 different sets.

For performant prompt selection, we rank candidate prompts using the LocalE and GlobalE probing metrics over the automatically generated probing set. We then select the top k prompt orderings with the highest entropy values, where k = 4 in our experiments, of the available 24 permutations, as performant prompts. Finally, we use these performant prompts to evaluate performance on the various datasets and demonstrate both better performance and reduced variance. We also provide results for a majority baseline, which always predicts the majority label in the dataset, as a lower bound on performance. We also provide an oracle, showing the upper bound of performance by selecting the top four performant orderings based on prompt performance on the validation set.
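A sketch of this selection step, reusing the global_entropy function illustrated in Section 3.2; predict_labels is an assumed helper that classifies every probing example with the given candidate ordering as context.

```python
def select_performant(orderings, probing_set, num_labels, k=4):
    # Rank all candidate orderings by GlobalE on the probing set and keep
    # the k highest-entropy ones (k = 4 in the paper's experiments).
    # predict_labels(c, probing_set) -> int label array (assumed helper).
    scored = sorted(
        orderings,
        key=lambda c: global_entropy(predict_labels(c, probing_set), num_labels),
        reverse=True,
    )
    return scored[:k]
```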
4.1 Evaluation Datasets

Similar to previous work (Gao et al., 2020; Zhao et al., 2021), we use eleven text classification datasets, ranging from sentiment classification to textual entailment. Further details of the datasets are provided in the Appendix. For evaluation, we sub-sample 256 samples of the validation sets for all datasets to control for the GPT-3 inference costs, as it requires the use of a monetary, paid-for API.

5 Results

We report experimental results in Table 2 and observe consistent improvements for both LocalE and GlobalE across all tasks.

Entropy-based probing is effective for performant prompt selection regardless of model size We find that GlobalE achieves, on average, a 13% relative improvement across the eleven different sentence classification tasks in comparison to prompts that do not make use of probing. LocalE provides results slightly inferior to GlobalE, with an average 9.6% relative improvement over the baseline model. Our selected performant prompts also demonstrate considerably lower variance than using all candidate prompts.
Ranking using Entropy-based probing is robust In Figure 8, we visualise the average performance when varying K for top-K prompt selection. K = 24 corresponds to using all sampled prompt orders, which is equivalent to the baseline model performance in Table 2. We observe that the slopes of the curves are negative for all datasets, suggesting that our method ranks performant prompts effectively. Though K = 1 can provide good performance in most cases, we use K = 4 in our experiments, as preliminary experiments indicated that it yielded stable performance across datasets.

[Figure 8: Accuracy (%) against top-K prompt selection (K = 1-24) for SST-2, SST-5, DBPedia, MR, CR, MPQA, Subj, TREC, AGNews, RTE, and CB.]

Method Template 1 Template 2 Template 3 Template 4
GPT-2 0.1B 58.9±7.8 57.5±6.8 58.1±7.4 56.6±6.6
LocalE 65.2±3.9 60.7±4.6 65.4±4.8 61.0±4.7
GlobalE 63.8±5.8 59.0±2.9 64.3±4.8 63.5±4.8
GPT-2 0.3B 61.0±13.2 63.9±11.3 68.3±11.8 59.2±6.4
LocalE 75.3±4.6 70.0±7.2 80.2±4.2 62.2±3.4
GlobalE 78.7±5.2 73.3±4.5 81.3±4.1 62.8±4.3
GPT-2 0.8B 74.5±10.3 66.6±10.6 70.3±10.5 63.7±8.9
LocalE 81.1±5.5 80.0±5.6 73.7±6.2 71.3±4.5
GlobalE 84.8±4.1 80.9±3.6 79.8±3.9 70.7±5.3
GPT-2 1.5B 66.8±10.8 80.4±7.6 54.5±7.9 69.1±10.5
LocalE 76.7±8.2 83.1±3.6 66.9±7.5 72.7±5.5
GlobalE 81.8±3.9 83.4±3.2 67.2±6.1 74.2±5.3

Table 3: Prompt selection performance of different templates on SST-2 (mean ± standard deviation).

ID Template Label Mapping
1 Review: {Sentence} Sentiment: {Label} | positive/negative
2 Input: {Sentence} Prediction: {Label} | positive/negative
3 Review: {Sentence} Sentiment: {Label} | good/bad
4 {Sentence} It was {Label} | good/bad

Table 4: The four SST-2 templates and their label mappings.
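For reference, the four templates of Table 4 can be written as plain format strings. The rendering below is our own; the newline placement inside each template is an assumption.

```python
# The four SST-2 templates of Table 4, with their label mappings.
SST2_TEMPLATES = {
    1: ("Review: {sentence}\nSentiment: {label}", ("positive", "negative")),
    2: ("Input: {sentence}\nPrediction: {label}", ("positive", "negative")),
    3: ("Review: {sentence}\nSentiment: {label}", ("good", "bad")),
    4: ("{sentence} It was {label}", ("good", "bad")),
}

template, labels = SST2_TEMPLATES[4]
print(template.format(sentence="a gorgeous film .", label=labels[0]))
# -> a gorgeous film . It was good
```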
Entropy-based probing is effective across templates We evaluate Entropy-based probing for four different templates, similar to Gao et al. (2020) and Zhao et al. (2021) (Table 4), on the SST-2 dataset. The experimental results in Table 3 indicate that Entropy-based probing is valid across different templates. We also observe that the randomness across different templates is similar to Section 2. These findings suggest that Entropy-based probing is not sensitive to specific templates, as it consistently provides improvements in all cases.

Performant permutation selection is a safe option for In-context Learning We find that for models that suffer from high prompt variance, our prompt selection process can show large improvements, up to a 30% relative improvement. Furthermore, for tasks with low initial prompt performance variance, our method does not negatively impact performance. Our prompt selection provides marginal improvement at worst, and on average a 13% relative improvement in most cases.

Sentence-pair tasks remain challenging for smaller-sized models even with performant permutation selection For the CB and RTE datasets, the performance of the GPT-2 models is not significantly different from that of a random baseline. Despite this, we find that our method for identifying performant prompts can still provide minimal performance gains, although these remain within the levels of a random guess or majority vote. One reason for this could be that, for these particular model sizes on these tasks, no good prompt exists. As such, optimising the prompt is not particularly effective in this setting. This is further supported by the observation that prompt selection can considerably improve performance on both CB and RTE at larger model sizes (particularly so for the GPT-3 175B parameter model). In fact, we find that prompt selection using GlobalE improves performance by 4.9% for GPT-3 175B on CB. This indicates that our method is widely applicable across model sizes and tasks, as long as the models already possess some existing classification ability that can be improved through prompt design.

Entropy-based probing outperforms using subsets of the training data for tuning If one were not to rely on generation, an alternative approach to prompt selection could be to split the (limited) training data to form a validation set. To compare against this approach, we split the 4-shot training samples (the same setting as in Table 2) in half. We then select the top four performing prompts using validation set performance. As can be seen in Table 5, this approach consistently outperforms the baseline. However, both Entropy-based probing methods consistently provide better performance across all model sizes.

Method GPT-2 0.1B GPT-2 0.3B GPT-2 0.8B GPT-2 1.5B
Baseline 58.9±7.8 61.0±13.2 74.5±10.3 66.8±10.8
LocalE 65.2±3.9 75.3±4.6 81.1±5.5 76.7±8.2
GlobalE 63.8±5.8 78.7±5.2 84.8±4.1 81.8±3.9
Split Training Set 62.8±5.3 64.2±6.1 75.1±6.8 71.4±7.8

Table 5: Comparing our method with splitting the training set into train and development sets, for SST-2.
6 Related Work

Unified Interface Design for NLP Most previous work focuses on shared-parameter models, pretrained on some tasks and then fine-tuned for different tasks, e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), eventually leading to multiple task-specific models. There have for some time been attempts to design a unified interface for NLP tasks (Kumar et al., 2016; Raffel et al., 2020). In parallel with these works, GPT-2 (Radford et al., 2019) shows that appending trigger tokens (e.g., “TL;DR”) at the end of the language model input can cause language models to behave like summarisation models. The zero-shot capability of language models shows the potential to unify NLP tasks into a language modelling framework in which fine-tuning is not necessary to achieve good performance. Furthermore, GPT-3 (Brown et al., 2020) shows that task-agnostic, few-shot performance can be improved by scaling up language models, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.

Prompt Design for PLMs The core challenge of prompt design is to convert training data (if it exists) into a text sequence. Most work on prompt design focuses on how to make prompts more compatible with language models. Petroni et al. (2019) use human effort to design natural language sentences and then perform token prediction given the input context. However, hand-crafted templates require significant human effort and are likely to end up with sub-optimal performance. Recent work has explored automatic template construction: Schick and Schütze (2020) use cloze-style tasks to construct templates, Gao et al. (2020) use an external language model to generate templates, and Shin et al. (2020) use gradient-guided search to find templates that maximise performance. Jiang et al. (2020) use a mining-based method to create multiple diverse templates automatically.

Order Sensitivity of Prompt Design Gao et al. (2020) demonstrated that finetuning-based approaches are not as order-sensitive as In-context Learning. Making use of a standard-size training set, Liu et al. (2021) used nearest-neighbour search to retrieve the most relevant training samples for a specific test sample. They were successful in retrieving relevant samples and concluded that, after retrieval, the order in which the samples are provided in the prompt has little to no effect on performance. While our study is fundamentally different from theirs in that we do not make use of a standard-size training set, we do come to the opposite conclusion. All previous work on prompt design focuses on the textual quality of the prompt and, to the best of our knowledge, none has studied order sensitivity in detail.

True Few-shot Learning Perez et al. (2021) evaluated the few-shot capability of LMs when a held-out validation set is not available. Their experimental results suggested that previous work overestimates the few-shot ability of LMs in this (true few-shot learning) setting. Our work instead uses the generative nature of language models to construct a probing set without relying on held-out examples. We show that our probing method is better than relying on held-out examples (Table 5) and thus enables true few-shot learning.

7 Conclusion

We have shown that few-shot prompts suffer from order sensitivity: for the same prompt, the order in which samples are provided can make the difference between state-of-the-art and random performance. In our analysis of the problem, we established that it is present across tasks, model sizes, prompt templates, samples, and numbers of training samples. To alleviate this problem, we introduced a novel probing method that exploits the generative nature of language models to construct an artificial development set. We were able to identify performant permutations using entropy-based statistics over this set, leading to an average 13% improvement across eleven text classification tasks.
References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178.

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, pages 107–124.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387. PMLR.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. arXiv preprint arXiv:2105.11447.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models' factual predictions. In Automated Knowledge Base Construction.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Timo Schick and Hinrich Schütze. 2020. It's not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015:649–657.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
Dataset prompts and label mappings (dataset, label mapping, example prompt):

SST-2 (positive/negative)
Review: contains no wit , only labored gags
Sentiment: negative

SST-5 (terrible/bad/okay/good/great)
Review: apparently reassembled from the cutting-room floor of any given daytime soap .
Sentiment: terrible

MR (negative/positive)
Review: lame sweet home leaves no southern stereotype unturned .
Sentiment: negative

CR (negative/positive)
Review: bluetooth does not work on this phone .
Sentiment: negative

MPQA (negative/positive)
Review: dangerous situation
Sentiment: negative

Subj (subjective/objective)
Input: too slow , too boring , and occasionally annoying .
Type: subjective

TREC (description/entity/expression/human/location/number)
Question: When did the neanderthal man live ?
Type: number

AGNews (world/sports/business/technology)
input: Wall St. Bears Claw Back Into the Black (Reuters).
type: business

DBPedia (company/school/artist/athlete/politics/transportation/building/nature/village/animal/plant/album/film/book)
input: CMC Aviation is a charter airline based in Nairobi Kenya.
type: company

CB (true/false/neither)
premise: It was a complex language. Not written down but handed down. One might say it was peeled down.
hypothesis: the language was peeled down
prediction: true

RTE (True/False)
premise: No Weapons of Mass Destruction Found in Iraq Yet.
hypothesis: Weapons of Mass Destruction Found in Iraq.
prediction: False
RTE
Premise: In the early 1940s, the United States and the Soviet Union were at war with Germany.
Hypothesis: Germany was at war with the United States and Russia.

CB
Premise: Maggie took Gloria out for a drive to the nearby city limits of Fort Myers on Tuesday
Hypothesis: he couldn’t bear looking down his nose at all the other houses

Premise: There was one in Dallas. When it came out in New Jersey. And there were,[...]
Hypothesis: I would never see that movie

Table 9: Artificial development set generated by GPT2-XL (1.5B). We randomly select three examples per dataset. Long sentences are trimmed due to limited space.