Fantastically Ordered Prompts and Where To Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu† Max Bartolo† Alastair Moore‡ Sebastian Riedel† Pontus Stenetorp†
† University College London ‡ Mishcon de Reya LLP
{yao.lu,m.bartolo,s.riedel,p.stenetorp}@cs.ucl.ac.uk
alastair.moore@mishcon.com
Abstract

When primed with only a handful of training samples, very large, pretrained language models [...]

Figure 6: Left: Predicted SST-2 label distribution under different prompts. Right: 2-shot calibrated performance (Zhao et al., 2021) of all possible permutations on GPT2-XL (1.5B).

[...] ordering that is performant across different models.

Degenerate behaviour of bad prompts We perform error analysis across performant and non-performant prompts and observe that the majority of failing prompts suffer from highly unbalanced predicted label distributions (Figure 6, left). An intuitive way to address this would be to calibrate the output distribution, along the lines of Zhao et al. (2021). However, we find that although calibration leads to much higher performance, the variance remains high (Figure 6, right).
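For concreteness, the calibration referenced here can be sketched as follows. This is our own minimal rendering of the contextual calibration of Zhao et al. (2021), not part of this paper's method: the label distribution the model assigns to a content-free input (e.g. "N/A") under the same prompt is divided out, and the result renormalised.

```python
import numpy as np

def calibrate(label_probs, content_free_probs):
    # Contextual calibration (after Zhao et al., 2021): divide out the
    # bias the model shows on a content-free input such as "N/A",
    # then renormalise over the label set.
    scores = label_probs / content_free_probs
    return scores / scores.sum(axis=-1, keepdims=True)
```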
3 Methodology

The previous section demonstrates that prompt order can have a substantial effect on performance, with some orderings of the same prompts for the same model providing random performance, and other “better” orderings providing performance competitive with supervised approaches. This suggests that there could be various ways of selecting prompt orders to achieve better performance, but the challenge is to do so automatically and without the need for additional labels (e.g., a development set).

Hence, in this section, we explore the question: “How can we automatically generate a ‘probing set’ to find performant prompt orderings?” We approach this by: (i) for a randomly-selected set of training samples, using every possible ordering permutation of this set as candidates; (ii) constructing a probing set by querying the language model using all candidate prompts as context; and (iii) using this probing set to identify the best ordering by ranking the candidates with a probing metric.

3.1 Sampling from the Language Model to Construct a Probing Set

We propose a simple methodology to automatically construct a “probing set” by directly sampling from the language model itself, without relying on additional held-out data. Concretely, given a set of training samples S = {(x_i, y_i)}, i = 1, ..., n, where x_i and y_i denote the sentence and label of the i-th training sample, we define a transformation T mapping each sample into natural language space, such that t_i = T(x_i, y_i). t_i is therefore a text sequence rendering of the i-th training sample using the template defined by T. In this work, we use a simple transformation function T such that T(x_i, y_i) = input: x_i type: y_i. This transforms each sample into a standard-format sentence, linearising each element of the set into natural language space, defined as S′ = {t_i}, i = 1, ..., n.

We then define the full permutation function group over the n training samples, F = {f_m}, m = 1, ..., n!, where each function f_m takes S′ as input and outputs c_m: the concatenation of a unique permutation. In our case, sampling four training samples at random gives up to 24 possible ordering permutations of the transformed samples.

For each prompt candidate c_m, we then sample from the language model to obtain the probing sequence g_m ∼ P(·|c_m; θ), where θ denotes the parameters of the pretrained language model. We stop decoding from the language model upon generating the special end-of-sentence token defined by the template, or upon reaching the generation length limit. Our probing set construction method is illustrated in Figure 7, where the objective is to generate a probing set that shares a similar distribution to the training samples.

We run this sampling process for all possible prompt ordering permutations and extract probing samples from them (T⁻¹(g)). We then gather the extracted samples together to form the probing set D = T⁻¹(g_1) ⊕ ... ⊕ T⁻¹(g_n!). Although the probing set contains a predicted label for each sentence, there is no guarantee of the validity of these labels. We therefore discard them from the probing set, as we are only interested in sampling probes from the language model that correspond to the input distribution.
Figure 7: Our probing set construction method, showing the various possible ordering permutations of the randomly selected training samples, the resulting generation for each permutation, and the concatenation of each into a probing set. Note that we discard the generated labels, as there is no guarantee that these generated labels are correct.
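To make the construction concrete, below is a minimal sketch of the sampling loop using GPT-2 through the HuggingFace transformers library. This is our illustration rather than the authors' released code: the toy 4-shot samples are invented, and the exact n-gram block size is an assumption (the paper only states that n-gram repetitions are blocked).

```python
from itertools import permutations
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def transform(x, y):
    # T(x_i, y_i): linearise a training sample with the paper's template.
    return f"input: {x} type: {y}"

# Toy 4-shot sentiment training set (invented for illustration).
train = [("a gorgeous film .", "positive"), ("utterly lifeless .", "negative"),
         ("a small triumph .", "positive"), ("simply unwatchable .", "negative")]
transformed = [transform(x, y) for x, y in train]

probing_sequences = []
for perm in permutations(transformed):      # all 4! = 24 candidate orderings c_m
    context = "\n".join(perm) + "\n"
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    output = model.generate(
        input_ids,
        do_sample=True,
        temperature=2.0,                    # Section 4: sampling temperature t = 2
        no_repeat_ngram_size=3,             # block n-gram repetition (n assumed)
        max_new_tokens=128,                 # Section 4: generation length limit
        pad_token_id=tokenizer.eos_token_id,
    )
    g_m = tokenizer.decode(output[0, input_ids.shape[1]:])
    probing_sequences.append(g_m)           # later: apply T^-1 and discard labels
```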
3.2 Probing Metrics

Once we have constructed a probing set for a given set of samples, we can use that probing set to identify the best possible prompt ordering for that particular sample set. Here, we explore two methods for selecting the best ordering: Global Entropy (GlobalE) and Local Entropy (LocalE).

Global Entropy (GlobalE) For each label v ∈ V (where V denotes the target label set), we compute the label probability over the probing set as:

p^v_m = (Σ_i 1{ŷ_{i,m} = v}) / |D|    (2)

We then use the predicted category label entropy as the GlobalE score for c_m, as follows:

GlobalE_m = Σ_{v∈V} −p^v_m log p^v_m    (3)

Local Entropy (LocalE) The motivation behind LocalE is that if a model is overly confident for all probing inputs, then it is likely that the model is not behaving as desired. At the very least, it is poorly calibrated, which could also indicate a poor capability to appropriately differentiate between classes. Similar to the GlobalE computation, we calculate the prediction probability of a data point (x′_i, y′_i) over the target labels v ∈ V under context c_m, as follows:

p^v_{i,m} = P_{(x′_i, y′_i)∼D}(v | c_m ⊕ T(x′_i); θ), v ∈ V    (4)

We then calculate the average prediction entropy per data point as the LocalE score:

LocalE_m = (Σ_i Σ_{v∈V} −p^v_{i,m} log p^v_{i,m}) / |D|    (5)
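Both metrics reduce to a few lines of code. The sketch below is our own NumPy illustration of Equations 2 to 5, not the authors' implementation; it assumes the probing set has already been scored under a candidate ordering c_m, giving hard label predictions for GlobalE and a per-example probability matrix for LocalE.

```python
import numpy as np

def global_entropy(pred_labels, num_labels):
    # GlobalE (Eqs. 2-3): entropy of the predicted-label distribution
    # over the whole probing set D under one candidate ordering c_m.
    p = np.bincount(pred_labels, minlength=num_labels) / len(pred_labels)
    p = p[p > 0]                                   # avoid log(0)
    return float(-(p * np.log(p)).sum())

def local_entropy(label_probs):
    # LocalE (Eqs. 4-5): mean per-example prediction entropy.
    # label_probs has shape (|D|, |V|); each row sums to 1.
    eps = 1e-12
    per_point = -(label_probs * np.log(label_probs + eps)).sum(axis=1)
    return float(per_point.mean())
```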
Method SST-2 SST-5 DBPedia MR CR MPQA Subj TREC AGNews RTE CB
Majority 50.9 23.1 9.4 50.0 50.0 50.0 50.0 18.8 25.0 52.7 51.8
Finetuning (Full) 95.0 58.7 99.3 90.8 89.4 87.8 97.0 97.4 94.7 80.9 90.5
GPT-2 0.1B 58.9±7.8 29.0±4.9 44.9±9.7 58.6±7.6 58.4±6.4 68.9±7.1 52.1±0.7 49.2±4.7 50.8±11.9 49.7±2.7 50.1±1.0
LocalE 65.2±3.9 34.4±3.4 53.3±4.9 66.0±6.3 65.0±3.4 72.5±6.0 52.9±1.3 48.0±3.9 61.0±5.9 53.0±3.3 49.9±1.6
GlobalE 63.8±5.8 35.8±2.0 56.1±4.3 66.4±5.8 64.8±2.7 73.5±4.5 53.0±1.3 46.1±3.7 62.1±5.7 53.0±3.0 50.3±1.6
Oracle 73.5±1.7 38.2±4.0 60.5±4.2 74.3±4.9 70.8±4.4 81.3±2.5 55.2±1.7 58.1±4.3 70.3±2.8 56.8±2.0 52.1±1.3
GPT-2 0.3B 61.0±13.2 25.9±5.9 51.7±7.0 54.2±7.8 56.7±9.4 54.5±8.8 54.4±7.9 52.6±4.9 47.7±10.6 48.8±2.6 50.2±5.3
LocalE 75.3±4.6 31.0±3.4 47.1±3.7 65.2±6.6 70.9±6.3 67.6±7.2 66.7±9.3 53.0±3.9 51.2±7.3 51.8±1.0 47.1±4.2
GlobalE 78.7±5.2 31.7±5.2 58.3±5.4 67.0±5.9 70.7±6.7 68.3±6.9 65.8±10.1 53.3±4.6 59.6±7.2 51.1±1.9 50.3±3.7
Oracle 85.5±4.3 40.5±6.3 65.2±7.6 74.7±6.1 80.4±5.4 77.3±2.3 79.4±2.4 63.3±2.9 68.4±8.0 53.9±1.3 62.5±7.4
GPT-2 0.8B 74.5±10.3 34.7±8.2 55.0±12.5 64.6±13.1 70.9±12.7 65.5±8.7 56.4±9.1 56.5±2.7 62.2±11.6 53.2±2.0 38.8±8.5
LocalE 81.1±5.5 40.3±4.7 56.7±7.5 82.6±4.2 85.4±3.8 73.6±4.8 70.4±4.2 56.2±1.7 62.7±8.1 53.3±1.6 38.4±5.2
GlobalE 84.8±4.1 46.9±1.1 67.7±3.6 84.3±2.9 86.7±2.5 75.8±3.1 68.6±6.5 57.2±2.3 70.7±3.6 53.5±1.5 41.2±4.5
Oracle 88.9±1.8 48.4±0.7 72.3±3.3 87.5±1.1 89.9±0.9 80.3±4.9 76.6±4.1 62.1±1.5 78.1±1.3 57.3±1.0 53.2±5.3
GPT-2 1.5B 66.8±10.8 41.7±6.7 82.6±2.5 59.1±11.9 56.9±9.0 73.9±8.6 59.7±10.4 53.1±3.3 77.6±7.3 55.0±1.4 53.8±4.7
LocalE 76.7±8.2 45.1±3.1 83.8±1.7 78.1±5.6 71.8±8.0 78.5±3.6 69.7±5.8 53.6±3.1 79.3±3.7 56.8±1.1 52.6±3.9
GlobalE 81.8±3.9 43.5±4.5 83.9±1.8 77.9±5.7 73.4±6.0 81.4±2.1 70.9±6.0 55.5±3.0 83.9±1.2 56.3±1.2 55.1±4.6
Oracle 86.1±1.5 50.9±1.0 87.3±1.5 84.0±2.7 80.3±3.3 85.1±1.4 79.9±5.7 59.0±2.3 86.1±0.7 58.2±0.6 63.9±4.3
GPT-3 2.7B 78.0±10.7 35.3±6.9 81.1±1.8 68.0±12.9 76.8±11.7 66.5±10.3 49.1±2.9 55.3±4.4 72.9±4.8 48.6±1.9 50.4±0.7
LocalE 81.0±6.0 42.3±4.7 80.3±1.7 75.6±4.1 79.0±5.5 72.5±5.8 54.2±4.2 54.0±2.6 72.3±4.6 50.4±1.9 50.5±0.8
GlobalE 80.2±4.2 43.2±4.3 81.2±0.9 76.1±3.8 80.3±3.4 73.0±4.3 54.3±4.0 56.7±2.0 78.1±1.9 51.3±1.8 51.2±0.8
Oracle 89.8±0.7 48.0±1.1 85.4±1.6 87.4±0.9 90.1±0.7 80.9±1.4 60.3±10.3 62.8±4.2 81.3±2.9 53.4±3.1 52.5±1.4
GPT-3 175B 93.9±0.6 54.4±2.5 95.4±0.9 94.6±0.7 91.0±1.0 83.2±1.5 71.2±7.3 72.1±2.7 85.1±1.7 70.8±2.8 75.1±5.1
LocalE 93.8±0.5 56.0±1.7 95.5±0.9 94.5±0.7 91.3±0.5 83.3±1.7 75.0±4.6 71.8±3.2 85.9±0.7 71.9±1.4 74.6±4.2
GlobalE 93.9±0.6 53.2±2.1 95.7±0.7 94.6±0.2 91.7±0.4 82.0±0.8 76.3±3.5 73.6±2.5 85.7±1.0 71.8±1.9 79.9±3.3
Oracle 94.7±0.2 58.2 96.7±0.2 95.5±0.2 92.6±0.4 85.5±0.8 81.1±4.9 77.0±1.2 87.7±0.6 74.7±0.4 83.0±0.9
Table 2: Our main results on a subset of the validation set, reported as mean ± standard deviation. To fit the data within the GPT-2 model context window size, we use 1-shot for DBPedia, 2-shot for AGNews, and 4-shot for the other datasets. All baseline results are calculated over 5 different random seeds and 24 training-context permutations. LocalE and GlobalE results are calculated based on the top 4 context permutations selected by our proposed approach. For GPT-3 175B, we only use 2 seeds with 12 different permutations due to a limited computation budget.
4 Experimental Setup

We use four different sizes of GPT-2 (Radford et al., 2019) (with 0.1B, 0.3B, 0.8B, and 1.5B parameters) and two sizes of GPT-3 (Brown et al., 2020) (with 2.7B and 175B parameters). Due to the limited context window size (up to 1024 word-pieces for the GPT-2 series of models), we use a 4-shot setting for all datasets except AGNews and DBPedia. Our experiments are based on the open-source checkpoints of the GPT-2 models and access to the OpenAI GPT-3 API (https://openai.com/api/). For probing set generation, we restrict the maximum generation length to 128. We also use sampling with a temperature, t, of 2, and we block n-gram repetitions (Paulus et al., 2018) to encourage diverse generation.

We use 24 different permutations for each set of randomly selected training samples and 5 different sets for each experiment (except for GPT-3 with 175B parameters, where we only use two sets with 12 different permutations due to the high monetary cost), giving a total of 120 runs. We report the mean and standard deviation of the corresponding evaluation metric over the 5 different sets.

For performant prompt selection, we rank candidate prompts using the LocalE and GlobalE probing metrics over the automatically generated probing set. We then select the top k prompt orderings with the highest entropy values, where k = 4 in our experiments, of the available 24 permutations, as performant prompts. Finally, we use these performant prompts to evaluate performance on the various datasets and demonstrate both better performance and reduced variance. We also provide results for a majority baseline, which always predicts the majority label in the dataset, as a lower bound on performance. We also provide an oracle, showing the upper bound of performance by selecting the top four performant orderings based on prompt performance on the validation set.
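A sketch of this selection step, reusing the global_entropy function illustrated in Section 3.2; predict_labels is an assumed helper that classifies every probing example with the given candidate ordering as context.

```python
def select_performant(orderings, probing_set, num_labels, k=4):
    # Rank all candidate orderings by GlobalE on the probing set and keep
    # the k highest-entropy ones (k = 4 in the paper's experiments).
    # predict_labels(c, probing_set) -> int label array (assumed helper).
    scored = sorted(
        orderings,
        key=lambda c: global_entropy(predict_labels(c, probing_set), num_labels),
        reverse=True,
    )
    return scored[:k]
```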
4.1 Evaluation Datasets

Similar to previous work (Gao et al., 2020; Zhao et al., 2021), we use eleven text classification datasets, ranging from sentiment classification to textual entailment. Further details of the datasets are provided in the Appendix. For evaluation, we sub-sample 256 samples of the validation sets for all datasets to control for the GPT-3 inference costs, as it requires the use of a monetary, paid-for API.

5 Results

We report experimental results in Table 2 and observe consistent improvements for both LocalE and GlobalE across all tasks.

Entropy-based probing is effective for performant prompt selection regardless of model size We find that GlobalE achieves, on average, a 13% relative improvement across the eleven different sentence classification tasks in comparison to prompts that do not make use of probing. LocalE provides results slightly inferior to GlobalE, with an average 9.6% relative improvement over the baseline model. Our selected performant prompts also demonstrate considerably lower variance than using all candidate prompts.
Ranking using Entropy-based probing is robust In Figure 8, we visualise the average performance when varying K for top-K prompt selection. K = 24 corresponds to using all sampled prompt orders, which is equivalent to the baseline model performance in Table 2. We observe that the slopes of the curves are negative for all datasets, suggesting that our method ranks performant prompts effectively. Though K = 1 can provide good performance in most cases, we use K = 4 in our experiments, as preliminary experiments indicated that it yielded stable performance across datasets.

[Figure 8: Accuracy (%) against top-K prompt selection (K = 1-24) for SST-2, SST-5, DBPedia, MR, CR, MPQA, Subj, TREC, AGNews, RTE, and CB.]

Method Template 1 Template 2 Template 3 Template 4
GPT-2 0.1B 58.9±7.8 57.5±6.8 58.1±7.4 56.6±6.6
LocalE 65.2±3.9 60.7±4.6 65.4±4.8 61.0±4.7
GlobalE 63.8±5.8 59.0±2.9 64.3±4.8 63.5±4.8
GPT-2 0.3B 61.0±13.2 63.9±11.3 68.3±11.8 59.2±6.4
LocalE 75.3±4.6 70.0±7.2 80.2±4.2 62.2±3.4
GlobalE 78.7±5.2 73.3±4.5 81.3±4.1 62.8±4.3
GPT-2 0.8B 74.5±10.3 66.6±10.6 70.3±10.5 63.7±8.9
LocalE 81.1±5.5 80.0±5.6 73.7±6.2 71.3±4.5
GlobalE 84.8±4.1 80.9±3.6 79.8±3.9 70.7±5.3
GPT-2 1.5B 66.8±10.8 80.4±7.6 54.5±7.9 69.1±10.5
LocalE 76.7±8.2 83.1±3.6 66.9±7.5 72.7±5.5
GlobalE 81.8±3.9 83.4±3.2 67.2±6.1 74.2±5.3

Table 3: Prompt selection performance of different templates on SST-2 (mean ± standard deviation).

ID Template Label Mapping
1 Review: {Sentence} Sentiment: {Label} | positive/negative
2 Input: {Sentence} Prediction: {Label} | positive/negative
3 Review: {Sentence} Sentiment: {Label} | good/bad
4 {Sentence} It was {Label} | good/bad

Table 4: The four SST-2 templates and their label mappings.
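For reference, the four templates of Table 4 can be written as plain format strings. The rendering below is our own; the newline placement inside each template is an assumption.

```python
# The four SST-2 templates of Table 4, with their label mappings.
SST2_TEMPLATES = {
    1: ("Review: {sentence}\nSentiment: {label}", ("positive", "negative")),
    2: ("Input: {sentence}\nPrediction: {label}", ("positive", "negative")),
    3: ("Review: {sentence}\nSentiment: {label}", ("good", "bad")),
    4: ("{sentence} It was {label}", ("good", "bad")),
}

template, labels = SST2_TEMPLATES[4]
print(template.format(sentence="a gorgeous film .", label=labels[0]))
# -> a gorgeous film . It was good
```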
Entropy-based probing is effective across templates We evaluate Entropy-based probing for four different templates, similar to Gao et al. (2020) and Zhao et al. (2021) (Table 4), on the SST-2 dataset. The experimental results in Table 3 indicate that Entropy-based probing is valid across different templates. We also observe that the randomness across different templates is similar to Section 2. These findings suggest that Entropy-based probing is not sensitive to specific templates, as it consistently provides improvements in all cases.

Performant permutation selection is a safe option for In-context Learning We find that for models that suffer from high prompt variance, our prompt selection process can show large improvements, up to a 30% relative improvement. Furthermore, for tasks with low initial prompt performance variance, our method does not negatively impact performance. Our prompt selection provides marginal improvement at worst, and on average a 13% relative improvement in most cases.

Sentence-pair tasks remain challenging for smaller-sized models even with performant permutation selection For the CB and RTE datasets, the performance of the GPT-2 models is not significantly different from that of a random baseline. Despite this, we find that our method for identifying performant prompts can still provide minimal performance gains, although these remain within the levels of a random guess or majority vote. One reason for this could be that, for these particular model sizes on these tasks, no good prompt exists. As such, optimising the prompt is not particularly effective in this setting. This is further supported by the observation that prompt selection can considerably improve performance on both CB and RTE at larger model sizes (particularly so for the GPT-3 175B parameter model). In fact, we find that prompt selection using GlobalE improves performance by 4.9% for GPT-3 175B on CB. This indicates that our method is widely applicable across model sizes and tasks, as long as the models already possess some existing classification ability that can be improved through prompt design.

Entropy-based probing outperforms using subsets of the training data for tuning If one were not to rely on generation, an alternative approach to prompt selection could be to split the (limited) training data to form a validation set. To compare against this approach, we split the 4-shot training samples (the same setting as in Table 2) in half. We then select the top four performing prompts using validation set performance. As can be seen in Table 5, this approach consistently outperforms the baseline. However, both Entropy-based probing methods consistently provide better performance across all model sizes.

Method GPT-2 0.1B GPT-2 0.3B GPT-2 0.8B GPT-2 1.5B
Baseline 58.9±7.8 61.0±13.2 74.5±10.3 66.8±10.8
LocalE 65.2±3.9 75.3±4.6 81.1±5.5 76.7±8.2
GlobalE 63.8±5.8 78.7±5.2 84.8±4.1 81.8±3.9
Split Training Set 62.8±5.3 64.2±6.1 75.1±6.8 71.4±7.8

Table 5: Comparing our method with splitting the training set into train and development sets, for SST-2.
6 Related Work

Unified Interface Design for NLP Most previous work focuses on shared-parameter models, pretrained on some tasks and then fine-tuned for different tasks, e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), eventually leading to multiple task-specific models. There have for some time been attempts to design a unified interface for NLP tasks (Kumar et al., 2016; Raffel et al., 2020). In parallel with these works, GPT-2 (Radford et al., 2019) shows that appending trigger tokens (e.g., “TL;DR”) at the end of the language model input can cause language models to behave like summarisation models. The zero-shot capability of language models shows the potential to unify NLP tasks into a language modelling framework in which fine-tuning is not necessary to achieve good performance. Furthermore, GPT-3 (Brown et al., 2020) shows that task-agnostic, few-shot performance can be improved by scaling up language models, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.

Prompt Design for PLMs The core challenge of prompt design is to convert training data (if it exists) into a text sequence. Most work on prompt design focuses on how to make prompts more compatible with language models. Petroni et al. (2019) use human effort to design natural language sentences and then perform token prediction given the input context. However, hand-crafted templates require significant human effort and are likely to end up with sub-optimal performance. Recent work has explored automatic template construction: Schick and Schütze (2020) use cloze-style tasks to construct templates, Gao et al. (2020) use an external language model to generate templates, and Shin et al. (2020) use gradient-guided search to find templates that maximise performance. Jiang et al. (2020) use a mining-based method to create multiple diverse templates automatically.

Order Sensitivity of Prompt Design Gao et al. (2020) demonstrated that finetuning-based approaches are not as order-sensitive as In-context Learning. Making use of a standard-size training set, Liu et al. (2021) used nearest-neighbour search to retrieve the most relevant training samples for a specific test sample. They were successful in retrieving relevant samples and concluded that, after retrieval, the order in which the samples are provided in the prompt has little to no effect on performance. While our study is fundamentally different from theirs in that we do not make use of a standard-size training set, we do come to the opposite conclusion. All previous work on prompt design focuses on the textual quality of the prompt and, to the best of our knowledge, none has studied order sensitivity in detail.

True Few-shot Learning Perez et al. (2021) evaluated the few-shot capability of LMs when a held-out validation set is not available. Their experimental results suggested that previous work overestimates the few-shot ability of LMs in this (true few-shot learning) setting. Our work instead uses the generative nature of language models to construct a probing set without relying on held-out examples. We show that our probing method is better than relying on held-out examples (Table 5) and thus enables true few-shot learning.

7 Conclusion

We have shown that few-shot prompts suffer from order sensitivity: for the same prompt, the order in which samples are provided can make the difference between state-of-the-art and random performance. In our analysis of the problem, we established that it is present across tasks, model sizes, prompt templates, samples, and numbers of training samples. To alleviate this problem, we introduced a novel probing method that exploits the generative nature of language models to construct an artificial development set. We were able to identify performant permutations using entropy-based statistics over this set, leading to an average 13% improvement across eleven text classification tasks.
References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178.

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, pages 107–124.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387. PMLR.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. arXiv preprint arXiv:2105.11447.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models' factual predictions. In Automated Knowledge Base Construction.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Timo Schick and Hinrich Schütze. 2020. It's not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015:649–657.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
Dataset prompts and label mappings (dataset, label mapping, example prompt):

SST-2 (positive/negative)
Review: contains no wit , only labored gags
Sentiment: negative

SST-5 (terrible/bad/okay/good/great)
Review: apparently reassembled from the cutting-room floor of any given daytime soap .
Sentiment: terrible

MR (negative/positive)
Review: lame sweet home leaves no southern stereotype unturned .
Sentiment: negative

CR (negative/positive)
Review: bluetooth does not work on this phone .
Sentiment: negative

MPQA (negative/positive)
Review: dangerous situation
Sentiment: negative

Subj (subjective/objective)
Input: too slow , too boring , and occasionally annoying .
Type: subjective

TREC (description/entity/expression/human/location/number)
Question: When did the neanderthal man live ?
Type: number

AGNews (world/sports/business/technology)
input: Wall St. Bears Claw Back Into the Black (Reuters).
type: business

DBPedia (company/school/artist/athlete/politics/transportation/building/nature/village/animal/plant/album/film/book)
input: CMC Aviation is a charter airline based in Nairobi Kenya.
type: company

CB (true/false/neither)
premise: It was a complex language. Not written down but handed down. One might say it was peeled down.
hypothesis: the language was peeled down
prediction: true

RTE (True/False)
premise: No Weapons of Mass Destruction Found in Iraq Yet.
hypothesis: Weapons of Mass Destruction Found in Iraq.
prediction: False
RTE
Premise: In the early 1940s, the United States and the Soviet Union were at war with Germany.
Hypothesis: Germany was at war with the United States and Russia.

CB
Premise: Maggie took Gloria out for a drive to the nearby city limits of Fort Myers on Tuesday
Hypothesis: he couldn’t bear looking down his nose at all the other houses

Premise: There was one in Dallas. When it came out in New Jersey. And there were,[...]
Hypothesis: I would never see that movie

Table 9: Artificial development set generated by GPT2-XL (1.5B). We randomly select three examples per dataset. Long sentences are trimmed due to limited space.