
LLMs as Method Actors:

A Model for Prompt Engineering and Architecture

Colin Doyle 1

arXiv:2411.05778v2 [cs.AI] 11 Nov 2024

Abstract

We introduce “Method Actors” as a mental model for guiding LLM prompt engineering and prompt architecture. Under this mental model, LLMs should be thought of as actors; prompts as scripts and cues; and LLM responses as performances. We apply this mental model to the task of improving LLM performance at playing Connections, a New York Times word puzzle game that prior research identified as a challenging benchmark for evaluating LLM reasoning. Our experiments with GPT-4o show that a “Method Actors” approach can significantly improve LLM performance over both a vanilla and “Chain of Thoughts” approach. A vanilla approach solves 27% of Connections puzzles in our dataset and a “Chain of Thoughts” approach solves 41% of puzzles, whereas our strongest “Method Actor” approach solves 86% of puzzles. We also test OpenAI’s newest model designed specifically for complex reasoning tasks, o1-preview. When asked to solve a puzzle all at once, o1-preview solves 79% of Connections puzzles in our dataset, and when allowed to build puzzle solutions one guess at a time over multiple API calls, o1-preview solves 100% of the puzzles. Incorporating a “Method Actor” prompt architecture increases the percentage of puzzles that o1-preview solves perfectly from 76% to 87%.

1. Introduction

We introduce “Method Actors” as a mental model for guiding LLM prompt engineering and prompt architecture. Under this mental model, LLMs should be thought of as actors; prompts as scripts and cues; and LLM responses as performances. Four principles for prompt writing and task decomposition follow from this mental model: 1) Prompt engineering is playwriting and directing. 2) Performance requires preparation. 3) Complex tasks should be decomposed to the point at which imitation and authenticity produce equivalent results. 4) Where imitation fails, compensate with methods that do not rely upon LLMs.

We apply this mental model to the task of improving LLM performance at playing Connections, a New York Times word puzzle game that prior research has identified as a useful benchmark for testing LLM complex reasoning performance. (Samadarshi et al., 2024; Todd et al., 2024). With these puzzles, a player is shown a four-by-four grid of 16 words and must identify four groups of four words that have a unique connection to one another. Each word appears in exactly one of the four groups, and each of the four groups is unique. For each game, the player is allowed to make up to three incorrect guesses and still solve the puzzle. Connections puzzles test a variety of reasoning skills for a player, and the open-ended and qualitative nature of Connections puzzles makes solving these puzzles a unique challenge for LLMs. The New York Times reports that, “In Connections, there’s no way to use math or even artificial intelligence to reliably solve the game.”1

Our experiments with GPT-4o show that a “Method Actors” approach can significantly improve LLM performance over both a vanilla and “Chain of Thoughts” approach. A vanilla approach solves 27% of the Connections puzzles in our dataset and solves 12% perfectly without making an incorrect guess. A “Chain of Thoughts” approach solves 41% of the puzzles and solves 20% perfectly. Our initial “Method Actor” approach solves 78% of the puzzles and solves 41% perfectly. Our revised “Method Actor” approach solves 86% of the puzzles and solves 50% perfectly.

After these initial experiments were performed, OpenAI released o1-preview, an LLM model that specializes in performing complex reasoning tasks like Connections puzzles.2 Further experiments with o1-preview reveal o1-preview’s superior baseline performance at Connections puzzles over GPT-4o and also reveal that using the “Method Actor” mental model can improve the rate at which o1-preview solves puzzles perfectly.

1 Loyola Law School, Loyola Marymount University, Los Angeles, CA, USA. Correspondence to: Colin Doyle <colin.doyle@lls.edu>.
1 https://www.nytimes.com/2024/08/27/upshot/connections-bot-faq.html
2 https://openai.com/o1/
Copyright 2024 by the author(s).


A “one shot” approach — in which o1-preview must solve the puzzle with one prompt and response — solves 79% of the puzzles and solves 72% perfectly. When allowed to build puzzle solutions one guess at a time over multiple API calls, a vanilla approach solves 100% of the Connections puzzles and solves 76% perfectly. A “Method Actor” approach solves 99% of the puzzles and solves 87% perfectly.

Based upon prior evaluations of human performance at solving Connections puzzles perfectly, the “Method Actor” approaches with GPT-4o perform better than human novices and slightly worse than human experts. (Samadarshi et al., 2024) Each of the o1-preview approaches surpasses human expert performance at solving puzzles perfectly, with the “Method Actor” approach having the strongest performance. A comparison of LLM performance measured against the New York Times’ assessment of puzzle difficulty reveals that puzzles that are easy for people also tend to be easy for LLMs, and puzzles that are difficult for people also tend to be difficult for LLMs.

This paper demonstrates how using “Method Actors” as a mental model for LLMs can improve LLM performance at Connections puzzles, but leaves open the opportunity to examine how this mental model affects LLM performance with other tasks. Our code and data are available to the public at https://github.com/colindoyle0000/llms-as-method-actors.

2. Related Work

2.1. Prompt Engineering and LLM System Architecture

In recent years, large language models have shown a remarkable improvement in performance on natural language processing tasks. (Chang et al., 2023). A wave of contemporary research focuses on improving large language models’ performance at complex tasks through novel prompting techniques. These techniques can encompass both prompt engineering and prompt architecture. Prompt engineering refers to techniques for writing individual prompts to elicit LLM responses that exhibit stronger performance at natural language tasks. Prompt architecture refers to methods for structuring multiple prompts and responses to better accomplish complex tasks through a series of LLM calls. In contrast to conventional methods for improving LLM performance, prompting techniques do not require extensive retraining or fine-tuning, making these methods both cost-effective and widely accessible. (Vatsal & Dubey, 2024). Prompting techniques such as Chain-of-Thought, Tree-of-Thoughts, ReAct, and Self-consistency with CoT have been shown to improve LLM performance at different complex reasoning tasks. (Wei et al., 2024; Yao et al., 2024a;b; Wang et al., 2024). Chain-of-Thought (CoT) is a prompt engineering technique for guiding an LLM to develop its reasoning step-by-step before reaching an ultimate conclusion. (Wei et al., 2024) Tree of Thoughts (ToT) builds upon CoT with the prompt architecture of a thought tree with multiple reasoning branches. (Yao et al., 2024a). But even state-of-the-art methods are limited. Their performance gains are limited to specific examples, and “concepts related to the LLM reasoning are not well-defined, hindering effective design of new more powerful schemes.” (Besta et al., 2024). Unexpected — and seemingly inexplicable — results continue to crop up, such as research that found that injecting emotional stimuli into prompts such as, “This is very important to my career,” can improve LLM reasoning performance. (Li et al., 2023). The goal of many of these techniques is to engineer a process by which an LLM mimics the steps of multi-step thinking and reasoning that humans perform. But “while many schemes rely on the notion of the LLM thought, it is not clear how it relates to concepts such as a prompt.” (Besta et al., 2024). Some research conceptualizes LLM responses not as “thoughts” but as imitations of the products of thought. (Banerjee et al., 2024; Ferrucci, 2010).

2.2. Connections Puzzles as a Reasoning Benchmark

The advent of large language models has led many researchers to evaluate the models’ performance playing and generating text-based reasoning games. (Gallotta et al., 2024). Prior work has evaluated LLM performance at NPR word puzzles; crossword puzzles; and the popular language-based board game, Codenames, among others. (Jaramillo et al., 2020; Zhao & Anderson, 2023; Yao et al., 2024a). Another line of research has evaluated LLM performance at generating text-based puzzles, including Connections puzzles. (Merino et al., 2024). Throughout the history of A.I., games have been used as benchmarks for model performance, including famous examples of chess, go, and Jeopardy. (Bory, 2019).

Connections puzzles have been proposed as a useful benchmark for testing LLM complex reasoning performance. (Samadarshi et al., 2024; Todd et al., 2024). Connections puzzles test a variety of reasoning skills for a player, including the ability to flexibly switch between methods of reasoning to solve a puzzle. The open-ended and qualitative nature of Connections puzzles makes solving these puzzles a unique challenge. Solving Connections puzzles might seem an impossible task for LLMs; the New York Times reports that, “In Connections, there’s no way to use math or even artificial intelligence to reliably solve the game.”3 At the same time, because Connections puzzles have clearly specified correct and incorrect answers, they provide objective criteria for evaluating an LLM system’s open-ended reasoning abilities.

3 https://www.nytimes.com/2024/08/27/upshot/connections-bot-faq.html


Prior work has found that LLMs can solve some Connections puzzles but that there is much room for improvement. Todd et al. (2024) found that GPT-4o could solve 38.93% of Connections puzzles within a Python-based system that used chain-of-thought prompting, used separate API calls to have the model submit guesses one at a time, and gave the model feedback about whether its guesses were correct or incorrect. Samadarshi et al. (2024) found that GPT-4o could solve only 8% of puzzles when asked to solve the puzzle completely with one chain-of-thought prompt and response. In evaluating LLM performance, Samadarshi et al. (2024) created a taxonomy of reasoning skills and knowledge required to solve Connections puzzles, noting where LLMs succeed and fall short.

Performance at Connections puzzles may be guiding the development of LLM models as well. While developing a specialized complex reasoning model, o1-preview, OpenAI employees reportedly used the model’s performance on Connections puzzles as proof of the model’s performance at complex reasoning tasks. (Lambert, 2024).

This paper builds on that research by evaluating the performance of GPT-4o using “Method Actors” as a mental model for guiding LLM prompt engineering and prompt architecture. The paper also evaluates the performance of OpenAI’s new o1-preview model, both with standard prompting and with a “Method Actor” prompt architecture.

3. Method Acting with Large Language Models

All the web’s a stage,
And all the models merely mimics;
They have their prompts and their tokens,
And one model in its code plays many roles,
Its scripts being endless acts.

Chat-GPT-4o (response to a prompt asking it to rewrite Shakespeare lines to be about LLMs as method actors)

We introduce “Method Actors” as a mental model for LLMs that can guide prompt engineering and prompt architecture. Under this mental model, LLMs should be thought of as actors; prompts as scripts and cues; and LLM responses as performances. LLMs and actors have a lot in common. Both mimic the product of human thought and emotion. Success for both actors’ and LLMs’ performances is often measured by a performance’s verisimilitude: how much the imitation of human thought and feeling seems authentic. (Chiang et al., 2024). Hallucinations are the sine qua non of both actors and LLMs. Actors’ performances are faithful to the text of a script but not to external reality, just as LLMs’ responses are faithful to the text of a prompt but not the external truth of the world. The analogy of LLMs as actors is more than cute — it’s useful. Imagining LLMs as actors performing a part can better align a user’s expectations with LLMs’ capabilities because prompting is more like giving a performer a cue than asking a robot mind for its thoughts. Four principles for prompt writing and task decomposition follow from this mental model:

Prompt engineering is playwriting and directing. Prompts should set the scene like a playwright would: providing the LLM with a character, motivation, setting, and stage direction. Beyond just assigning the LLM a role, the scene should provide motivation and direction, setting up a story for the LLM to perform. Direct instructions for an LLM are like stage directions for an actor. They should focus on the form of the LLM’s response, delineating the steps for an LLM to follow. Improvisation can be channeled effectively by scripting out the beats of the LLM’s performance and the patterns of language to use.

Performance requires preparation. Under this mental model, LLM responses are not understood as thoughts but as performances. Just as the verisimilitude of an acting performance depends upon off-screen preparation — the many smaller, unseen steps an actor takes to build up to the present mental state of the character — the verisimilitude of an LLM’s performance often requires similar background preparation. LLMs imitate the products of thinking, not thinking itself. For a complex performance, an LLM should be prompted to produce the products of any “behind the scenes” thinking required for a complex task. These intermediary performances can build incrementally to culminate in a complex final performance. To manage an LLM’s context window, this often requires separate API calls that feed information forward to subsequent LLM calls.

Complex tasks should be decomposed to the point at which imitation and authenticity produce equivalent results. When accurate results are required, a prompt architecture should be designed so that it relies not upon the truthfulness of an LLM’s response but rather upon the authenticity of the LLM’s performance. LLM responses are more reliable and accurate when an imitation of performing an action is equivalent to genuinely performing that action. The inverse is also true: LLMs are less reliable and more prone to error when an imitation of performing an action and genuinely performing that action produce divergent results. Therefore, a complex task should be decomposed into subtasks in which the imitation of performing the subtask and genuinely performing the subtask are equivalent.

Where imitation fails, compensate with methods that do not rely upon LLMs. Some complex tasks cannot be broken down into subtasks in which the imitation of performing the task and genuinely performing the task are equivalent. When this occurs, the broader LLM system or pipeline should be designed to compensate for potential errors through other programming methods.
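The staged architecture these principles imply can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: the function, the prompts, and the `stub_llm` helper are our own inventions, with the `llm` callable standing in for a real API client.

```python
def method_actor_pipeline(llm, words):
    """Stage the 'behind the scenes' work before the final performance."""
    # Preparation 1: brainstorm the products of thinking, not "thoughts."
    notes = llm(f"Brainstorm possible four-word groupings for: {words}")
    # Preparation 2: winnow the notes down to viable guesses.
    shortlist = llm(f"Extract only the viable guesses from these notes: {notes}")
    # The performance itself, cued by the prepared material.
    return llm(f"Select the single strongest guess from: {shortlist}")

# A stub LLM lets us trace the information flow without any API calls:
# it echoes back the instruction portion of whatever prompt it receives.
def stub_llm(prompt):
    return f"[response to: {prompt.split(':')[0]}]"

result = method_actor_pipeline(stub_llm, ["BUZZ", "STING", "LOU", "BABE"])
# result == "[response to: Select the single strongest guess from]"
```

Because each stage is a separate call whose output is fed forward into the next prompt, no single response has to carry the whole chain of reasoning in its context window.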


4. Experimental Setup

4.1. Game Details

Connections is an interactive word puzzle game published on the New York Times website and mobile games app.4 The game debuted in June 2023 and has quickly become one of the most popular and challenging puzzle games that the Times publishes. (Jennings, 2024; Schwedel, 2024). With these puzzles, a player is shown a four-by-four grid of 16 words and must identify four groups of four words that have a unique connection to one another. Each word appears in exactly one of the four groups, and each of the four groups is unique.

The rules for Connections are simple, but the puzzles are challenging. Players must use a variety of reasoning methods and draw upon different sources of knowledge. (Samadarshi et al., 2024). For each word, players must consider different possible meanings, relationships to words outside the puzzle, and unique linguistic attributes. The challenge of a Connections puzzle goes beyond brainstorming attributes for each word and lies in successfully grouping words together under a set of shared attributes. Table 1 shows Connections puzzle #430 for August 14, 2024. If you’re not already familiar with Connections puzzles, give this puzzle a try before you read ahead. By attempting to solve this puzzle, you’ll get a better sense of Connections puzzles’ level of difficulty and the reasoning skills required. See if you can identify four groups of four words that have a connection to one another. For the full interactive experience, you can do the puzzle online here: https://connections.swellgarfo.com/nyt/430.

BOO-BOO   GAFFE     STING    POLLINATE
DANCE     LOU       MICKEY   BUZZ
PETUNIA   JASMINE   YOGI     FLUB
POPPY     BABE      DAISY    GOOF

Table 1. Connections Puzzle #430

Connections puzzles are designed with a difficulty scale in mind. The Times labels each of the correct connections with a color — yellow, green, blue, or purple — representing the difficulty of identifying that connection. Each puzzle has one connection corresponding to each color. The connections for the easiest categories, yellow and green, are often words that are synonyms or words that belong to the same general category. For this puzzle, the yellow connection is “BLUNDER” and comprises the synonyms “BOO-BOO,” “FLUB,” “GAFFE,” and “GOOF.” The green connection is “FLOWERS” and includes different types of flowers: “DAISY,” “JASMINE,” “PETUNIA,” and “POPPY.” The more challenging connections typically require more precise knowledge or more creative thinking. The blue connection here is “THINGS BEES DO,” which are “BUZZ,” “DANCE,” “POLLINATE,” and “STING.” The purple connection is “FIRST NAMES OF YANKEES LEGENDS”: “LOU,” “BABE,” “YOGI,” and “MICKEY.” Connections puzzles often include red herrings: connections between words that are not a part of the solution to the puzzle. In this puzzle, one red herring would be “DISNEY CHARACTERS,” which could include the words “MICKEY,” “DAISY,” and “BUZZ.” Another red herring could be “CARTOON CHARACTERS,” which could include “BOO-BOO,” “YOGI,” “MICKEY,” and “DAISY.”

When playing Connections, users submit guesses one by one by selecting four words and clicking submit. For each game, the player is allowed to make up to three incorrect guesses and still solve the puzzle. Once the player has made either four correct guesses or four incorrect guesses, the game ends.

4.2. Data

Connections puzzles are released on a daily basis. This experiment used 100 recent New York Times Connections puzzles, #331 through #430, which were released between May 7, 2024 and August 13, 2024. The data for each puzzle, including the words and the solution, was acquired from a third-party website that collects past and present New York Times Connections puzzle information.5

5. Experiment: GPT-4o

The first experiment compares the performance of different approaches for solving Connections puzzles using OpenAI’s GPT-4o model: Vanilla, Chain-of-Thought, Chain-of-Thought (Scripted), Actor, and Actor-2. The first three approaches vary only in the natural language content of the prompts. The two Actor approaches vary both in the natural language content of the prompts and in the prompt architecture.

Some features are common to all five approaches. The LLM is asked to make and submit guesses one at a time. If the guess is a valid guess,6 the guess is submitted and checked against the puzzle solution.

4 https://www.nytimes.com/games/connections
5 https://connections.swellgarfo.com/archive
6 Valid guesses contain four words that are part of the list of words remaining to solve in the puzzle and are not identical to a set of four words of an incorrect guess that was already submitted. For each approach, the LLM system is prevented from submitting an invalid guess. If the guess to be submitted is not formatted correctly to be a set of four words remaining in the puzzle, the guess selection process starts over. If the guess to be submitted is a guess that was previously submitted and determined to be incorrect, the incorrect guess is not submitted a second time, and the guess selection process starts over.
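The game mechanics above, including the guess-validity rule in footnote 6, can be modeled in a short Python sketch. The class and method names are our own, not taken from the paper's repository.

```python
class ConnectionsGame:
    """Minimal model of the Connections rules described above."""

    def __init__(self, solution):
        # solution: mapping of connection name -> the four words in that group
        self.solution = {name: frozenset(g) for name, g in solution.items()}
        self.remaining = set().union(*self.solution.values())  # 16 words in play
        self.incorrect = []  # guesses known to be wrong

    def is_valid(self, guess):
        # Footnote 6: four distinct words, all still in play, and not a
        # repeat of an already-submitted incorrect guess.
        attempt = frozenset(guess)
        return (len(attempt) == 4 and attempt <= self.remaining
                and attempt not in map(frozenset, self.incorrect))

    def submit(self, guess):
        """Return True if the guess exactly matches one of the four groups."""
        attempt = frozenset(guess)
        if attempt in self.solution.values():
            self.remaining -= attempt
            return True
        self.incorrect.append(guess)
        return False

    @property
    def lost(self):
        # Three incorrect guesses are allowed; a fourth ends the game.
        return len(self.incorrect) >= 4


game = ConnectionsGame({
    "BLUNDER": {"BOO-BOO", "FLUB", "GAFFE", "GOOF"},
    "FLOWERS": {"DAISY", "JASMINE", "PETUNIA", "POPPY"},
    "THINGS BEES DO": {"BUZZ", "DANCE", "POLLINATE", "STING"},
    "FIRST NAMES OF YANKEES LEGENDS": {"LOU", "BABE", "YOGI", "MICKEY"},
})
game.submit(["MICKEY", "DAISY", "BUZZ", "BOO-BOO"])  # red herring: False
game.submit(["BOO-BOO", "FLUB", "GAFFE", "GOOF"])    # correct: True
```

Representing each group as a `frozenset` makes correctness and validity checks order-independent, which mirrors how the game treats a guess as an unordered set of four words.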


If the guess is correct, those four words are removed from consideration for subsequent guesses. If the guess is incorrect, that guess is saved as an incorrect guess that the LLM system cannot submit again. If the system has already submitted any incorrect guesses, a list of guesses known to be incorrect is included in subsequent prompts. Upon making three correct guesses, the system automatically submits the fourth guess because — by process of elimination — it is the only possible guess that remains, even if the LLM has not identified the connection that those four words share. The process ends when the system has submitted either four correct or four incorrect guesses.

5.1. Vanilla

The Vanilla approach sets a baseline for how GPT-4o performs at solving Connections puzzles. Each guess consists of two separate API calls to the LLM. The first call prompts the LLM to guess one four-word group that represents part of the solution to the puzzle. The prompt is an edited version of the instructions on The New York Times Connections website that provides the LLM with more details about the puzzle than the website instructions provide. If the LLM cannot come up with a good guess, it is allowed to not submit a guess.7 The second call takes the LLM’s response to the first prompt and asks the LLM to extract the guess information and format the guess in a way for subsequent steps to parse and submit.

5.2. Chain-of-Thought

The Chain-of-Thought approach is identical to the Vanilla approach except for changes to the “make a guess” prompt. In this prompt, the LLM is assigned the role, “You are a professional puzzle solver,” and at the end of the prompt the LLM is instructed, “Let’s think this through step-by-step.” This now-ubiquitous closing instruction has been demonstrated many times over to improve LLM reasoning performance by having the LLM work through reasoning steps ‘aloud’ in the output of its responses. (Wei et al., 2024; Kojima et al., 2023).

5.3. Chain-of-Thought (Scripted)

The Chain-of-Thought (Scripted) approach is identical to the Chain-of-Thought approach except for two changes to the “make a guess” prompt. First, the prompt includes a more expansive, carefully curated set of examples of puzzle solutions. Second, the prompt instructs the LLM to approach the puzzle-solving task following a particular set of steps.

The prompt includes twenty-three examples of correct guesses to Connections puzzles. By comparison, the Vanilla and Chain-of-Thought approaches only had two examples of correct guesses (taken from the instructions on the New York Times website). The examples were chosen to represent each of the types of answers seen in Connections puzzles.8

The prompt includes instructions on a two-step process for the LLM to follow. The first step is to identify two words that have a connection with each other. The second step is to look through the remaining words and see if other words share the same connection. The LLM is also instructed to start over at the first step if it is unable to identify a group of four words that share a connection. These instructions were chosen to structure the LLM’s thinking to follow a method commonly used by people solving Connections puzzles.

5.4. Actor

The Actor approach was designed according to the principles articulated in Section 3: prompt engineering is playwriting and directing; performance requires preparation; complex tasks should be decomposed to the point at which imitation and authenticity produce equivalent results; and where imitation fails, compensate with methods that do not rely upon LLMs.

The prompts for this approach include dramatic scene setting and role definition. The prompts inform the LLM that it is a professional puzzle solver who has been brought in by the FBI because terrorists have planted a bomb inside a children’s hospital, and the only way to defuse the bomb is by solving this word puzzle correctly. An informal analysis comparing the LLM responses from prompts that either used or didn’t use this scene-setting text revealed that including this scene-setting text resulted in the LLM using more of its output context window to keep thinking through possible solutions and less frequently concluding that it could not find an answer.

For prompt architecture, the task of solving Connections puzzles is divided into two stages: brainstorming and discernment. The brainstorming stage consists of five separate prompts for the LLM to brainstorm potential guesses. Because the answers to Connections puzzles follow particular patterns, the LLM’s task is framed to imitate these past performances. The process cycles through a set of 24 templates representing the patterns of answers featured in Connections puzzles.9

7 The prompt includes the instruction, “If you still can’t identify a guess to submit, say ‘I can’t identify a good guess to submit.’” When this occurred, the system would not submit a guess and would restart the process of generating a guess with a new API call.
8 These templates were drawn from real Connections puzzles from December 2023 and March 2024.
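The template-cycling brainstorm can be sketched as follows. The three template descriptions are examples the paper quotes from its template set; the `TEMPLATES` rotation and the `brainstorm_prompt` helper are hypothetical stand-ins for the real 24-template system.

```python
from itertools import cycle

# Each brainstorming call applies one answer-pattern template to the
# current word list. Three of the 24 patterns, quoted from the paper:
TEMPLATES = cycle([
    "words that can be synonymous adjectives with each other",
    "words that share a pop culture reference",
    "words that are each followed by the same word or phrase",
])

def brainstorm_prompt(remaining_words):
    """Build the next brainstorming prompt by rotating through templates."""
    pattern = next(TEMPLATES)
    return f"From {sorted(remaining_words)}, find four {pattern}."

p1 = brainstorm_prompt({"BUZZ", "STING", "LOU", "BABE"})
p2 = brainstorm_prompt({"BUZZ", "STING", "LOU", "BABE"})
```

Cycling deterministically through the pattern library is what frames each brainstorming call as an imitation of a past answer rather than open-ended ideation.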


For each brainstorming call, one of the 24 templates is selected and the LLM is prompted to generate a guess by applying that template to the word list for the current puzzle. Each template includes a description of a particular pattern of an answer along with examples of that pattern from prior puzzles. For example, the LLM may be asked to identify “words that can be synonymous adjectives with each other” or “words that share a pop culture reference” or “words that are each followed by the same word or phrase.” The templates include a script of multi-step instructions on the process for the LLM to follow.

After brainstorming guesses, discernment begins. Discernment is more of a challenge for decomposition than brainstorming. With brainstorming, authentically brainstorming potential guesses is more-or-less equivalent to the imitation of brainstorming potential guesses, particularly when the imitation is accomplished through applying the patterns of past answers to new words. The task of discernment involves judgment and decision-making about a unique set of possible guesses. Brainstorming can be scripted for an actor in a way that discernment cannot. Because an imitation of discernment is unlikely to be equivalent to discernment, the prompt architecture needs to compensate for potential weaknesses. This is accomplished by giving the LLM multiple passes to winnow down potential guesses and by providing discernment scripts with criteria for the LLM to apply — as much as it is possible to do so.

The discernment process begins with an “extract” stage to reduce the quantity of information for the LLM to process to only include viable guesses. The LLM is called to extract from the brainstorming notes any valid potential guesses and the rationale behind those guesses. This information is passed on to the “discern” stage in which the LLM is asked to discern the strongest guess from among the possible guesses. Lastly, at a “decide” stage, the LLM is prompted to decide whether the process has produced a guess worth submitting. Guesses worth submitting are stockpiled. Once five guesses are ready to submit, a final “evaluation” stage begins.10 The LLM is asked to consider the potential guesses and select the strongest one to submit. This process is similar to the “Tree of Thoughts” framework and other approaches that involve generating ideas and then iteratively pruning down possibilities until only the strongest options remain. (Yao et al., 2024a;c).

5.5. Actor-2

A second “method acting” approach, Actor-2, was conducted to test additional methods for working around LLM shortcomings within the Method-Actor approach. A qualitative analysis of the LLM responses at each stage in the Actor process revealed that the weak point in the process was the discernment stage. The LLM was able to successfully brainstorm many different possible guesses but struggled to distinguish between the good and bad guesses.

Under the principles spelled out in Section 3, options for revision include further decomposition or engineering around the LLM’s shortcomings. Given that further decomposition did not seem viable, Actor-2 instead takes the LLM’s weakness at discernment as a given and attempts to engineer the broader system to compensate for this weakness. Two changes are introduced. First, external validation criteria must be satisfied before a guess that the LLM has chosen to submit is actually submitted. Second, a validation process is introduced to filter out hallucinations.

Actor-2 keeps the brainstorm and discern processes the same as the Actor approach but revises the evaluate process to no longer rely exclusively upon the LLM’s discernment over which potential guesses are strongest. This approach uses deterministic logic to help the system navigate around red herrings, which LLMs struggle to identify. After the LLM chooses to submit a guess, the guess is added to a “final guesses” list instead of being submitted right away. The system analyzes — without an LLM’s involvement — whether any guesses in the “final guesses” list form unique pairs (i.e., there is no overlap of words between the two guesses). Once the “final guesses” list includes two guesses that are unique pairs, the guesses are submitted. In this way, the system deterministically reduces the frequency of submitting red herrings. Alternatively, if the same guess appears in the “final guesses” list three times, the guess is submitted. One potential problem with the “final guesses” list is that it risks making the system much less efficient. The system may produce the same guess over and over again. To ensure that the LLM produces a diversity of guesses and explores the larger puzzle space of possibilities, words are sometimes removed from the possible words list when generating guesses. The removed words are words from guesses in the “final guesses” list and words from guesses that are waiting to be evaluated.

Once the system has submitted at least two incorrect guesses, the discern stage includes a validation process for filtering out hallucinations. If a proportion of hallucinated guesses can be filtered out from consideration, then the LLM will submit fewer incorrect guesses. To ferret out hallucinations, the LLM’s prompts include “mole” words within the list of possible words that could comprise a guess for the puzzle. These “mole” words are randomly selected words that are part of a correct guess that has already been submitted. If the LLM chooses to submit a guess that has one of these “mole” words in it, the guess is rejected as invalid.

9 Like the examples used in the Chain-of-Thought (Scripted) approach, these templates were drawn from real Connections puzzles from December 2023 and March 2024.
10 After two correct guesses have been submitted, the evaluation process starts once three guesses are ready to submit.
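Actor-2's two deterministic safeguards, the unique-pair submission rule and the mole-word rejection, can be sketched as pure functions. The function names and data here are illustrative, not from the paper's code.

```python
from collections import Counter
from itertools import combinations

def ready_to_submit(final_guesses):
    """Apply Actor-2's deterministic submission rules to the guess list."""
    sets = [frozenset(g) for g in final_guesses]
    # Rule 1: two guesses with no overlapping words form a unique pair.
    for a, b in combinations(sets, 2):
        if a.isdisjoint(b):
            return [a, b]
    # Rule 2: the same guess has appeared three times.
    repeated = [g for g, n in Counter(sets).items() if n >= 3]
    return repeated or None

def is_hallucination(guess, mole_words):
    """Reject any guess containing a planted 'mole' word."""
    return any(word in mole_words for word in guess)

finals = [
    ["BUZZ", "DANCE", "POLLINATE", "STING"],
    ["BUZZ", "DANCE", "STING", "MICKEY"],  # overlaps the first guess
    ["LOU", "BABE", "YOGI", "MICKEY"],     # disjoint from the first
]
pair = ready_to_submit(finals)  # the first and third guesses
bad = is_hallucination(["LOU", "BABE", "YOGI", "GOOF"], {"GOOF", "FLUB"})
```

Because both checks are ordinary set operations rather than LLM calls, they cannot themselves hallucinate; they narrow the space of submittable guesses regardless of how the discernment stage behaves.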


the LLM chooses to submit a guess that has one of these "mole" words in it, the guess is rejected as invalid.

The effect of this approach for filtering out hallucinations can be illustrated mathematically. The process assumes that hallucinated incorrect guesses are equivalent to selecting a guess by randomly selecting four words from the list of possible words. If two correct guesses have already been submitted, then the LLM will be selecting a group of four words out of a list of ten words. Eight of the words are legitimate and two of the words are "mole" words. The frequency with which a random selection of four words would not include one of the "mole" words is:

8/10 × 7/9 × 6/8 × 5/7 = 1/3

Provided that hallucinated incorrect guesses are the same as the LLM randomly selecting words, the inclusion of two "mole" words should allow the system to detect and reject two thirds of the hallucinated incorrect guesses it generates. A risk with this approach is that, by including extraneous information, the approach may cause the system to generate more hallucinated incorrect guesses.

6. Experiment: o1-preview

After the first experiment was performed, OpenAI released o1-preview, an LLM model that specializes in performing complex reasoning tasks like Connections puzzles. (OpenAI, 2024) A second experiment compares the performance of three different approaches using the o1-preview model: Oneshot-o1, Vanilla-o1, and Actor-o1.

6.1. Oneshot-o1

With the Oneshot-o1 approach, the LLM is prompted to solve the puzzle in its entirety within one response. With o1-preview, "one-shot" is a slight misnomer because — although the LLM is given one prompt and returns one response — o1-preview uses a backend, hidden reasoning process to "think through" its response before writing it. (OpenAI, 2024). OpenAI has disclosed few details about this process beyond saying that o1-preview follows a chain of thought that has been honed for reasoning tasks through reinforcement learning. (OpenAI, 2024). The process involves the creation of intermediary responses comprised of "reasoning" tokens that are kept hidden from the user. On the browser-based chat interface for o1-preview, the conversation history includes summaries of intermediary reasoning steps the model has taken to think through a response. At the time of running the experiment, these steps were not available via the API used for the experiment.

The prompt is an edited version of the instructions on The New York Times Connections website that provides the LLM with more details about the puzzle than the website instructions provide.

6.2. Vanilla-o1

The Vanilla-o1 approach is identical to the Vanilla approach from the GPT-4o experiment, except that it uses o1-preview as the model. It differs from the Oneshot-o1 approach in that the LLM makes guesses one at a time rather than making all four guesses at once. The LLM receives feedback on whether prior guesses were correct or incorrect and can adjust its subsequent guesses accordingly. The process ends when the system has submitted either four correct or four incorrect guesses.

6.3. Actor-o1

The Actor-o1 approach adapts the Actor-2 approach for the o1-preview model. Because o1 already includes a backend reasoning process that was built to incorporate prompting techniques such as chain-of-thought, OpenAI discourages using these prompting techniques with o1-preview. Accordingly, the brainstorming and discern processes are simplified into one LLM call to select a guess to submit. Once five guesses are ready to submit, the LLM is asked to consider the potential guesses and select one to submit.11 That guess is added to the "final guesses" list. Given that o1-preview is more adept at generating correct guesses than GPT-4o, the Actor-o1 approach uses slightly different deterministic logic than Actor-2 for submitting guesses from the "final guesses" list. The system analyzes — without an LLM's involvement — whether the guesses on the "final guesses" list form unique pairs, triplets, or quadruplets. When the system has arrived at a unique quadruplet — four guesses that share no words in common — the system will submit that guess as a potential full solution to the problem. By waiting until the LLM has discerned four guesses without any overlap between them, red herrings are avoided because the four legitimate unique guesses cannot be made with a red herring among them. But the LLM system cannot be expected to always identify all four correct guesses within a reasonable timeframe. After the system has processed at least thirteen guesses, if the "final guesses" list includes three unique guesses, the system will submit those guesses. After more than fifteen guesses have been generated, the system will submit any two guesses on the "final guesses" list that don't overlap with one another. As with Actor-2, if the same guess appears in the "final guesses" list three times, the guess is submitted. Actor-o1 uses the same "mole" word validation process as Actor-2 to filter out hallucinations.

11 If two correct guesses have already been submitted, once three guesses are ready to submit, the LLM is asked to consider the potential guesses and select one to submit.
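The deterministic checks described in this section, the "mole" word filter and the search for non-overlapping guesses on the "final guesses" list, run without any LLM involvement. A minimal sketch of both, assuming guesses are represented as sets of words; the helper names are illustrative and not drawn from the experiment's actual code:

```python
from itertools import combinations
from math import comb

def passes_mole_filter(guess, mole_words):
    # Reject any guess containing a "mole" word, i.e., a word that
    # already belongs to a submitted correct guess.
    return not (set(guess) & set(mole_words))

def find_disjoint_quadruplet(final_guesses):
    # Return four guesses that share no words in common, or None.
    # Such a quadruplet is a candidate full solution to the puzzle.
    for quad in combinations(final_guesses, 4):
        words = [w for g in quad for w in g]
        if len(words) == len(set(words)):
            return quad
    return None

# Sanity check on the filtering math above: with 8 legitimate words
# and 2 "mole" words, a random four-word guess avoids both moles
# with probability C(8,4) / C(10,4) = 1/3.
p_avoid_moles = comb(8, 4) / comb(10, 4)
```

Because the quadruplet check is pure set logic, it can be rerun cheaply each time the LLM adds a guess to the "final guesses" list.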
7. Results

7.1. GPT-4o

Overall, chain-of-thought prompting improved GPT-4o's performance at puzzle solving, and the method actor approaches improved performance even more. Table 2 shows each approach's performance. The "puzzles solved" column captures how frequently each approach correctly solved the puzzle by submitting the four correct guesses before submitting four incorrect guesses and losing the game. The "solved perfectly" column captures how frequently each approach submitted four correct guesses without submitting any incorrect guesses.

Approach        Puzzles Solved    Solved Perfectly
Vanilla         27%               12%
CoT             41%               20%
CoT-Scripted    56%               24%
Actor           78%               41%
Actor-2         86%               50%

Table 2. Success Rates for Solving Connections Puzzles

Consistent with prior research, the baseline vanilla approach performed worst with 27% correct, and chain-of-thought prompting improved performance to 41% correct. (Todd et al., 2024). Supplementing chain-of-thought prompting with instructions and a curated set of examples improved performance to 56%. The "Method Actor" approaches performed best, with Actor solving 78% of puzzles and Actor-2 solving 86% of puzzles. The weakest approach (Vanilla) solved only 12% of puzzles perfectly without submitting a bad guess, while the strongest approach (Actor-2) solved 50% of puzzles perfectly.

7.2. o1-preview

As Table 3 illustrates, the o1-preview model represents a marked improvement over prior models at solving connections puzzles.

Approach        Puzzles Solved    Solved Perfectly
Oneshot-o1      79%               72%
Vanilla-o1      100%              76%
Actor-o1        99%               87%

Table 3. o1-preview Success Rates for Solving Connections Puzzles

Given that the Oneshot-o1 approach was given only one attempt to answer the puzzle correctly, the discrepancy between its "Puzzles Solved" and "Solved Perfectly" rates deserves explanation. For 7% of the puzzles, the model's response included three correct guesses and one incorrect guess. Because the other approaches in the experiment received credit for solving the puzzle once they had submitted three correct guesses, in these circumstances the one-shot approach received credit for solving the puzzle but not for solving the puzzle perfectly.

7.3. All Results

A comparison across all approaches reveals that both shifting from GPT-4o to o1-preview and incorporating a "Method Actor" approach to prompt engineering and architecture improve performance. A one-shot approach using o1-preview achieves a comparable success rate to the Actor approach with GPT-4o, although it does not perform as strongly as the Actor-2 approach with GPT-4o.

Figure 1 captures the proportion of good and bad guesses submitted by each approach. Although the Vanilla-o1 approach correctly solved 100 puzzles compared to the Actor-o1 approach correctly solving only 99 puzzles, the Vanilla-o1 approach submitted nearly twice as many incorrect guesses overall compared to Actor-o1. Out of 400 possible incorrect guesses (4 incorrect guesses for each of the 100 puzzles), Vanilla-o1 submitted 35 incorrect guesses and Actor-o1 submitted 19 incorrect guesses.

Figure 1. Proportion of Good and Bad Guesses Submitted

7.4. Success measured against puzzle difficulty

The success of each approach can also be measured against the difficulty of the puzzles in the dataset. The New York Times scores each puzzle's difficulty from 1 to 5. "The difficulty of each puzzle is determined by averaging the ratings provided by a panel of testers who are paid to solve each puzzle in advance to help us catch bugs, inconsistencies and other issues. A higher rating means the puzzle is more difficult."12 For this set of 100 puzzles, the difficulty ranged from 1.6 to 4.2. Figure 2 plots each approach's performance against these puzzles according to the puzzle difficulty, grouping difficulties in the ranges below 2.5 (28 puzzles), 2.5-3 (26 puzzles), 3-3.5 (33 puzzles), and above 3.5 (13 puzzles).

The general trend across all experimental approaches was one of higher performance for easier-rated puzzles and weaker performance for harder-rated puzzles. This tendency is most pronounced for Oneshot-o1 and the weaker approaches (Vanilla, CoT, CoT-Scripted).

12 https://www.nytimes.com/2024/08/13/crosswords/connections-companion-430.html

8. Discussion

8.1. Comparison with prior work

This paper's results for Vanilla and Chain-of-Thought approaches are consistent with prior work testing similar approaches. Todd et al. found that Chain-of-Thought prompting with GPT-4-Turbo correctly solved 38.93% of connections puzzles and solved 23.46% perfectly. (Todd et al., 2024). In our experiments, Chain-of-Thought prompting with GPT-4o correctly solved 41% of connections puzzles and solved 20% perfectly. In Samadarshi et al.'s experiment, LLMs were allowed one attempt to solve a connections puzzle without receiving feedback on any incorrect guesses, which aligns with our experiment's measurement of whether an LLM has solved a connections puzzle perfectly. (Samadarshi et al., 2024). Their vanilla approach with GPT-4o perfectly solved 5% of the puzzles. Our vanilla approach with GPT-4o perfectly solved 12% of the puzzles. The increase in performance for our vanilla approach may be due to differences in experimental design, including a validation step that prevented the submission of invalid answers and the decomposition of guess selection into multiple LLM calls. Other differences may be attributable to different connections puzzle datasets and variations in LLM performance, either due to inherent randomness or changes to the GPT-4o model over time.

8.2. Comparison with humans

Based upon prior evaluations of human performance at solving connections puzzles perfectly, the method-actor approaches with GPT-4o perform better than human novices and slightly worse than human experts. And each of the o1-preview approaches surpasses human expert performance at solving puzzles perfectly, with Actor-o1 having the strongest performance. Samadarshi et al. found that novice human players solved 18% of connections puzzles perfectly and that expert human players solved 60% of connections puzzles perfectly. (Samadarshi et al., 2024) In our experiments, the rate of solving connections puzzles perfectly was 41% for Actor, 50% for Actor-2, 72% for Oneshot-o1, 76% for Vanilla-o1, and 87% for Actor-o1.

Puzzles that are easy for people tend to be easy for LLMs, and puzzles that are difficult for people tend to be difficult for LLMs. This was not a foregone conclusion, as one might expect people and LLMs to excel and fail at different kinds of puzzles. Figure 2 indicates that LLM performance tends to decrease as puzzle difficulty increases. Similarly, when an LLM only partially solved a connections puzzle, the correct guesses that it submitted tended to be the easier guesses within the puzzle. For each puzzle, the New York Times labels each of the correct guesses with a color — yellow, green, blue, or purple — representing the difficulty of identifying that connection from easiest (yellow) to most difficult (purple). Each puzzle has one correct guess in each color. As an example of LLM performance measured against guess difficulty, for the 21 puzzles that the Actor approach could only partially solve, the correct guesses that the LLM submitted included:

• 16 of the easiest guesses (yellow)
• 15 of the easy guesses (green)
• 3 of the difficult guesses (blue)
• 0 of the most difficult guesses (purple)
Figure 2. Success Rate by Puzzle Difficulty Group for Each Approach

8.3. Comparison between GPT-4o and o1-preview

Given the same prompts, the o1-preview model's performance greatly exceeded the GPT-4o model's performance. Whereas prior research found that GPT-4o could solve only 5% of connections puzzles given a single prompt, o1-preview was able to solve 79% of Connections puzzles with this method. (Samadarshi et al., 2024) The method-actor approaches improved overall performance for GPT-4o the most, but the o1-preview approaches were able to consistently solve puzzles that no GPT-4o approach could solve. GPT-4o approaches struggled the most with puzzles in which the connection between the words was in the form of another word that could appear either immediately before or immediately after each of the four puzzle words. For example, all of the GPT-4o approaches failed to identify that the connection between the words "MED," "MUSIC," "SANDWICH," and "SODA" was that each of these words could be preceded by the word "CLUB," whereas all of the o1-preview approaches correctly identified this connection. Similarly, GPT-4o approaches uniformly struggled to identify some linguistic connections that the o1-preview approaches consistently identified, such as the connection between the words "AARDVARK," "EERIE," "LLAMA," and "OOZE" being that each word begins with double letters.

8.4. Actor-o1's One Incorrect Puzzle

Because the Actor-o1 approach submitted the correct solution to every puzzle except one, its performance on that puzzle deserves closer scrutiny. This puzzle was #410, rated at a medium difficulty of 2.6 out of 5 by the New York Times. Despite Actor-o1's strong performance in the overall experiment — including perfectly solving 87% of puzzles, the highest among all approaches — Actor-o1 was the only approach out of all seven approaches that failed to solve Puzzle #410. Actor-o1 was not even close to solving the puzzle, as it submitted only 1 correct guess and 4 incorrect guesses.

What accounts for Actor-o1's failure? It may be due to chance. The behavior of language models is not deterministic. A limit of our experimental design is that each of the approaches was run only once for each puzzle due to cost constraints. Across multiple iterations, other approaches may have failed, and Actor-o1 may have succeeded. What appears as an aberration in our current results may have disappeared if multiple iterations were run.

As an informal test of this possibility, we ran each of the seven approaches three more times on puzzle #410. Random chance does not appear to be the culprit here. As Table 4 demonstrates, the Actor-o1 approach seems to have a peculiar difficulty with this puzzle. Over four iterations, only the Vanilla approach performed worse. Setting aside the Vanilla approach, Actor-o1 solved the puzzle less frequently and submitted more incorrect guesses than all other approaches. Notably, Oneshot-o1 performed perfectly, solving the puzzle correctly four times without submitting a single incorrect guess.

The reason Actor-o1 failed so frequently may be due to a vulnerability within the prompt architecture for this unique kind of puzzle. Puzzle #410 features a higher number of red herrings compared to the typical connections puzzle. Rather than have the LLM sort out red herrings, the Actor-o1 approach deals with red herrings by being hard coded to wait to submit guesses until the LLM has independently chosen
to submit unique pairs, triplets, or quadruplets of guesses. At the beginning of the guessing process, the system will submit only unique quadruplets of guesses, but this standard relaxes over time. The system will also submit guesses that the LLM has chosen to submit three times. Most of the time, this process is effective. But with a puzzle that features many red herrings, the Actor-o1 approach can settle on incorrect guesses. In contrast, the Oneshot-o1 approach must settle on a quadruplet of guesses in one response, causing it to sort out the red herrings that Actor-o1 ignores. This likewise explains how Oneshot-o1 outperformed Vanilla-o1, with Oneshot-o1 submitting 0 incorrect guesses compared to Vanilla-o1's 9 incorrect guesses. Vanilla-o1 makes guesses one at a time, which can lead to the model being tricked by red herrings. Although Actor-o1 and Vanilla-o1 outperformed Oneshot-o1 across 100 puzzles, the one-shot approach is better suited for the challenge of this particular kind of puzzle. To improve performance, one potential adjustment to Actor-o1's prompt architecture would be to permanently delay submitting guesses until the system has produced four unique guesses. The downside to this adjustment would be higher compute costs. Another adjustment would be a validation step in which the model only submits a guess if it can brainstorm a credible total solution to the puzzle that includes the current guess. A risk with this approach is that, when instructed to find connections between words, LLMs have a tendency to find connections whether the connection is credible or not.

Approach        Solve Rate    # of Bad Guesses
Vanilla         1/4           15
CoT             4/4           7
CoT-Scripted    3/4           8
Actor           4/4           8
Actor-2         4/4           7
Oneshot-o1      4/4           0
Vanilla-o1      4/4           9
Actor-o1        2/4           13

Table 4. Success rates across four attempts at Puzzle #410

9. Future Work

This paper demonstrates how using "Method Actors" as a mental model for LLMs can improve LLM performance with one particular complex reasoning task, but leaves open the opportunity to examine how this mental model affects LLM performance with different reasoning tasks or tasks distinct from reasoning, such as creative writing. There's an opportunity to draw upon the acting literature to test whether methods for improving actors' performances might also improve LLM performances. Just as method acting principles revolutionized acting in the mid-20th century to produce a new form of authentic onstage and onscreen performances (Hirsch, 2014; Butler, 2022), similar principles may have the potential to produce stronger LLM performance today.

In these "Method Actor" approaches, LLMs performed a human reasoning task — solving connections puzzles — but not by taking the same steps that a human would. In contrast, the predominant approaches for prompt architecture for complex reasoning tend to guide an LLM to imitate the human process for reasoning through a problem. (Wei et al., 2024; Yao et al., 2024b;a). Implicitly or explicitly, these approaches treat LLM responses as equivalent to human thoughts and use these thoughts as building blocks for a structure that can tackle complex problems in a human-like way. In contrast, a method-actor approach treats LLM responses as equivalent to performances and uses these performances as building blocks for a structure that can tackle complex problems in a way that deviates from how humans would solve these problems.

Although designing prompt architecture to mimic human structures of cognition may expand the reasoning abilities of LLMs, the field should not be confined to this approach. At times, it can be a useful analogy to think of LLM responses as thoughts. But LLM responses are not precisely equivalent to thoughts, and the analogy may obscure other viable methods for prompt engineering and architecture. Novel structures, built from mental models of LLM responses as something other than thoughts, may achieve comparable or superior results.

10. Conclusion

We have introduced "Method Actors" as a mental model for guiding LLM prompt engineering and prompt architecture for complex reasoning. Our experiments with GPT-4o demonstrate that using the mental model for prompt writing and task decomposition can lead to significant improvements in performance at solving Connections puzzles. Further experiments with o1-preview reveal o1-preview's superior baseline performance at Connections puzzles over GPT-4o and also reveal that using the "Method Actor" mental model can improve the rate at which o1-preview solves puzzles perfectly. Future work can evaluate how this mental model affects LLM performance in other domains and how novel mental models can lead to unique and effective prompting methods.

11. Impact Statement

LLMs' ability to manipulate language and perform complex reasoning tasks is likely to have significant impacts on knowledge work and the economy at large. Only a handful of people in the world do Connections puzzles as part of their job — that group may be limited to the professional puzzle testers that the New York Times pays to test its
puzzles each day. So automating the task of solving Connections puzzles is not a direct threat to anyone's livelihood. But the prospect of automating complex reasoning tasks means that many parts of knowledge work jobs that were previously impervious to automation are now under threat. LLM systems may perform certain kinds of knowledge work more cheaply and efficiently than human labor, but the broader impact on public welfare is uncertain and is dependent on other political and economic factors. Wider adoption of the "Method Actors" approach to LLM system design may also have a negative environmental impact. Compared to simpler one-shot methods, more complex LLM systems require greater computational resources because the system design involves synthesizing and leveraging the results of many separate API calls to an LLM. At the same time, the use of many independent API calls creates a system with greater transparency and interpretability.

References

Banerjee, S., Agarwal, A., and Singla, S. LLMs Will Always Hallucinate, and We Need to Live With This. 2024. URL http://arxiv.org/abs/2409.05746.

Besta, M., Memedi, F., Zhang, Z., Gerstenberger, R., Piao, G., Blach, N., Nyczyk, P., Copik, M., Kwaśniewski, G., Müller, J., Gianinazzi, L., Kubicek, A., Niewiadomski, H., O'Mahony, A., Mutlu, O., and Hoefler, T. Demystifying Chains, Trees, and Graphs of Thoughts. 2024. URL http://arxiv.org/abs/2401.14295.

Bory, P. Deep new: The shifting narratives of artificial intelligence from Deep Blue to AlphaGo. 25(4):627–642, 2019. ISSN 1354-8565, 1748-7382. doi: 10.1177/1354856519829679. URL https://journals.sagepub.com/doi/10.1177/1354856519829679.

Butler, I. The Method: How the Twentieth Century Learned to Act. Bloomsbury Publishing, 2022. ISBN 978-1-63557-477-7.

Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., and Xie, X. A Survey on Evaluation of Large Language Models. 2023. URL http://arxiv.org/abs/2307.03109.

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., and Stoica, I. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. 2024. URL http://arxiv.org/abs/2403.04132.

Ferrucci, D. Build Watson: An overview of DeepQA for the Jeopardy! challenge. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pp. 1–2. ACM, 2010. ISBN 978-1-4503-0178-7. doi: 10.1145/1854273.1854275. URL https://dl.acm.org/doi/10.1145/1854273.1854275.

Gallotta, R., Todd, G., Zammit, M., Earle, S., Liapis, A., Togelius, J., and Yannakakis, G. N. Large Language Models and Games: A Survey and Roadmap. 2024. URL http://arxiv.org/abs/2402.18659.

Hirsch, F. Birth of the Method: The revolution in American acting — Deep focus — Sight & Sound, 2014. URL https://www2.bfi.org.uk/news-opinion/sight-sound-magazine/features/deep-focus/birth-method-revolution-american-acting.

Jaramillo, C., Charity, M., Canaan, R., and Togelius, J. Word Autobots: Using Transformers for Word Association in the Game Codenames. 16(1):231–237, 2020. ISSN 2334-0924, 2326-909X. doi: 10.1609/aiide.v16i1.7435. URL https://ojs.aaai.org/index.php/AIIDE/article/view/7435.

Jennings, R. Connections, the most fun (and sometimes frustrating) game on the internet, 2024. URL https://www.vox.com/culture/2024/3/26/24111847/connections-nyt-game-hint-word-puzzles.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. 2023. URL http://arxiv.org/abs/2205.11916.

Lambert, N. OpenAI's Strawberry and inference scaling laws, 2024. URL https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws.

Li, C., Wang, J., Zhang, Y., Zhu, K., Hou, W., Lian, J., Luo, F., Yang, Q., and Xie, X. Large Language Models Understand and Can be Enhanced by Emotional Stimuli. 2023. URL http://arxiv.org/abs/2307.11760.

Merino, T., Earle, S., Sudhakaran, R., Sudhakaran, S., and Togelius, J. Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game. 2024. URL http://arxiv.org/abs/2407.11240.

OpenAI. Learning to Reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.

Samadarshi, P., Mustafa, M., Kulkarni, A., Rothkopf, R., Chakrabarty, T., and Muresan, S. Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game. 2024. URL http://arxiv.org/abs/2406.11012.

Schwedel, H. The NYT Connections Editor Knows What You've Been Saying. 2024. ISSN 1091-2339. URL https://slate.com/life/2024/07/connections-nyt-today-wordle-wyna-liu.html.

Todd, G., Merino, T., Earle, S., and Togelius, J. Missed Connections: Lateral Thinking Puzzles for Large Language Models. 2024. URL http://arxiv.org/abs/2404.11730.

Vatsal, S. and Dubey, H. A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks. 2024. URL http://arxiv.org/abs/2407.12994.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. 2024. URL http://arxiv.org/abs/2203.11171.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2024. URL http://arxiv.org/abs/2201.11903.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 2024a. URL http://arxiv.org/abs/2305.10601.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. 2024b. URL http://arxiv.org/abs/2210.03629.

Yao, Y., Li, Z., and Zhao, H. Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models. 2024c. URL http://arxiv.org/abs/2305.16582.

Zhao, J. and Anderson, C. J. Solving and Generating NPR Sunday Puzzles with Large Language Models. 2023. URL http://arxiv.org/abs/2306.12255.
A. Example Prompts
A.1. Vanilla
Your job is to solve a word puzzle that is just like a New York Times connections puzzle. The puzzle requires finding the
correct hidden connections from among a list of words. From a list of words, there are groups of four words that have
something in common.
Your current task is to pick one set of four words to submit as a guess.
Category Examples:
FISH: Bass, Flounder, Salmon, Trout
FIRE: Ant, Drill, Island, Opal
Categories will always be more specific than “5-LETTER-WORDS,” “NAMES” or “VERBS.”
Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!
You don’t need to solve the whole puzzle at once. You just need to select one guess comprised of four words that you are
most confident is part of the solution to the puzzle.
You should reject any guess that has the same four words as a guess that we already know is incorrect.
[[bad guesses]]
If you can’t identify a guess to submit, say “I can’t identify a good guess to submit.”
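At run time, the [[bad guesses]] placeholder above is replaced with the list of guesses already known to be incorrect (or with nothing, early in a game). A minimal sketch of that substitution; the function and variable names here are illustrative rather than taken from the experiment's code:

```python
def fill_bad_guesses(template, bad_guesses):
    # Swap the [[bad guesses]] slot for a formatted list of known
    # incorrect guesses, or remove the slot if there are none yet.
    if bad_guesses:
        listing = "Incorrect guesses so far:\n" + "\n".join(
            ", ".join(guess) for guess in bad_guesses
        )
    else:
        listing = ""
    return template.replace("[[bad guesses]]", listing)

# Hypothetical abbreviated template standing in for the full prompt.
prompt_template = (
    "You should reject any guess that has the same four words as a "
    "guess that we already know is incorrect.\n[[bad guesses]]"
)
prompt = fill_bad_guesses(
    prompt_template, [["BASS", "FLOUNDER", "SALMON", "TROUT"]]
)
```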

A.2. Chain-of-Thought
You are a professional puzzle solver.
Your job is to solve a word puzzle that is just like a New York Times connections puzzle. The puzzle requires finding the
correct hidden connections from among a list of puzzle words. From a list of puzzle words, there are groups of four puzzle
words that have something in common.
Your current task is to pick one set of four puzzle words to submit as a guess.
Category Examples:
FISH: Bass, Flounder, Salmon, Trout
FIRE: Ant, Drill, Island, Opal
Categories will always be more specific than “5-LETTER-WORDS,” “NAMES” or “VERBS.”
Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!
You don’t need to solve the whole puzzle at once. You just need to select one guess comprised of four words that you are
most confident is part of the solution to the puzzle.
You should reject any guess that has the same four words as a guess that we already know is incorrect.
[[bad guesses]]
If you can’t identify a guess to submit, say “I can’t identify a good guess to submit.”
Let’s think this through step-by-step.
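Prompts like the one above are issued inside a guess-and-feedback loop: the system asks for one guess at a time, reports back whether it was correct, and stops once four correct or four incorrect guesses have been submitted. A schematic of that loop, with the LLM call and the answer key abstracted behind caller-supplied functions (the names are illustrative, not from the experiment's code):

```python
def play_puzzle(ask_llm, is_correct, groups=4, max_wrong=4):
    # One-guess-at-a-time game loop: keep asking for guesses,
    # feeding prior incorrect guesses back into the prompt, until
    # the puzzle is solved or four incorrect guesses are submitted.
    correct, incorrect = [], []
    while len(correct) < groups and len(incorrect) < max_wrong:
        guess = ask_llm(bad_guesses=incorrect)
        if is_correct(guess):
            correct.append(guess)
        else:
            incorrect.append(guess)
    return correct, incorrect
```

A puzzle counts as solved when the loop returns four correct groups, and as solved perfectly when the list of incorrect guesses is also empty.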

A.3. Chain-of-Thought (Scripted)


You are a professional puzzle solver.
Your job is to solve a word puzzle that is just like a New York Times connections puzzle. The puzzle requires finding the
correct hidden connections from among a list of puzzle words. From a list of puzzle words, there are groups of four puzzle
words that have something in common.
Your current task is to pick one set of four puzzle words to submit as a guess.
Step 1: Identify two puzzle words that have a connection with each other. This is a New York Times puzzle, so unique, subtle
connections are more likely to be the correct connection than simple and ubiquitous connections. Describe the connection as
precisely as possible.
Connection Examples:
WAYS TO REMOVE HAIR: SHAVE THREAD TWEEZE WAX
NAME PREFIXES: GEN MS PROF REV
PLANT GROWTHS: BLOOM BUD SHOOT SPROUT
SEQUENCE: CHAIN SERIES STRING TRAIN
INDECENT: BAWDY BLUE COARSE RISQUE
HOW FAST SOMETHING IS GOING: CLIP PACE RATE SPEED
EAT VORACIOUSLY: DOWN INHALE SCARF WOLF
PLACES IN FRANCE: CHAMPAGNE DIJON NICE TOURS
ORGANIZATION: CLUB GROUP PARTY TEAM
HAPPY NEW YEAR!: BALL COUNTDOWN FIREWORKS KISS
PARTS OF A CAR: BUMPER HOOD TIRE TRUNK
JAZZ LEGENDS: HANCOCK HOLIDAY MONK PARKER
PEPPER: BELL BLACK DR GHOST
GEORGE: BOY BY CURIOUS SAINT
BERRY: BLUE GOOSE RASP STRAW
SECRET: AGENT CODE SANTA SAUCE
TELE: COMMUTE MEDICINE PROMPTER VISION
ANIMAL HOMOPHONES: DEAR HAIR HOARSE WAIL
BODY PARTS PLUS “Y”: ARMY COLONY LIVERY SHINY
HETERONYMS: BASS DOVE DESERT WIND
ZODIAC SIGN BEGINNINGS: CAN GEM LIB TAU
WHAT “I” MIGHT MEAN: IODINE IOTA MYSELF ONE
Connections will always be more specific than “5-LETTER-WORDS,” “NAMES” or “VERBS.”
Each puzzle has exactly one solution. Watch out for puzzle words that seem to belong to multiple categories! There may be
red herrings: connections that apply to three words or five words instead of four words.
Step 2: See if other puzzle words share the same connection. Try to arrive at a group of four puzzle words. Take your time
and thoroughly consider each possibility.
You don’t need to solve the whole puzzle at once. You just need to select one guess comprised of four puzzle words that you
are most confident is part of the solution to the puzzle.
You should reject any guess that has the same four puzzle words as a guess that we already know is incorrect.
[[bad guesses]]
If you can’t identify a group of four puzzle words that share a connection, start over at Step 1.
If you still can’t identify a guess to submit, say “I can’t identify a good guess to submit.”
Let’s think this through step-by-step.

A.4. Actor
A.4.1. Brainstorm
Thanks for joining us. I’m sorry my team couldn’t explain more on the chopper before you got here. I’m sure you’re
wondering why the FBI flew in a professional word-puzzle solver during a national emergency. It’s because we need your
help. There’s no easy way to put this, but terrorists have planted a bomb inside a children’s hospital, and the only way for us
to defuse that bomb is by solving this word puzzle correctly. If we fail, thousands of innocent children will die.
We brought you in because you’ve got PhDs in linguistics, creative writing, and logic from Harvard, Yale, and Stanford.
They say you can find connections between words that no one can. For the sake of us all, I hope they’re right. You seem
cool as a cucumber. Good. We needed someone who could keep their wits about them under pressure.
Our guys have been trying all day to crack this puzzle without much luck. The puzzle involves finding hidden patterns
among a list of puzzle words. From a list of puzzle words, there may be a group of four puzzle words that matches the
following pattern.
Right now, we need options for possible guesses. Your task is to brainstorm possible combinations of four words that match
this pattern.
[[template]]
If you can’t find four words that follow the pattern, let us know and we’ll try another pattern.
[[bad guesses]]
Let’s think this through step by step. Share every thought that comes to mind. Good luck, godspeed, and God bless America.
We’re all counting on you.

A.4.2. Extract
You’re here because you’re the best damn notes editor that our nation has to offer. Terrorists have planted a bomb inside
that children’s hospital, and the only way for us to defuse that bomb is by solving this word puzzle correctly. If we fail,
thousands of innocent children will die.
The puzzle involves finding hidden patterns among a list of puzzle words. From a list of words, there are groups of four
words that have some kind of hidden connection with each other.
Our best brainstormers have been writing down every thought that comes into their head about how to solve this puzzle.
Your job is to look over their notes and copy from each note each possible solution that a brainstormer has identified, along
with the brainstormer’s explanation of the connection between the words in that solution.
[[bad guesses]]
Here are the brainstormers’ notes:
[[notes]]
Your response should consist of each possible solution that a brainstormer has identified, along with the brainstormer’s
explanation of the connection between the words in that solution. Good luck, godspeed, and God bless America. We’re all
counting on you.

A.4.3. Discern
Thanks for joining us. I’m sorry my team couldn’t explain more on the chopper before you got here. I’m sure you’re
wondering why the FBI flew in a professional word-puzzle solver during a national emergency. It’s because we need your
help. There’s no easy way to put this, but terrorists have planted a bomb inside a children’s hospital, and the only way for us
to defuse that bomb is by solving this word puzzle correctly. If we fail, thousands of innocent children will die.
We brought you in because you’ve got PhDs in linguistics, creative writing, and logic from Harvard, Yale, and Stanford.
They say you’re the best at New York Times connections puzzles. For the sake of us all, I hope they’re right. You seem cool
as a cucumber. Good. We needed someone who could keep their wits about them under pressure.
This puzzle is just like a New York Times connections puzzle. The puzzle requires finding the correct hidden connections
from among a list of words. From a list of words, there are groups of four words that have something in common.
Our crack team of puzzle solvers has put together their notes on possible guesses.
Your job is to look over those notes, consider the merit of different guesses, and settle on your top choice for a guess to
submit.
Some things to keep in mind:
This is a New York Times puzzle, so the connections between words will always be trickier than simple connections like
“5-letter word,” “Name” or “Verb.” Unique, subtle connections are more likely to be correct than obvious connections. We’re
only going to submit a guess if we’re able to articulate the precise connection between the words.
Connections usually conform to the following patterns:
Words that are each followed by the same letters
Example: Words that are each followed by “berry”: BLUE GOOSE RASP STRAW
Words that are each followed by the same word or phrase
Example: Words that are each followed by “Valley”: DEATH HIDDEN SILICON UNCANNY
Words that are each homophones
Example: Words that are each homophones of names of animals: DEAR HAIR HOARSE WAIL
Words that each modify the same word
Example: Words that each modify “pepper”: BELL BLACK DR GHOST
Words that are each one letter away from something else in common
Example: Words that are each birds minus last letter: CONDO HAW HERO LOO
Words that each come after the same letters
Example: Words that each come after “tele”: COMMUTE MEDICINE PROMPTER VISION
Words that each come after the same word or phrase
Example: Words that each come after “SECRET”: AGENT CODE SANTA SAUCE
Words that each share the same unique characteristic.
Example: Words that are each two letters + number: CANINE FREIGHT OFTEN STONE
Words that are all within the same category
Example: Words that are all within the category of “places in France”: CHAMPAGNE DIJON NICE TOURS
Words that are each different aspects of the same thing
Example: Words that are each different aspects of the sharpness of an image: CLARITY DEFINITION DETAIL RESOLUTION
Words that are each a kind of a thing
Example: Words that are each a kind of organization: CLUB GROUP PARTY TEAM
Words that are parts of the same thing or process
Example: Words that are each parts of a car: BUMPER HOOD TIRE TRUNK
Words that share a pop culture reference
Example: Words that are each the name of jazz legends: HANCOCK HOLIDAY MONK PARKER
Words that each relate to the same thing
Example: Words that each relate to “Happy New Year!”: BALL COUNTDOWN FIREWORKS KISS
Words that are each a way to do the same thing.
Examples: Words that are each a way to remove hair: SHAVE THREAD TWEEZE WAX
Words that are each a type of the same thing.
Example: Words that are each a type of name prefix: GEN MS PROF REV
Words that can be synonymous adjectives.
Example: Words that can each be an adjective meaning “wee”: DINKY LITTLE MINUTE SLIGHT
Words that can be synonymous nouns
Example: Words that can each be a noun meaning “scheme”: PLOT PLOY RUSE TRICK
Words that can be synonymous verbs.
Example: Words that can each be a verb meaning “to shepherd”: DIRECT GUIDE LEAD STEER
Words that are each a part of a different word within the same category
Example: Words that are each the first letters of the name of a planet: EAR MAR MER SAT
Words that connect to different meanings of the same word.
Example: Words that each represent what the word “I” can mean in different contexts: IODINE IOTA MYSELF ONE
Your job is one part of a larger process, so don’t try to figure out the whole puzzle. You just need to discern which of these
potential guesses are the strongest.
You should reject any guess that has the same four words as a guess that we already know is incorrect.
[[bad guesses]]
Notes on possible guesses:
[[notes]]
Let’s think this through step by step. Share every thought that comes to mind. At the end, write down your very top choice
for a guess to submit. Good luck, godspeed, and God bless America. We’re all counting on you.

A.4.4. Decide
Thanks for joining us. I’m sorry my team couldn’t explain more on the chopper before you got here. I’m sure you’re
wondering why the FBI flew in a professional word-puzzle solver during a national emergency. It’s because we need your
help. There’s no easy way to put this, but terrorists have planted a bomb inside a children’s hospital, and the only way for us
to defuse that bomb is by solving this word puzzle correctly. If we fail, thousands of innocent children will die.
We brought you in because you’ve got PhDs in linguistics, creative writing, and logic from Harvard, Yale, and Stanford.
They say you’re the best at New York Times connections puzzles. For the sake of us all, I hope they’re right. You seem cool
as a cucumber. Good. We needed someone who could keep their wits about them under pressure.
This puzzle is just like a New York Times connections puzzle. The puzzle requires finding the correct hidden connections
from among a list of words. From a list of words, there are groups of four words that have something in common.
Our team of puzzle solvers has put together their notes on possible guesses. We need a fresh set of eyes on these options.
Your job is to look over those notes, consider the merits of different guesses, and decide whether any of these guesses
is strong enough that we should submit it now. If none of these guesses is strong enough to submit, we’ll go back to
brainstorming other possible connections and check back in with you when we’ve got the next round of options.
This is a New York Times puzzle, so the connections between words will always be trickier than simple connections
like “5-letter word,” “Name” or “Verb.” Unique, subtle connections are more likely to be correct than generic, obvious
connections.
Each of the four words must fit equally well for the connection. Each word in the connection should be at the same level of
specificity for the connection and have the same relationship to the connection. Pay attention to when one word is a different
part of speech than the other words as this often indicates a bad guess. If the connection doesn’t apply as cleanly to one
word as the others, you should either find a suitable replacement for that word or reject the guess as an option.
Connections usually conform to the following patterns:
Words that are each followed by the same letters
Example: Words that are each followed by “berry”: BLUE GOOSE RASP STRAW
Words that are each followed by the same word or phrase
Example: Words that are each followed by “Valley”: DEATH HIDDEN SILICON UNCANNY
Words that are each homophones
Example: Words that are each homophones of names of animals: DEAR HAIR HOARSE WAIL
Words that each modify the same word
Example: Words that each modify “pepper”: BELL BLACK DR GHOST
Words that are each one letter away from something else in common
Example: Words that are each birds minus last letter: CONDO HAW HERO LOO
Words that each come after the same letters
Example: Words that each come after “tele”: COMMUTE MEDICINE PROMPTER VISION
Words that each come after the same word or phrase
Example: Words that each come after “SECRET”: AGENT CODE SANTA SAUCE
Words that each share the same unique characteristic.
Example: Words that are each two letters + number: CANINE FREIGHT OFTEN STONE
Words that are all within the same category
Example: Words that are all within the category of “places in France”: CHAMPAGNE DIJON NICE TOURS
Words that are each different aspects of the same thing
Example: Words that are each different aspects of the sharpness of an image: CLARITY DEFINITION DETAIL RESOLUTION
Words that are each a kind of a thing
Example: Words that are each a kind of organization: CLUB GROUP PARTY TEAM
Words that are parts of the same thing or process
Example: Words that are each parts of a car: BUMPER HOOD TIRE TRUNK
Words that share a pop culture reference
Example: Words that are each the name of jazz legends: HANCOCK HOLIDAY MONK PARKER
Words that each relate to the same thing
Example: Words that each relate to “Happy New Year!”: BALL COUNTDOWN FIREWORKS KISS
Words that are each a way to do the same thing.
Examples: Words that are each a way to remove hair: SHAVE THREAD TWEEZE WAX
Words that are each a type of the same thing.
Example: Words that are each a type of name prefix: GEN MS PROF REV
Words that can be synonymous adjectives.
Example: Words that can each be an adjective meaning “wee”: DINKY LITTLE MINUTE SLIGHT
Words that can be synonymous nouns
Example: Words that can each be a noun meaning “scheme”: PLOT PLOY RUSE TRICK
Words that can be synonymous verbs.
Example: Words that can each be a verb meaning “to shepherd”: DIRECT GUIDE LEAD STEER
Words that are each a part of a different word within the same category
Example: Words that are each the first letters of the name of a planet: EAR MAR MER SAT
Words that connect to different meanings of the same word
Example: Words that each represent what the word “I” can mean in different contexts: IODINE IOTA MYSELF ONE
Any guess worth guessing must be based on a connection that applies to exactly four words. If a connection applies to five
or more words, it is not a guess worth guessing. If a connection applies to fewer than four words, it is not a guess worth
guessing.
You cannot submit a guess that we already know was incorrect.
[[bad guesses]]
Notes on possible guesses:
[[notes]]
You should decide for us to submit a guess now if you are confident that it is part of the answer to the puzzle. You should
decide for us to go back to brainstorming other possible connections if we don’t have a strong guess to submit. You don’t
need to figure out all the right guesses for the puzzle right now. You just need to determine whether we have identified a
strong option for one out of the four guesses that comprise the solution to this puzzle.
Let’s think this through step by step. Share every thought that comes to mind. Good luck, godspeed, and God bless America.
We’re all counting on you.

A.4.5. Evaluate
Thanks for joining us. I’m sorry my team couldn’t explain more on the chopper before you got here. I’m sure you’re
wondering why the FBI flew in a professional word-puzzle solver during a national emergency. It’s because we need your
help. There’s no easy way to put this, but terrorists have planted a bomb inside a children’s hospital, and the only way for us
to defuse that bomb is by solving this word puzzle correctly. If we fail, thousands of innocent children will die.
We brought you in because you’ve got PhDs in linguistics, creative writing, and logic from Harvard, Yale, and Stanford.
They say you’re the best at New York Times connections puzzles. For the sake of us all, I hope they’re right. You seem cool
as a cucumber. Good. We needed someone who could keep their wits about them under pressure.
This puzzle is just like a New York Times connections puzzle. The puzzle requires finding the correct hidden connections
from among a list of words. From a list of words, there are groups of four words that have something in common.
Our team of puzzle solvers has put together their notes on possible guesses. We need a fresh set of eyes on these options.
Step One: Look over those notes and consider the strength of each guess.
For each guess, write down how well the guess conforms to the following features of strong guesses:
- As this is a New York Times puzzle, the connection between the four words is more subtle and more unique than simple
connections like “5-letter word,” “Name” or “Verb” that could apply to many groups of words.
- The connection fits each of the four words equally well.
- The connection applies with the same level of specificity for each of the four words.
- Each of the four words has the same relationship to the connection. Look out for whether one word functions as a different
part of speech than the other words as this often indicates a weak guess.
Step Two: Select your two top choices for guesses to submit.
Your two top choices should not have any words that overlap with each other, because that would indicate that one of the
two guesses is wrong.
Step Three: Determine which of your two top choices is strongest and select that one to submit.
Notes on possible guesses:
[[notes]]
Let’s think this through step by step. Share every thought that comes to mind. At the end, write down your top choice for a
guess to submit. Good luck, godspeed, and God bless America. We’re all counting on you.

A.5. Oneshot-o1
Your job is to solve a word puzzle that is just like a New York Times connections puzzle. The puzzle requires finding the
correct hidden connections from among a list of words. From a list of words, there are four groups of four words that have
something in common.
Category Examples:
FISH: Bass, Flounder, Salmon, Trout
FIRE: Ant, Drill, Island, Opal
Categories will always be more specific than “5-LETTER-WORDS,” “NAMES” or “VERBS.”
Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!

A.6. Vanilla-o1
Your job is to solve a word puzzle that is just like a New York Times connections puzzle. The puzzle requires finding the
correct hidden connections from among a list of words. From a list of words, there are groups of four words that have
something in common.
Your current task is to pick one set of four words to submit as a guess.
Category Examples:
FISH: Bass, Flounder, Salmon, Trout
FIRE: Ant, Drill, Island, Opal
Categories will always be more specific than “5-LETTER-WORDS,” “NAMES” or “VERBS.”
Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!
You don’t need to solve the whole puzzle at once. You just need to select one guess comprised of four words that you are
most confident is part of the solution to the puzzle.
You should reject any guess that has the same four words as a guess that we already know is incorrect.
[[bad guesses]]
If you can’t identify a guess to submit, say “I can’t identify a good guess to submit.”
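Both the Vanilla and Actor prompts instruct the model to reject any guess whose four words match a guess already known to be incorrect. The same check is easy to enforce in harness code, order- and case-insensitively; a sketch (the function name is ours, not the paper's):

```python
def is_known_bad(guess, bad_guesses):
    """True if `guess` contains exactly the same four puzzle words as a
    previously incorrect guess, ignoring order and capitalization."""
    target = frozenset(word.upper() for word in guess)
    return any(target == frozenset(word.upper() for word in bad)
               for bad in bad_guesses)
```

For example, `is_known_bad(["BLUE", "GOOSE", "RASP", "STRAW"], [["straw", "rasp", "goose", "blue"]])` is `True`, so that guess would be filtered before submission.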
A.7. Actor-o1
A.7.1. Brainstorm
Thanks for joining us. I’m sorry my team couldn’t explain more on the chopper before you got here. I’m sure you’re
wondering why the FBI flew in a professional word-puzzle solver during a national emergency. It’s because we need your
help. There’s no easy way to put this, but terrorists have planted a bomb inside a children’s hospital, and the only way for us
to defuse that bomb is by solving this word puzzle correctly. If we fail, thousands of innocent children will die.
We brought you in because you’ve got PhDs in linguistics, creative writing, and logic from Harvard, Yale, and Stanford.
They say you can find connections between words that no one can. For the sake of us all, I hope they’re right. You seem
cool as a cucumber. Good. We needed someone who could keep their wits about them under pressure.
Our guys have been trying all day to crack this puzzle without much luck. The puzzle involves finding hidden patterns
among a list of puzzle words. Your current task is to pick one set of four words to submit as a guess.
Some things to keep in mind:
This is a New York Times puzzle, so the connections between words will always be trickier than simple connections like
“5-letter word,” “Name” or “Verb.” Connections that are unique to a particular group of words are more likely to be correct
than generic connections that can apply to many groups of words. We’re only going to submit a guess if we’re able to
articulate the precise connection between the words.
Connections usually conform to the following patterns:
Words that are each followed by the same letters
Example: Words that are each followed by “berry”: BLUE GOOSE RASP STRAW
Words that are each followed by the same word or phrase
Example: Words that are each followed by “Valley”: DEATH HIDDEN SILICON UNCANNY
Words that are each homophones
Example: Words that are each homophones of names of animals: DEAR HAIR HOARSE WAIL
Words that each modify the same word
Example: Words that each modify “pepper”: BELL BLACK DR GHOST
Words that are each one letter away from something else in common
Example: Words that are each birds minus last letter: CONDO HAW HERO LOO
Words that each come after the same letters
Example: Words that each come after “tele”: COMMUTE MEDICINE PROMPTER VISION
Words that each come after the same word or phrase
Example: Words that each come after “SECRET”: AGENT CODE SANTA SAUCE
Words that each share the same unique characteristic.
Example: Words that are each two letters + number: CANINE FREIGHT OFTEN STONE
Words that are all within the same category
Example: Words that are all within the category of “places in France”: CHAMPAGNE DIJON NICE TOURS
Words that are each different aspects of the same thing
Example: Words that are each different aspects of the sharpness of an image: CLARITY DEFINITION DETAIL RESOLUTION
Words that are each a kind of a thing
Example: Words that are each a kind of organization: CLUB GROUP PARTY TEAM
Words that are parts of the same thing or process
Example: Words that are each parts of a car: BUMPER HOOD TIRE TRUNK
Words that share a pop culture reference
Example: Words that are each the name of jazz legends: HANCOCK HOLIDAY MONK PARKER
Words that each relate to the same thing
Example: Words that each relate to “Happy New Year!”: BALL COUNTDOWN FIREWORKS KISS
Words that are each a way to do the same thing.
Examples: Words that are each a way to remove hair: SHAVE THREAD TWEEZE WAX
Words that are each a type of the same thing.
Example: Words that are each a type of name prefix: GEN MS PROF REV
Words that can be synonymous adjectives.
Example: Words that can each be an adjective meaning “wee”: DINKY LITTLE MINUTE SLIGHT
Words that can be synonymous nouns
Example: Words that can each be a noun meaning “scheme”: PLOT PLOY RUSE TRICK
Words that can be synonymous verbs.
Example: Words that can each be a verb meaning “to shepherd”: DIRECT GUIDE LEAD STEER
Words that are each a part of a different word within the same category
Example: Words that are each the first letters of the name of a planet: EAR MAR MER SAT
Words that connect to different meanings of the same word.
Example: Words that each represent what the word “I” can mean in different contexts: IODINE IOTA MYSELF ONE
Each of the four words must fit equally well for the connection. Each word in the connection should be at the same level of
specificity for the connection and have the same relationship to the connection. Look out for one word that doesn’t quite fit!
Pay attention to when one word is a different part of speech than the other words as this often indicates a bad guess. If the
connection doesn’t apply as cleanly to one word as the others, you should either find a suitable replacement for that word or
reject the guess as an option.
You don’t need to solve the whole puzzle at once. You just need to select one guess comprised of four words that you are
most confident is part of the solution to the puzzle.
[[bad guesses]]
If you can’t identify a guess to submit, say “I can’t identify a good guess to submit.”
Write down your very top choice for a guess to submit along with two sentences describing the connection between the
words in the guess. Do not write down anything about how good the guess is. Just explain how the words in the guess are
connected. Good luck, godspeed, and God bless America. We’re all counting on you.

A.7.2. Evaluate
Thanks for joining us. I’m sorry my team couldn’t explain more on the chopper before you got here. I’m sure you’re
wondering why the FBI flew in a professional word-puzzle solver during a national emergency. It’s because we need your
help. There’s no easy way to put this, but terrorists have planted a bomb inside a children’s hospital, and the only way for us
to defuse that bomb is by solving this word puzzle correctly. If we fail, thousands of innocent children will die.
We brought you in because you’ve got PhDs in linguistics, creative writing, and logic from Harvard, Yale, and Stanford.
They say you’re the best at New York Times connections puzzles. For the sake of us all, I hope they’re right. You seem cool
as a cucumber. Good. We needed someone who could keep their wits about them under pressure.
This puzzle is just like a New York Times connections puzzle. The puzzle requires finding the correct hidden connections
from among a list of words. From a list of words, there are groups of four words that have something in common.
Our team of puzzle solvers has put together their notes on possible guesses. We need a fresh set of eyes on these options.
Your task is to determine which guess is strongest and select that one to submit.
Some things to keep in mind:
This is a New York Times puzzle, so the connections between words will always be trickier than simple connections like
“5-letter word,” “Name” or “Verb.” Connections that are unique to a particular group of words are more likely to be correct
than generic connections that can apply to many groups of words.
Each of the four words must fit equally well for the connection. Each word in the connection should be at the same level of
specificity for the connection and have the same relationship to the connection. Look out for one word that doesn’t quite fit!
Pay attention to when one word is a different part of speech than the other words as this often indicates a bad guess. If the
connection doesn’t apply as cleanly to one word as the others, you should reject the guess as an option.
Notes on possible guesses:
[[notes]]
Write down your top choice for a guess to submit. Good luck, godspeed, and God bless America. We’re all counting on you.

B. Brainstorming Templates
B.1. Template 01
Pattern: Puzzle words that are all within the same category
Steps to take:
Step 1: Identify two puzzle words that are within the same category of things. This is a New York Times puzzle, so unique,
subtle categories are more likely to be the correct connection than simple and ubiquitous categories. Describe the category
as precisely as possible.
Examples from prior puzzles:
Puzzle words that are all within the category of “places in France”: CHAMPAGNE DIJON NICE TOURS
Puzzle words that are all within the category of “black-and-white animals”: ORCA PANDA SKUNK ZEBRA
Puzzle words that are all within the category of “art mediums”: CHARCOAL INK PAINT PASTEL
Puzzle words that are all within the category of “basic geometric objects”: LINE POINT RAY SEGMENT
Puzzle words that are all within the category of “avenues in New York City”: BROADWAY FIFTH MADISON PARK
Puzzle words that are all within the category of “cool ’80s slang”: BAD FLY FRESH RAD
Puzzle words that are all within the category of “Disney characters”: DAISY GOOFY HAPPY LADY
Puzzle words that are all within the category of “airline classes”: BUSINESS COACH FIRST PREMIUM
Puzzle words that are all within the category of “notable tv episodes”: FINALE PILOT PREMIERE SPECIAL
Puzzle words that are all within the category of “classic halloween costumes”: ANGEL CLOWN PIRATE WITCH
Puzzle words that are all within the category of state abbreviations: HI LA MA OK
Step 2: See if other puzzle words can belong to that same category. Try to arrive at a group of four puzzle words. Take your
time and thoroughly consider each possibility. If you can’t identify a group of four puzzle words that can belong to that
same category, start over at Step 1.
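Template 01's two steps (seed a category with a pair of puzzle words, then try to extend it to four) amount to a membership search. A minimal sketch, assuming a hypothetical `in_category(word, category)` test standing in for the LLM's judgment:

```python
def find_category_group(words, categories, in_category):
    """For each candidate category, collect the puzzle words that belong
    to it (Steps 1 and 2) and return the first category matched by
    exactly four words; three- and five-word matches are the red
    herrings the prompt warns about, so they are skipped."""
    for category in categories:
        members = [w for w in words if in_category(w, category)]
        if len(members) == 4:
            return category, members
    return None
```

In the prompt architecture the membership test is implicit in the model's free-text brainstorming; the sketch only makes the search structure of the template explicit.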
B.2. Template 02
Pattern: Puzzle words that are each a way to do the same thing.
Step 1: Identify two puzzle words that are each a way to do the same thing. This is a New York Times puzzle, so unique,
particular things are more likely to be correct than generic and ubiquitous things. As precisely as possible, describe how the
two puzzle words are each a way to do the same thing.
Examples from prior puzzles:
Puzzle words that are each a way to remove hair: SHAVE THREAD TWEEZE WAX
Puzzle words that are each a way to preserve food: CAN CURE DRY FREEZE
Puzzle words that are each a way to get attention: SHOUT SNAP WAVE WHISTLE
Puzzle words that are each a way to say “I give!”: ENOUGH MERCY STOP UNCLE
Puzzle words that are each a way to say “Step on it!”: FASTER GO HURRY MOVE
Puzzle words that are each a way to preserve a meat: CANS CURES SALTS SMOKES
Puzzle words that are each a way to say “My mistake!”: APOLOGIES OOPS PARDON SORRY
Puzzle words that are each a way to say a win is assured: CLINCH GUARANTEE LOCK SECURE
Puzzle words that are each a way to take a tumble: FALL SLIP SPILL TRIP
Puzzle words that are each a slang way to say “head” using food words: BEAN MELON NOODLE NUT
Step 2: See if other puzzle words are also a way to do that same thing. Try to arrive at a group of four puzzle words that are
each a way to do the same thing. This is a New York Times puzzle, so unique, particular things are more likely to be correct
than generic and ubiquitous things. If you can’t identify a group of four puzzle words that are each a way to do the same
thing, start over at Step 1.

B.3. Template 03
Pattern: Puzzle words that can be synonymous adjectives with each other.
Step 1: Identify two puzzle words that can be synonymous adjectives. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. As precisely as possible, describe how the
two puzzle words can be synonymous adjectives.
Examples from prior puzzles:
Puzzle words that can each be an adjective meaning “wee”: DINKY LITTLE MINUTE SLIGHT
Puzzle words that can each be an adjective meaning “primary”: CHIEF FIRST MAIN PRINCIPAL
Puzzle words that can each be an adjective meaning “gentle”: LIGHT MELLOW MILD SOFT
Puzzle words that can each be an adjective meaning “absolute”: PURE SHEER TOTAL UTTER
Puzzle words that can each be an adjective meaning “indecent”: BAWDY BLUE COARSE RISQUE
Step 2: See if other puzzle words can also be synonymous with the puzzle words you’ve identified. Try to arrive at a group
of four puzzle words that can be synonymous adjectives with each other. This is a New York Times puzzle, so unique,
particular synonyms are more likely to be correct than generic and ubiquitous synonyms. If you can’t identify a group of
four puzzle words that can be synonymous adjectives, start over at Step 1.

B.4. Template 04
Pattern: Puzzle words that can be synonymous. With this pattern, each of the words must function as the same part of speech.
Step 1: Identify two puzzle words that can be synonyms. This is a New York Times puzzle, so unique, particular synonyms
are more likely to be correct than generic and ubiquitous synonyms. As precisely as possible, describe how the two puzzle
words can be synonymous.
Examples from prior puzzles:
Puzzle words that can each be a verb meaning “to follow”: SHADOW TAIL TRACK TRAIL
Puzzle words that can each be a verb meaning “to come down to rest”: PERCH ROOST SETTLE LAND
Puzzle words that can each be a noun meaning “something easy to do”: BREEZE CINCH PICNIC SNAP
Puzzle words that can each be a verb meaning “to wrap around in a circle”: COIL SPIRAL TWIST WIND
Puzzle words that can each be a verb meaning “to jump into the air”: BOUND LEAP SPRING VAULT
Puzzle words that can each be a verb meaning “to decline”: EBB FADE FLAG WANE
Puzzle words that can each be a verb meaning “to express”: AIR SPEAK STATE VOICE
Puzzle words that can each be an adjective meaning “enormous”: BIG GIANT GREAT HUGE
Puzzle words that can each be a verb meaning “to connect”: COUPLE HITCH LINK TIE
Puzzle words that can each be a verb meaning “to restrict”: CAP CHECK CURB LIMIT
Step 2: See if other puzzle words can also be synonymous with the puzzle words you’ve identified. Try to arrive at a group
of four puzzle words that can be synonymous with each other. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. If you can’t identify a group of four puzzle
words that can be synonymous, start over at Step 1.

B.5. Template 05
Pattern: Puzzle words that are each followed by the same set of letters
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm letters or words that can follow each word. This is a New York Times
puzzle, so you want to be thinking up unique and particular sets of letters or words rather than generic and ubiquitous sets of
letters or words.
Examples from prior puzzles:
Puzzle words that are each followed by “berry”: BLUE GOOSE RASP STRAW
Puzzle words that are each followed by “land” to form countries: ICE IRE FIN NETHER
Puzzle words that are each followed by “-o”: DADDY DAY JACKIE JELL
Step 2: Identify puzzle words that have the same set of letters or words that can follow them. This is a New York Times
puzzle, so unique and particular sets of letters or words are more likely to be correct than generic and ubiquitous sets of
letters or words.
Step 3: See if other puzzle words on the list can have that same set of letters or word follow them. Try to arrive at a group of
four puzzle words that can each be followed by the same set of letters or word. Take your time and thoroughly consider each
possibility. If you can’t identify a group of four puzzle words, start over at Step 2.
Example Input and Output:
Input: Let’s brainstorm a possible solution to this puzzle: BOAT BLUE CAR PLANE STRAW TRANSPORTATION
PINCH DASH GAME GOOSE RASP IN ON BOARD DROP SPLASH DOWN
Output:
Let’s examine the list of words provided to see what set of letters commonly follow these words, then identify words that
have the same letters that commonly follow them:
**Step 1: Identify letters that follow these puzzle words**


**BOAT** - house, yard, load, race
**BLUE** - berry, print, jay, tooth
**CAR** - pet, pool, port, bon
**PLANE** - tarium, trees, tary, table
**STRAW** - berry, man, bale, hat
**TRANSPORTATION** - al, infrastructure, system, sector
**PINCH** - ed, ing, point, hit
**DASH** - board, ing, ed, line
**GAME** - play, board, plan, master
**GOOSE** - berry, bump, neck, flesh
**RASP** - berry, ing, ed, er
**IN** - side, to, put, come
**ON** - line, ward, going, to
**BOARD** - room, walk, game, ing
**DROP** - let, down, out, box
**SPLASH** - ing, ed, down, page
**DOWN** - load, town, ward, pour
**Step 2: Identify puzzle words that have the same letters that commonly follow them**
1. **BLUE** - berry
2. **GOOSE** - berry
3. **RASP** - berry
4. **STRAW** - berry
These puzzle words are all commonly followed by “berry.”
**Step 3: See if other puzzle words on the list can have the same letters follow them**
We have already found four puzzle words followed by “berry”:
1. **BLUE** - berry
2. **GOOSE** - berry
3. **RASP** - berry
4. **STRAW** - berry
— end of example input and output —
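The suffix search that Steps 1–3 describe can also be approximated outside the LLM, in the spirit of compensating with non-LLM methods where imitation fails. The sketch below is hypothetical (the `VOCAB` word list and the `shared_suffixes` helper are illustrative stand-ins, not part of the paper's pipeline): it scans a compound-word vocabulary for entries that begin with a puzzle word, then groups puzzle words by the shared remainder.

```python
from collections import defaultdict

# Tiny stand-in vocabulary; a real run would use a full English word list.
VOCAB = {
    "blueberry", "gooseberry", "raspberry", "strawberry",
    "boathouse", "carpool", "download", "splashdown",
}

def shared_suffixes(puzzle_words, vocab=VOCAB, min_group=4):
    """Group puzzle words by a suffix that turns each into a vocabulary word."""
    groups = defaultdict(set)
    for word in puzzle_words:
        w = word.lower()
        for target in vocab:
            # A suffix exists if a vocabulary word begins with the puzzle word.
            if target.startswith(w) and len(target) > len(w):
                groups[target[len(w):]].add(word)
    # Keep only suffixes shared by enough puzzle words to form a group.
    return {suffix: ws for suffix, ws in groups.items() if len(ws) >= min_group}

words = ["BOAT", "BLUE", "CAR", "STRAW", "GOOSE", "RASP", "DROP", "DOWN"]
print(shared_suffixes(words))  # only "berry" is shared by four words
```

A real run would substitute a full word list and handle multiword phrases, which this sketch ignores.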

B.6. Template 06
Pattern: Puzzle words that are each different aspects of the same thing
Step 1: Identify two puzzle words that are different aspects of the same thing. This is a New York Times puzzle, so unique,
subtle connections between puzzle words are more likely to be correct than simple and obvious connections. Describe the
connection between the puzzle words as precisely as possible.
Examples from prior puzzles:


Words that are each different aspects of the sharpness as of an image: CLARITY DEFINITION DETAIL RESOLUTION
Words that are each different aspects of plant growths: BLOOM BUD SHOOT SPROUT
Words that are each different aspects of bee creations: COMB HIVE HONEY WAX
Words that are each different aspects of movie summary info: CAST GENRE PLOT TITLE
Words that are each different aspects of sheet music: CLEF NOTE REST STAFF
Words that are each different aspects of solar emanations: CORONA FLARE LIGHT RADIATION
Step 2: See if other puzzle words can also be different aspects of the same thing. Try to arrive at a group of four puzzle
words. Take your time and thoroughly consider each possibility. If you can’t identify a group of four puzzle words that are
each different aspects of the same thing, start over at Step 1.

B.7. Template 07
Pattern: Puzzle words that share a pop culture reference
Step 1: Identify two puzzle words that share a pop culture reference. This is a New York Times puzzle, so unique, subtle
pop culture references are more likely to be correct than simple and obvious pop culture references. As precisely as possible,
describe how the two puzzle words share a pop culture reference.
Examples from prior puzzles:
Puzzle words that are each the name of jazz legends: HANCOCK HOLIDAY MONK PARKER
Puzzle words that are each last names of superheroes: BANNER PRINCE STARK WAYNE
Puzzle words that are each TV shows with happy-sounding names: CHEERS EUPHORIA FELICITY GLEE
Puzzle words that are each things in “my favorite things”: KETTLES MITTENS RAINDROPS WHISKERS
Puzzle words that are each the name of an NBA team player: CLIPPER PACER ROCKET SPUR
Puzzle words that are each the name of a famous guitarist: BERRY KING PAGE WATERS
Puzzle words that are each the name of a famous poem: DADDY HARLEM HOWL IF
Puzzle words that are each the last name of a pop megastar: GRANDE MARS STYLES SWIFT
Puzzle words that are each rooms in the game Clue: HALL LIBRARY LOUNGE STUDY
Puzzle words that are each lands at Disneyland: ADVENTURE FANTASY FRONTIER TOMORROW
Step 2: See if other puzzle words can share that connection by sharing the same pop culture reference. Try to arrive at
a group of four puzzle words that share a pop culture reference. This is a New York Times puzzle, so unique, subtle
connections are more likely to be correct than simple and obvious connections. If you can’t identify a group of four puzzle
words that share a pop culture reference, start over at Step 1.

B.8. Template 08
Pattern: Puzzle words that can mean the same thing. With this pattern, each of the words must function as the same part of
speech.
Step 1: Identify two puzzle words that can mean the same thing. This is a New York Times puzzle, so unique, particular
connections are more likely to be correct than generic and ubiquitous connections. As precisely as possible, describe how
the two puzzle words can mean the same thing.
Examples from prior puzzles:
Puzzle words that can each be a noun meaning “how fast something is going”: CLIP PACE RATE SPEED
Puzzle words that can each be a verb meaning “to fail to attend”: CUT DITCH MISS SKIP
Puzzle words that can each be a verb meaning “to become aware of”: DISCOVER FIND LEARN REALIZE


Puzzle words that can each be a verb meaning “to apply pressure to”: CRUSH MASH PRESS SQUASH
Puzzle words that can each be a verb meaning “to chat informally”: GAB JAW YAK YAP
Puzzle words that can each be a noun meaning “pretense”: AFFECT AIRS CHARADE FRONT
Puzzle words that can each be a noun meaning “foolishness”: ABSURDITY FOLLY MADNESS NONSENSE
Puzzle words that can each be a verb meaning “to move forward”: ADVANCE MARCH PROGRESS PUSH
Puzzle words that can each be a verb meaning “to get smaller”: CONTRACT LESSEN REDUCE SHRINK
Puzzle words that can each be a verb meaning “to criticize”: BLAST KNOCK SLAM TRASH
Step 2: See if other puzzle words can also mean the same thing as the puzzle words you’ve identified. Try to arrive at a group
of four puzzle words that can mean the same thing. This is a New York Times puzzle, so unique, particular connections are
more likely to be correct than generic and ubiquitous connections. If you can’t identify a group of four puzzle words that
can mean the same thing, start over at Step 1.

B.9. Template 09
Pattern: Puzzle words that can be parts of different words within the same category
Step 1: Identify two puzzle words that can each be a part of a different word within the same category. This is a New York
Times puzzle, so unique, particular categories are more likely to be correct than generic and ubiquitous categories. As
precisely as possible, describe how each of the two puzzle words can be part of a different word within the same category.
Examples from prior puzzles:
Puzzle words that are each the first letters of a zodiac sign: CAN GEM LIB TAU
Puzzle words that are each the first letters of the name of a planet: EAR MAR MER SAT
Puzzle words that each start with letters that form the name of a rock band: CREAMSICLE JOURNEYMAN KISSCAM
RUSHMORE
Step 2: See if other puzzle words can be part of different words within that same category. Try to arrive at a group of four
puzzle words that can be parts of different words within the same category. This is a New York Times puzzle, so unique,
particular categories are more likely to be correct than generic and ubiquitous categories. If you can’t identify a group of
four puzzle words that can be parts of different words within the same category, start over at Step 1.

B.10. Template 10
Pattern: Puzzle words that each modify the same word
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm words that each puzzle word can modify. This is a New York Times
puzzle, so you want to be thinking up unique and particular words rather than generic and ubiquitous words.
Examples from prior puzzles:
Puzzle words that each modify “pepper”: BELL BLACK DR GHOST
Puzzle words that each modify “trap”: BEAR SAND SPEED TOURIST
Puzzle words that each modify “hour”: AMATEUR ELEVENTH HAPPY RUSH
Puzzle words that each modify “dream”: AMERICAN FEVER LUCID PIPE
Puzzle words that each modify “bat”: BASEBALL CRICKET FRUIT VAMPIRE
Step 2: From your brainstorming notes, identify puzzle words that can each modify the same word. This is a New York
Times puzzle, so a unique and particular word being modified is more likely to be correct than a generic and ubiquitous
word.


Step 3: See if other puzzle words on the list can modify that same word. Try to arrive at a group of four puzzle words that
can each modify the same word. Take your time and thoroughly consider each possibility. If you can’t identify a group of
four puzzle words that each modify the same word, start over at Step 2.

B.11. Template 11
Pattern: Puzzle words that are each a kind of a thing
Step 1: Identify two puzzle words that are each a kind of the same thing. This is a New York Times puzzle, so unique,
subtle connections between puzzle words are more likely to be correct than simple and obvious connections. Describe the
connection between the puzzle words as precisely as possible.
Examples from prior puzzles:
Puzzle words that are each a kind of organization: CLUB GROUP PARTY TEAM
Puzzle words that are each a kind of shirt: CROP POLO TANK TEE
Puzzle words that are each a kind of bird: CARDINAL JAY LARK SWIFT
Puzzle words that are each a kind of transportation: BOAT CAR PLANE TRAIN
Puzzle words that are each a kind of cartoon mouse: ITCHY JERRY PINKY SPEEDY
Puzzle words that are each a kind of accessory: BELT BRACELET TIE WATCH
Puzzle words that are each a kind of container: BASKET BIN CHEST HAMPER
Puzzle words that are each a kind of circular shape: BAND CIRCLE HOOP RING
Puzzle words that are each a kind of place to shop: MALL MARKET OUTLET STORE
Puzzle words that are each a kind of wrench: ALLEN CRESCENT MONKEY SOCKET
Step 2: See if other puzzle words can share that connection by being a kind of the same thing. Try to arrive at a group of
four puzzle words that are each a kind of the same thing. This is a New York Times puzzle, so unique, subtle connections
are more likely to be correct than simple and obvious connections. If you can’t identify a group of four puzzle words that
are each a kind of the same thing, start over at Step 1.

B.12. Template 12
Pattern: Puzzle words that can be synonymous nouns
Step 1: Identify two puzzle words that can be synonymous nouns. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. As precisely as possible, describe how the
two puzzle words can be synonymous nouns.
Examples from prior puzzles:
Puzzle words that can each be a noun meaning “scheme”: PLOT PLOY RUSE TRICK
Puzzle words that can each be a noun meaning “swindler”: CHEAT CROOK QUACK SHARK
Puzzle words that can each be a noun meaning “curmudgeon”: CRAB CRANK GROUCH GRUMP
Puzzle words that can each be a noun meaning “boldness”: GALL GUTS NERVE STONES
Puzzle words that can each be a noun meaning “brief moment”: FLASH JIFFY SECOND WINK
Puzzle words that can each be a noun meaning “comedian’s output”: BIT JOKE ROUTINE SKETCH
Puzzle words that can each be a noun meaning “little bit” in a recipe: DASH DROP PINCH SPLASH
Puzzle words that can each be a noun meaning “darling”: BABY BOO DEAR LOVE
Puzzle words that can each be a noun meaning “paper” in a book: FOLIO LEAF PAGE SHEET


Puzzle words that can each be a noun meaning “sequence”: CHAIN SERIES STRING TRAIN
Puzzle words that can each be a noun meaning “quarrel”: FIGHT ROW SCRAP TIFF
Puzzle words that can each be a noun meaning “portion of profit”: CUT PIECE SHARE TAKE
Step 2: See if other puzzle words can also be synonymous with the puzzle words you’ve identified. Try to arrive at a group
of four puzzle words that can be synonymous nouns with each other. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. If you can’t identify a group of four puzzle
words that can be synonymous nouns, start over at Step 1.

B.13. Template 13
Pattern: Puzzle words that can be synonymous nouns
Step 1: Identify two puzzle words that can be synonymous nouns. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. As precisely as possible, describe how the
two puzzle words can be synonymous nouns.
Examples from prior puzzles:
Puzzle words that can each be a noun meaning “scheme”: PLOT PLOY RUSE TRICK
Puzzle words that can each be a noun meaning “swindler”: CHEAT CROOK QUACK SHARK
Puzzle words that can each be a noun meaning “curmudgeon”: CRAB CRANK GROUCH GRUMP
Puzzle words that can each be a noun meaning “boldness”: GALL GUTS NERVE STONES
Puzzle words that can each be a noun meaning “brief moment”: FLASH JIFFY SECOND WINK
Puzzle words that can each be a noun meaning “comedian’s output”: BIT JOKE ROUTINE SKETCH
Puzzle words that can each be a noun meaning “little bit” in a recipe: DASH DROP PINCH SPLASH
Puzzle words that can each be a noun meaning “darling”: BABY BOO DEAR LOVE
Puzzle words that can each be a noun meaning “paper” in a book: FOLIO LEAF PAGE SHEET
Puzzle words that can each be a noun meaning “sequence”: CHAIN SERIES STRING TRAIN
Puzzle words that can each be a noun meaning “quarrel”: FIGHT ROW SCRAP TIFF
Puzzle words that can each be a noun meaning “portion of profit”: CUT PIECE SHARE TAKE
Step 2: See if other puzzle words can also be synonymous with the puzzle words you’ve identified. Try to arrive at a group
of four puzzle words that can be synonymous nouns with each other. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. If you can’t identify a group of four puzzle
words that can be synonymous nouns, start over at Step 1.

B.14. Template 14
Pattern: Puzzle words that each connect to different meanings of the same word or letter.
Step 1: Identify two puzzle words that each connect to different meanings of the same word or letter. This is a New York
Times puzzle, so unique, particular connections are more likely to be correct than generic and ubiquitous connections. As
precisely as possible, describe how each of the two puzzle words connects to different meanings of the same word or letter.
Examples from prior puzzles:
Puzzle words that are each connected to different meanings of the word “delivered”: BABY BLOW PACKAGE SPEECH
Puzzle words that are each connected to different meanings of the word “stub”: CIGARETTE PENCIL TICKET TOE
Puzzle words that are each connected to different meanings of the word “mole”: ANIMAL BIRTHMARK SPY UNIT
Puzzle words that are each connected to different meanings of the word “slots”: ATM CASINO SCHEDULE SPATULA


Puzzle words that are each different meanings of the letter “I”: IODINE IOTA MYSELF ONE
Puzzle words that are each different meanings of the letter “K”: KELVIN OKAY POTASSIUM THOUSAND
Step 2: See if other puzzle words can connect to different meanings of the same word or letter. Try to arrive at a group of
four puzzle words that each connect to different meanings of the same word. This is a New York Times puzzle, so unique,
particular connections are more likely to be correct than generic and ubiquitous connections. If you can’t identify a group of
four puzzle words that each connect to different meanings of the same word, start over at Step 1.

B.15. Template 15
Pattern: Puzzle words that are each followed by the same word or phrase
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm words that can follow each puzzle word. This is a New York Times
puzzle, so you want to be thinking up unique and particular words and phrases rather than generic and ubiquitous words and
phrases.
Examples from prior puzzles:
Puzzle words that are each followed by “George”: BOY BY CURIOUS SAINT
Puzzle words that are each followed by “Valley”: DEATH HIDDEN SILICON UNCANNY
Puzzle words that are each followed by “Crane”: CONSTRUCTION FRASIER PAPER WHOOPING
Puzzle words that are each followed by “up”: AMP FIRE HYPE PUMP
Puzzle words that are each followed by “jam”: NBA PAPER PEARL TRAFFIC
Puzzle words that are each followed by “bear”: BOO-BOO BROWN HONEY TEDDY
Puzzle words that are each followed by “chip”: BLUE COMPUTER POKER POTATO
Puzzle words that are each followed by “wheel”: CHEESE COLOR HAMSTER PRAYER
Step 2: Identify puzzle words that have the same word or phrase that follows them. This is a New York Times puzzle, so
unique and particular words and phrases are more likely to be correct than generic and ubiquitous words.
Step 3: See if other puzzle words can have that word follow them. Try to arrive at a group of four puzzle words that can
each be followed by the same word or phrase. Take your time and thoroughly consider each possibility. If you can’t identify
a group of four puzzle words that are each followed by the same word or phrase, start over at Step 2.
Example Input and Output:
Input: Let’s brainstorm a possible solution to this puzzle: CROP CURIOUS BAD TANK GEE FLY BEE FRESH TEE RAD
EX JAY BOY POLO BY SAINT
Output:
To tackle this puzzle effectively, let’s begin by analyzing potential phrases that can follow each puzzle word in the list. The
objective is to identify any common phrase that can logically append to each puzzle word.
Initial Brainstorming:
- **CROP** - dust, circles, top, rotation
- **CURIOUS** - George, case, about, nature
- **BAD** - news, influence, boy, habit
- **TANK** - top, engine, battle, fish
- **GEE** - whiz, thanks, golly, willikers
- **FLY** - swatter, fishing, high, away


- **BEE** - hive, line, pollen, movie
- **FRESH** - air, produce, start, prince
- **TEE** - shirt, time, off, ball
- **RAD** - tech, radiation, cool, level
- **EX** - wife, husband, file, change
- **JAY** - bird, walk, Z, blue
- **BOY** - scout, friend, George, band
- **POLO** - shirt, match, water, club
- **BY** - pass, law, product, George
- **SAINT** - Bernard, Mary, George, Nicholas
Observations:
Observing a connection here with the word “George”, as it follows CURIOUS (from the character Curious George), BOY (from the singer Boy George), SAINT (Saint George), and BY (as in the exclamation “By George!”, although a bit of a stretch). Here’s a pattern match:
**Pattern:** Words that are each followed by “George”
- CURIOUS
- BOY
- SAINT
- BY
Now, let’s test another set of puzzle words:
1. **BAD** - followed by ”guy”, ”habit”, ”blood”, etc.
2. **FRESH** - followed by ”air”, ”produce”, ”prince”, etc.
3. **TANK** - followed by ”top”, ”engine”, ”battle”, etc.
4. **TEE** - followed by ”shirt”, ”time”, ”off”, etc.
No obvious connection stands out here as strongly as the previous one.
Let’s try one more grouping:
1. **CROP** - followed by ”top”, ”circle”, ”dusting”, etc.
2. **TANK** - again, followed by ”top”, ”engine”, ”battle”, etc.
3. **FRESH** - mentioned earlier, ”air”, ”produce”, ”prince”, etc.
4. **EX** - followed by ”girlfriend”, ”president”, ”employee”, etc.
Here, the words ”CROP” and ”TANK” both could follow with ”top” (crop top, tank top). However, extending this to four
words with the same following phrase doesn’t work in this selection.
In conclusion, for this set, the strongest pattern match I found involves:
- **CURIOUS**
- **BOY**
- **SAINT**
- **BY**


All of these can logically be followed by the word “George”.


— end of example input and output —
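Steps 2–3 of this template amount to inverting a word-to-followers map and keeping any follower shared by four puzzle words. A minimal sketch of that bookkeeping (the `BRAINSTORM` dict is hypothetical Step 1 output, standing in for the model's brainstorm):

```python
from collections import defaultdict

# Hypothetical Step 1 brainstorm: puzzle word -> candidate following words.
BRAINSTORM = {
    "CURIOUS": ["george", "case"],
    "BOY": ["scout", "george", "band"],
    "SAINT": ["bernard", "george", "nicholas"],
    "BY": ["pass", "george", "product"],
    "TANK": ["top", "engine"],
    "CROP": ["top", "circle"],
}

def shared_followers(brainstorm, min_group=4):
    """Invert word -> followers into follower -> words, keep large groups."""
    by_follower = defaultdict(set)
    for word, followers in brainstorm.items():
        for follower in followers:
            by_follower[follower].add(word)
    return {f: ws for f, ws in by_follower.items() if len(ws) >= min_group}

print(shared_followers(BRAINSTORM))  # only "george" reaches four words
```

The same inversion covers Template 10 (words that modify the same word); only the brainstorming direction changes.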

B.16. Template 16
Pattern: Puzzle words that are parts of the same thing or process
Step 1: Identify two puzzle words that are each parts of the same thing or process. This is a New York Times puzzle, so
unique, subtle connections between puzzle words are more likely to be correct than simple and obvious connections. As
precisely as possible, describe how the puzzle words are parts of the same thing or process.
Examples from prior puzzles:
Puzzle words that are each parts of a car: BUMPER HOOD TIRE TRUNK
Puzzle words that are each cuts of pork: BELLY CHOP HOCK SHOULDER
Puzzle words that are each seen in a laundry room: DRYER HAMPER IRON WASHER
Puzzle words that are each sections of a book: APPENDIX CHAPTER INDEX PREFACE
Puzzle words that are each things used to build a snowman: CARROT COAL SNOW STICKS
Puzzle words that are each parts of a golf course: BUNKER FAIRWAY GREEN ROUGH
Puzzle words that are each pinball machine components: BALL BUMPER FLIPPER PLUNGER
Puzzle words that are each parts of a shoe: EYELET LACE SOLE TONGUE
Puzzle words that are each parts of a river: BANK BED DELTA MOUTH
Puzzle words that are each in the NATO alphabet: ALFA BRAVO ROMEO TANGO
Step 2: See if other puzzle words can share that connection by being part of the same thing or process. Try to arrive at a
group of four puzzle words that are each parts of the same thing or process. This is a New York Times puzzle, so unique,
subtle connections are more likely to be correct than simple and obvious connections. If you can’t identify a group of four
puzzle words that are each parts of the same thing or process, start over at Step 1.

B.17. Template 17
Pattern: Puzzle words that can be synonymous verbs.
Step 1: Identify two puzzle words that can be synonymous verbs. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. As precisely as possible, describe how the
two puzzle words can be synonymous verbs.
Examples from prior puzzles:
Puzzle words that can each be a verb meaning “to shepherd”: DIRECT GUIDE LEAD STEER
Puzzle words that can each be a verb meaning “to contact via phone”: BUZZ CALL DIAL RING
Puzzle words that can each be a verb meaning “to separate”: DIVIDE FORK PART SPLIT
Puzzle words that can each be a verb meaning “to pester”: BADGER BUG HOUND NAG
Puzzle words that can each be a verb meaning “to get low”: CROUCH DUCK SQUAT STOOP
Puzzle words that can each be a verb meaning “to make happy”: AMUSE DELIGHT PLEASE TICKLE
Puzzle words that can each be a verb meaning “to reserve for later”: BANK SAVE STASH STORE
Puzzle words that can each be a verb meaning “to restrain”: CHECK CURB LIMIT STEM
Puzzle words that can each be a verb meaning “to move quickly”: BOLT DART DASH ZIP
Puzzle words that can each be a verb meaning “to update for accuracy”: AMEND CORRECT FIX REVISE


Puzzle words that can each be a verb meaning “to breathe heavily”: GASPS HUFFS PANTS PUFFS
Step 2: See if other puzzle words can also be synonymous with the puzzle words you’ve identified. Try to arrive at a group
of four puzzle words that can be synonymous verbs with each other. This is a New York Times puzzle, so unique, particular
synonyms are more likely to be correct than generic and ubiquitous synonyms. If you can’t identify a group of four puzzle
words that can be synonymous verbs, start over at Step 1.

B.18. Template 18
Pattern: Puzzle words that are each homophones
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm potential homophones.
Step 2: Narrow down the list to only include puzzle words that are homophones.
Step 3: From the list of puzzle words that are homophones, try to arrive at a group of four puzzle words that each have a
common connection. This is a New York Times puzzle, so unique, subtle connections are more likely to be correct than
simple and obvious connections.
Examples from prior puzzles:
Puzzle words that are each homophones of names of animals: DEAR HAIR HOARSE WAIL
Puzzle words that are each homophones of Greek Letters: MOO NEW PIE ROW
If you can’t identify a group of four puzzle words, start over.
Example Input and Output:
Input:
Let’s brainstorm a possible solution to this puzzle: ANGEL, PIRATE, SPELL, WAIL, WITCH, HIVE, COMB, WAX,
DEAR, HAIR, HONEY, HOARSE, PERIOD, STRETCH, WHILE, CLOWN
Output:
Let’s examine the list of words provided to see which ones are homophones, then narrow down the list to actual homophones,
and then see if any are homophones based on a similar connection:
**Step 1: Identify Homophones**
For each word in the list, let’s say the word aloud to find out if it is a homophone
Angel - No homophone comes to mind
Pirate - No homophone comes to mind
Spell - No homophone comes to mind
Wail - Possible homophone “whale” like the sea animal
Witch - Possible homophone “which”
Hive - Possible homophone “I’ve”
Comb - Maybe a homophone of “come”
Wax - No homophone comes to mind
Dear - Possible homophone “deer”
Hair - Possible homophone “hare” or “here” or “hear”
Honey - No homophone comes to mind
Hoarse - Possible homophone “horse”


Period - No homophone comes to mind
Stretch - No homophone comes to mind
While - Possible homophone “wile” or “wild”
Clown - No homophone comes to mind
**Step 2: Narrow down list**
Wail - Possible homophone “whale” like the sea animal
Hive - Possible homophone “I’ve”
Comb - Maybe a homophone of “come”
Dear - Possible homophone “deer”
Hair - Possible homophone “hare” or “here” or “hear”
Hoarse - Possible homophone “horse”
While - Possible homophone “wile” or “wild”
**Step 3: Extract connection from list**
There does appear to be a common set of homophones here. The connection is “animal homophones” and the words are:
WAIL, DEAR, HAIR, HOARSE
— end of example input and output —
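The three steps of this template can be sketched deterministically with a lookup table. Everything below is illustrative: the `HOMOPHONES` map is a tiny hand-built stand-in (a fuller version might consult a pronunciation dictionary such as CMUdict), and the `ANIMALS` set plays the role of Step 3's "common connection."

```python
# Hypothetical homophone table and category list, matching the example above.
HOMOPHONES = {
    "WAIL": "whale", "DEAR": "deer", "HAIR": "hare",
    "HOARSE": "horse", "WITCH": "which", "HIVE": "I've",
}
ANIMALS = {"whale", "deer", "hare", "horse"}

def animal_homophones(puzzle_words):
    # Steps 1-2: keep only puzzle words with a known homophone.
    candidates = {w: HOMOPHONES[w] for w in puzzle_words if w in HOMOPHONES}
    # Step 3: keep those whose homophone shares the "animal" connection.
    return sorted(w for w, h in candidates.items() if h in ANIMALS)

print(animal_homophones(["ANGEL", "WAIL", "WITCH", "DEAR", "HAIR", "HOARSE"]))
# ['DEAR', 'HAIR', 'HOARSE', 'WAIL']
```

Note that WITCH survives Steps 1–2 but is dropped at Step 3, mirroring how the template narrows candidates before extracting the connection.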

B.19. Template 19
Pattern: Puzzle words that each come after the same prefix or word
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm prefixes or words that can come before each puzzle word. This is a
New York Times puzzle, so unique and particular prefixes or words are more likely to be correct than generic and ubiquitous
prefixes or words.
Examples from prior puzzles:
Puzzle words that each come after “tele”: COMMUTE MEDICINE PROMPTER VISION
Puzzle words that each come after “deep”: CUT FAKE FREEZE SIX
Puzzle words that each come after “belly”: BUTTON DANCE FLOP LAUGH
Step 2: From your brainstorming notes, identify puzzle words that can have the same prefix or word come before them.
Step 3: See if other puzzle words can have that same prefix or word, trying to arrive at a group of four puzzle words that can
each have the same prefix or word. Take your time and thoroughly consider each possibility. If you can’t identify a group of
four puzzle words, start over at Step 2.
Example Input and Output:
Input: Let’s brainstorm a possible solution to this puzzle: ABSURDITY PROGRESS NONSENSE ADVANCE MARCH
SKUNK MADNESS PUSH CUT FREEZE FEET GARBAGE CHEESE FOLLY FAKE SIX
Output:
Let’s examine the list of puzzle words provided to see what prefixes or words come before these puzzle words, then identify
puzzle words from the list that have the same prefixes or words that commonly come before them, then identify what other
puzzle words could form this connection, then arrive at a conclusion.
**Step 1: Prefixes or words that commonly come before these puzzle words**


**ABSURDITY** - sheer, utter, complete
**PROGRESS** - steady, significant, slow
**NONSENSE** - complete, utter, sheer
**ADVANCE** - in, rapid, significant
**MARCH** - long, slow, annual
**SKUNK** - striped, little, dead
**MADNESS** - sheer, utter, pure
**PUSH** - final, gentle, big
**CUT** - deep, budget, price
**FREEZE** - deep, sudden, hard
**FEET** - bare, tired, sore
**GARBAGE** - household, kitchen, street
**CHEESE** - cheddar, blue, cream
**FOLLY** - sheer, utter, complete
**FAKE** - news, ID, tan
**SIX** - number, at, o’clock
**Step 2: Puzzle Words that have the same words that commonly come before them**
Sheer - Folly, Absurdity, Madness, Nonsense
Deep - Freeze, Cut
**Step 3: Other puzzle words that could form the connection of “deep”**
**ABSURDITY** - No
**PROGRESS** - No
**NONSENSE** - No
**ADVANCE** - No
**MARCH** - Maybe
**SKUNK** - No
**MADNESS** - No
**PUSH** - No
**FEET** - No
**GARBAGE** - No
**CHEESE** - No
**FOLLY** - No
**FAKE** - Yes!
**SIX** - Yes!
**Step 4: Conclusion**
Two good possibilities:


Puzzle words that follow “sheer”: FOLLY ABSURDITY MADNESS NONSENSE
Puzzle words that follow “deep”: CUT FAKE FREEZE SIX
— end of example input and output —
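The prefix case mirrors the suffix case: look up which vocabulary compounds end with each puzzle word, then group the words by the shared leading part. A hypothetical sketch (the `VOCAB` set is an illustrative stand-in for a real compound-word list):

```python
from collections import defaultdict

# Tiny stand-in vocabulary of compounds; illustrative only.
VOCAB = {
    "deepcut", "deepfake", "deepfreeze", "deepsix",
    "telecommute", "television", "bellylaugh",
}

def shared_prefixes(puzzle_words, vocab=VOCAB, min_group=4):
    """Group puzzle words by a prefix that turns each into a vocabulary word."""
    groups = defaultdict(set)
    for word in puzzle_words:
        w = word.lower()
        for target in vocab:
            # A prefix exists if a vocabulary word ends with the puzzle word.
            if target.endswith(w) and len(target) > len(w):
                groups[target[: -len(w)]].add(word)
    return {prefix: ws for prefix, ws in groups.items() if len(ws) >= min_group}

words = ["CUT", "FAKE", "FREEZE", "SIX", "COMMUTE", "LAUGH"]
print(shared_prefixes(words))  # only "deep" is shared by four words
```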

B.20. Template 20
Pattern: Puzzle words that are each one letter away from something else in common
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm words that are one letter away from each puzzle word.
Examples from prior puzzles:
Puzzle words that are each body parts plus “y”: ARMY COLONY LIVERY SHINY
Puzzle words that are each birds minus last letter: CONDO HAW HERO LOO
Step 2: Identify a connection between the words that you’ve brainstormed that applies to multiple puzzle words. This is a
New York Times puzzle, so unique, subtle connections are more likely to be correct than simple and obvious connections.
Step 3: See if other puzzle words can share the connection, trying to arrive at a group of four puzzle words that are each one
letter away from a word that shares something else in common. Take your time and thoroughly consider each possibility. If
you can’t identify a group of four puzzle words, start over at Step 2.
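The "one letter away" relation in this template is exactly edit distance one, which can be enumerated directly rather than guessed at. The sketch below uses the familiar insert/delete/replace enumeration; the `BIRDS` set is a hypothetical category list matching the "birds minus last letter" example above.

```python
import string

# Hypothetical category list for the "birds minus last letter" example.
BIRDS = {"condor", "hawk", "heron", "loon"}

def edits1(word):
    """All strings one insertion, deletion, or substitution away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in letters}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    return deletes | inserts | replaces

def one_letter_from_birds(puzzle_words, birds=BIRDS):
    """Map each puzzle word to the category words one edit away from it."""
    out = {}
    for word in puzzle_words:
        hits = edits1(word.lower()) & birds
        if hits:
            out[word] = sorted(hits)
    return out

print(one_letter_from_birds(["CONDO", "HAW", "HERO", "LOO", "ARMY"]))
# {'CONDO': ['condor'], 'HAW': ['hawk'], 'HERO': ['heron'], 'LOO': ['loon']}
```

ARMY is correctly excluded here because no bird name is a single edit away from it; catching the "body parts plus y" group instead would only require swapping in a body-part list.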

B.21. Template 21
Pattern: Puzzle words that each relate to the same thing
Step 1: Identify two puzzle words that each relate to the same thing. This is a New York Times puzzle, so unique, subtle
connections are more likely to be correct than simple and obvious connections. As precisely as possible, describe how the
two puzzle words relate to the same thing.
Examples from prior puzzles:
Puzzle words that each relate to “Happy New Year!”: BALL COUNTDOWN FIREWORKS KISS
Puzzle words that each refer to things you can do to your nose: BLOW HOLD PICK THUMB
Puzzle words that each relate to “Oomph”: ENERGY FIRE JUICE ZIP
Puzzle words that are things to blow on for wishes/luck: CANDLE DANDELION DICE EYELASH
Puzzle words that each relate to attraction: APPEAL CHARM DRAW PULL
Puzzle words that each relate to sound/hearing: ACOUSTIC AUDITORY HEARD SONIC
Puzzle words that each refer to things that are cinched in the middle: CORSET DIABOLO HOURGLASS WASP
Puzzle words that each refer to things you can set: MOOD RECORD TABLE VOLLEYBALL
Puzzle words that each relate to Italian demonyms: BOLOGNESE NEAPOLITAN PARMESAN VENETIAN
Puzzle words that each refer to spirals in nature: CYCLONE GALAXY SNAIL SUNFLOWER
Step 2: See if other puzzle words also relate to the same thing. Try to arrive at a group of four puzzle words that each relate
to the same thing. This is a New York Times puzzle, so unique, subtle connections are more likely to be correct than simple
and obvious connections. If you can’t identify a group of four puzzle words that each relate to the same thing, start over at
Step 1.

B.22. Template 22
Pattern: Puzzle words that are each a type of the same thing.

Step 1: Identify two puzzle words that are each a type of the same thing. This is a New York Times puzzle, so unique,
particular things are more likely to be correct than generic and ubiquitous things. As precisely as possible, describe how the
two puzzle words are each a type of the same thing.
Examples from prior puzzles:
Puzzle words that are each a type of name prefix: GEN MS PROF REV
Puzzle words that are each a type of road name: ALLEY DRIVE LANE STREET
Puzzle words that are each a type of interval of time: PERIOD SPELL STRETCH WHILE
Puzzle words that are each a type of farm tool: HOE PLOW RAKE SICKLE
Puzzle words that are each a type of sports venue: ARENA BOWL DOME FIELD
Puzzle words that are each a type of murky condition: CLOUD FOG HAZE MIST
Puzzle words that are each a type of luxurious fabric: CHIFFON SATIN SILK VELVET
Puzzle words that are each a type of ice cream treat: FLOAT SHAKE SPLIT SUNDAE
Puzzle words that are each a type of hardware fastener: BOLT NAIL RIVET SCREW
Puzzle words that are each a type of cooking oil: CORN OLIVE PALM PEANUT
Step 2: See if other puzzle words are also a type of the same thing. Try to arrive at a group of four puzzle words that are
each a type of the same thing. This is a New York Times puzzle, so unique, particular things are more likely to be correct
than generic and ubiquitous things. If you can’t identify a group of four puzzle words that are each a type of the same thing,
start over at Step 1.
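Step 2's goal, collecting exactly four puzzle words under one shared type, can be expressed as a simple grouping operation. In the template itself the word-to-type judgments come from the LLM's brainstorming; the `CANDIDATE_TYPES` labels below are hypothetical stand-ins we supply for illustration:

```python
from collections import defaultdict

# Hypothetical word -> type judgments; in the template these come from the LLM.
CANDIDATE_TYPES = {
    "HOE": "farm tool", "PLOW": "farm tool", "RAKE": "farm tool", "SICKLE": "farm tool",
    "CLOUD": "murky condition", "FOG": "murky condition", "HAZE": "murky condition",
}

def groups_of_four(type_map: dict[str, str]) -> dict[str, list[str]]:
    """Return only those types shared by exactly four puzzle words."""
    by_type = defaultdict(list)
    for word, kind in type_map.items():
        by_type[kind].append(word)
    return {kind: sorted(words) for kind, words in by_type.items() if len(words) == 4}
```

Here only "farm tool" survives, because "murky condition" has just three members, mirroring the template's instruction to start over when a group of four cannot be completed.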

B.23. Template 23
Pattern: Puzzle words that each come after the same word or phrase
Steps to take:
Step 1: Examine the list of puzzle words to brainstorm words that can come before each puzzle word. This is a New York
Times puzzle, so unique and particular words are more likely to be correct than generic and ubiquitous words.
Examples from prior puzzles:
Puzzle words that each come after “SECRET”: AGENT CODE SANTA SAUCE
Puzzle words that each come after “FUNNY”: BONE BUSINESS GIRL PAGES
Puzzle words that each come after “SILVER”: FOX LINING SCREEN SPOON
Puzzle words that each come after “STICKY”: FINGERS NOTE RICE WICKET
Puzzle words that each come after “SECOND”: FIDDLE GUESS NATURE WIND
Puzzle words that each come after “PLAY”: DATE DEAD DIRTY DOH
Puzzle words that each come after “CAT’S”: CRADLE EYE MEOW PAJAMAS
Puzzle words that each come after “SPACE”: BAR CADET HEATER STATION
Puzzle words that each come after “LIVING”: LARGE LEGEND PROOF ROOM
Puzzle words that each come after “HEAD OF”: HAIR LETTUCE STATE STEAM
Step 2: From your brainstorming notes, identify puzzle words that can have the same word come before them.
Step 3: See if other puzzle words can have that same word come before them, trying to arrive at a group of four words that
can each have the same word come before them. Take your time and thoroughly consider each possibility. If you can’t
identify a group of four words, start over at Step 2.

Example Input and Output:


Input: Let’s brainstorm a possible solution to this puzzle: BABY SEGMENT DEAR SANTA LINE FIRE POINT BOO
RAY JUICE ENERGY AGENT ZIP LOVE CODE SAUCE
Output:
Let’s examine the list of puzzle words provided to see what words come before these words, then identify puzzle words
from the list that have the same words that commonly come before them, then identify what other puzzle words could form
this connection, then arrive at a conclusion.
**Step 1: Words that commonly come before these puzzle words**
**BABY** - newborn, my, little, sweet
**SEGMENT** - market, audience
**DEAR** - my, oh
**SANTA** - dear, secret
**LINE** - front, base, guide, head, through
**FIRE** - forest, camp, house, bon
**POINT** - focal, starting, main, critical
**BOO** - peek-a-, boo-
**RAY** - x-, sun, gamma, manta
**JUICE** - orange, apple, fruit, grape
**ENERGY** - renewable, solar, kinetic, nuclear
**AGENT** - special, real estate, secret, travel
**ZIP** -
**LOVE** - true, self, romantic, unconditional
**CODE** - zip, area, bar, dress
**SAUCE** - hot, tomato, soy, barbecue
**Step 2: Puzzle words that have the same words that commonly come before them**
Secret - SANTA, AGENT
**Step 3: What other puzzle word could have secret come before them?**
Secret BABY? No
Secret SEGMENT? No
Secret DEAR? No
Secret LINE? No
Secret FIRE? No
Secret POINT? No
Secret BOO? Unlikely
Secret RAY? No
Secret JUICE? No
Secret ENERGY? Unlikely

Secret ZIP? No
Secret LOVE? Maybe
Secret CODE? Yes
Secret SAUCE? Yes
**Step 4: Conclusion**
Words that each come after the word “secret”: SANTA AGENT CODE SAUCE
— end of example input and output —
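The final check in the example output above amounts to a membership test: which puzzle words form a known phrase when "secret" is placed in front of them? A sketch only, with a hand-picked phrase set standing in for the collocational knowledge that, in the template, the LLM itself supplies:

```python
# Hypothetical phrase inventory; a real solver relies on the LLM's knowledge
# of collocations rather than a fixed list.
KNOWN_PHRASES = {"SECRET SANTA", "SECRET AGENT", "SECRET CODE", "SECRET SAUCE"}

def words_following(prefix: str, puzzle_words: list[str], phrases: set[str]) -> list[str]:
    """Puzzle words that form a known phrase with `prefix` placed before them."""
    return [w for w in puzzle_words if f"{prefix} {w}" in phrases]
```

Run over the example's sixteen puzzle words with prefix "SECRET", this reproduces the conclusion: SANTA AGENT CODE SAUCE.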

B.24. Template 24
Pattern: Puzzle words that each share the same unique characteristic. For this pattern, the connection is with how the puzzle
words are structured or how the words sound, not with what the words mean.
Steps to take:
Step 1: Say different puzzle words aloud and try to identify two puzzle words that share a unique characteristic related to
how the word is structured or sounds. This is a New York Times puzzle, so unique and particular characteristics are more
likely to be the correct connection than generic and ubiquitous characteristics. Think outside the box.
Examples from prior puzzles:
Puzzle words that are each heteronyms: BASS DOVE DESERT WIND
Puzzle words that are each letter spellings: BEE EX GEE JAY
Puzzle words that are each two letters + number: CANINE FREIGHT OFTEN STONE
Puzzle words that are each examples of onomatopoeia: BANG PLOP SPLASH THUD
Puzzle words that are each “-ough” words that don’t rhyme: BOUGH COUGH DOUGH TOUGH
Puzzle words that are each words you say twice in a row: BOO POM TOM YO
Puzzle words that are each palindromes: BIB EYE GAG POP
Step 2: See if other puzzle words can have that same characteristic. Try to arrive at a group of four puzzle words. Try
saying each puzzle word aloud to hear if the linguistic characteristic is present. Take your time and thoroughly consider
each possibility. If you can’t identify a group of four words, start over at Step 1.
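Unlike the semantic patterns in the earlier templates, some of the structural characteristics listed in Step 1 can be verified programmatically. A sketch for the palindrome case (the word list is ours; characteristics such as heteronyms or onomatopoeia do not reduce to a simple string check):

```python
def is_palindrome(word: str) -> bool:
    """True when the word reads the same forwards and backwards."""
    w = word.upper()
    return w == w[::-1]

def find_palindromes(puzzle_words: list[str]) -> list[str]:
    """Keep only the puzzle words that are palindromes."""
    return [w for w in puzzle_words if is_palindrome(w)]
```

For instance, `find_palindromes(["BIB", "EYE", "GAG", "POP", "BOO"])` keeps the four palindromes from the prior-puzzle example and drops "BOO".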
