Figure 1. VISPROG is a modular and interpretable neuro-symbolic system for compositional visual reasoning. Given a few examples of natural language instructions and the desired high-level programs, VISPROG generates a program for any new instruction using in-context learning in GPT-3 and then executes the program on the input image(s) to obtain the prediction. VISPROG also summarizes the intermediate outputs into an interpretable visual rationale (Fig. 4). We demonstrate VISPROG on tasks that require composing a diverse set of modules for image understanding and manipulation, knowledge retrieval, and arithmetic and logical operations.
In contrast, VISPROG uses a powerful language model (GPT-3) with in-context learning, which only involves a feedforward pass.

Our key contributions include: (i) VISPROG, a system that uses the in-context learning ability of a language model to generate visual programs from natural language instructions for compositional visual tasks (Sec. 3); (ii) demonstrating the flexibility of VISPROG on complex visual tasks such as factual knowledge object tagging and language-guided image editing (Secs. 4.3 and 4.4) that have eluded or seen limited success with a single end-to-end model; and (iii) producing visual rationales for these tasks and showing their utility for error analysis and user-driven instruction tuning to improve VISPROG's performance significantly (Sec. 5.3).

2. Related Work

Neuro-symbolic approaches have seen renewed momentum owing to the incredible understanding, generation, and in-context learning capabilities of large language models (LLMs). We now discuss previous program generation and execution approaches for visual tasks, recent work in using LLMs for vision, and advances in reasoning methods for language tasks.

Program generation and execution for visual tasks. Neural module networks (NMN) [2] pioneered modular and compositional approaches for the visual question answering (VQA) task. NMNs compose neural modules into an end-to-end differentiable network. While early attempts use off-the-shelf parsers [2], recent methods [9, 10, 12] learn the layout generation model jointly with the neural modules using REINFORCE [33] and weak answer supervision. While similar in spirit to NMNs, VISPROG has several advantages over NMNs. First, VISPROG generates high-level programs that invoke trained state-of-the-art neural models and other python functions at intermediate steps, as opposed to generating end-to-end neural networks. This makes it easy to incorporate symbolic, non-differentiable modules. Second, VISPROG leverages the in-context learning ability of LLMs [5] to generate programs by prompting the LLM (GPT-3) with a natural language instruction (or a visual question or a statement to be verified) along with a few examples of similar instructions and their corresponding programs, thereby removing the need to train specialized program generators for each task.

LLMs for visual tasks. LLMs and in-context learning have been applied to visual tasks. PICa [34] uses LLMs for a knowledge-based VQA [20] task. PICa represents the visual information in images as text via captions, objects, and attributes and feeds this textual representation to GPT-3 along with the question and in-context examples to directly generate the answer. Socratic models (SMs) [36] compose pretrained models from different modalities such as language (BERT [7], GPT-2 [24]), vision-language (CLIP [23]), and audio-language (mSLAM [3]) to perform a number of zero-shot tasks, including image captioning, video-to-text retrieval, and robot planning. However, in SMs the composition is pre-determined and fixed for each task. In contrast, VISPROG determines how to compose models for each instance by generating programs based on the instruction, question, or statement. We demonstrate VISPROG's ability to handle complex instructions that involve diverse capabilities (20 modules) and varied input (text, image, and image pairs), intermediate (text, image, bounding boxes, segmentation masks), and output modalities (text and images). Similar to VISPROG, ProgPrompt [29] is a concurrent work that demonstrates the ability of LLMs to generate python-like situated robot action plans from natural language instructions. While ProgPrompt modules (such as "find" or "grab") take strings (typically object names) as input, VISPROG programs are more general. In each step of a VISPROG program, a module could accept multiple arguments including strings, numbers, arithmetic and logical expressions, or arbitrary python objects (such as list() or dict() instances containing bounding boxes or segmentation masks) produced by previous steps.

Reasoning via Prompting in NLP. There is a growing body of literature [14, 17] on using LLMs for language reasoning tasks via prompting. Chain-of-Thought (CoT) prompting [32], where a language model is prompted with in-context examples of inputs, chain-of-thought rationales (a series of intermediate reasoning steps), and outputs, has shown impressive abilities for solving math reasoning problems. While CoT relies on the ability of LLMs to both generate a reasoning path and execute it, approaches similar to VISPROG have been applied to language tasks, where a decomposer prompt [15] is first used to generate a sequence of sub-tasks which are then handled by sub-task handlers.

3. Visual Programming

Over the last few years, the AI community has produced high-performance, task-specific models for many vision and language tasks such as object detection, segmentation, VQA, captioning, and text-to-image generation. While each of these models solves a well-defined but narrow problem, the tasks we usually want to solve in the real world are often broader and loosely defined.

To solve such practical tasks, one has to either collect a new task-specific dataset, which can be expensive, or meticulously compose a program that invokes multiple neural models, image processing subroutines (e.g. image resizing, cropping, filtering, and colorspace conversions), and other computation (e.g. database lookup, or arithmetic and logical operations). Manually creating these programs for the infinitely long tail of complex tasks we encounter
VISPROG programs often use output variables from past steps as inputs to future steps. We use descriptive module names (e.g. "Select", "ColorPop", "Replace"), argument names (e.g. "image", "object", "query"), and variable names (e.g. "IMAGE", "OBJ") to allow GPT-3 to understand the input and output types and the function of each module. During execution, the output variables may be used to store arbitrary data types. For instance, "OBJ"s are lists of objects in the image, with a mask, bounding box, and text (e.g. a category name) associated with each object.
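For example, an instruction like "Create a color pop of the dog" might map to a program of the following form (an illustrative sketch only; the exact module names and argument lists follow the in-context examples used for each task):

OBJ0=Seg(image=IMAGE)
OBJ1=Select(image=IMAGE,object=OBJ0,query='dog',category=None)
IMAGE0=ColorPop(image=IMAGE,object=OBJ1)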
These in-context examples are fed into GPT-3 along
with a new natural language instruction. Without observing
the image or its content, VISPROG generates a program
(bottom of Fig. 3) that can be executed on the input
image(s) to perform the described task.
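The actual prompt formats are shown in the appendix figures; the snippet below is only a generic sketch of assembling a completion prompt, under the assumption that each in-context example is an (instruction, program) pair of strings and that the resulting string is sent to GPT-3 for completion.

def build_prompt(in_context_examples, instruction):
    # Concatenate (instruction, program) pairs, then append the new instruction
    # with an empty program slot for the language model to complete.
    parts = []
    for example_instruction, example_program in in_context_examples:
        parts.append("Instruction: " + example_instruction + "\nProgram:\n" + example_program + "\n")
    parts.append("Instruction: " + instruction + "\nProgram:\n")
    return "\n".join(parts)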
class VisProgModule():
    def __init__(self):
        # load a trained model; move to GPU
        pass
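The listing above only shows the constructor. As a purely hypothetical illustration of how a module in this style might execute a program step, consider an expression-evaluation module; the execute signature, the state dictionary, and the module name below are assumptions for illustration, not VISPROG's actual interface:

class EvalModule(VisProgModule):
    def __init__(self):
        # purely symbolic module; no neural model to load
        pass

    def execute(self, expr, state):
        # state maps output variable names from earlier steps (e.g. 'ANSWER0')
        # to their values; substitute them into the expression and evaluate it.
        return eval(expr.format(**state))

For instance, EvalModule().execute('{ANSWER0} + {ANSWER1} == 2', {'ANSWER0': 1, 'ANSWER1': 1}) returns True.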
4. Tasks

VISPROG provides a flexible framework that can be applied to a diverse range of complex visual tasks. We evaluate VISPROG on 4 tasks that require capabilities ranging from spatial reasoning and reasoning about multiple images to knowledge retrieval and image generation and manipulation. Fig. 5 summarizes the inputs, outputs, and modules used for these tasks. We now describe these tasks, their evaluation settings, and the choice of in-context examples.
Prompts. We manually annotate 31 random questions from GQA's balanced train set with the desired VISPROG programs. Annotating questions with programs is easy and requires writing down the chain of reasoning required to answer that particular question. We provide a smaller subset of in-context examples to GPT-3, randomly sampled from this list, to reduce the cost of answering each GQA question.

4.2. Zero-Shot Reasoning on Image Pairs

VQA models are trained to answer questions about a single image. In practice, one might require a system to answer questions about a collection of images. For example, a user may ask a system to parse their vacation photo album and answer the question: "Which landmark did we visit, the day after we saw the Eiffel Tower?". Instead of assembling an expensive dataset and training a multi-image model, we demonstrate the ability of VISPROG to use a single-image VQA system to solve a task involving multiple images without training on multi-image examples.

We showcase this ability on the NLVRv2 [30] benchmark, which involves verifying statements about image pairs. Typically, tackling the NLVRv2 challenge requires training custom architectures that take image pairs as input on NLVRv2's train set. Instead, VISPROG achieves this by decomposing a complex statement into simpler questions about individual images and a python expression involving arithmetic and logical operators and the answers to the image-level questions. The VQA model ViLT-VQA is used to get image-level answers, and the python expression is evaluated to verify the statement.
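For illustration, a statement such as "There are two dogs in total." could be decomposed along these lines (an illustrative sketch; actual programs are generated by GPT-3 from the in-context examples, and the module and variable names may differ):

ANSWER0=VQA(image=LEFT,question='How many dogs are in the image?')
ANSWER1=VQA(image=RIGHT,question='How many dogs are in the image?')
ANSWER2=EVAL(expr='{ANSWER0} + {ANSWER1} == 2')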
Evaluation. We create a small validation set by sampling 250 random samples from the NLVRv2 dev set to guide prompt selection, and test generalization on NLVRv2's full public test set.

Prompts. We sample and annotate VISPROG programs for 16 random statements in the NLVRv2 train set. Since some of these examples are redundant (similar program structure), we also create a curated subset of 12 examples by removing 4 redundant ones.

4.3. Factual Knowledge Object Tagging

We often want to identify people and objects in images whose names are unknown to us. For instance, we might want to identify celebrities, politicians, characters in TV shows, flags of countries, logos of corporations, popular cars and their manufacturers, species of organisms, and so on. Solving this task requires not only localizing people, faces, and objects but also looking up factual knowledge in an external knowledge base to construct a set of categories for classification, such as the names of the characters on a TV show. We refer to this task as Factual Knowledge Object Tagging, or Knowledge Tagging for short.

For solving Knowledge Tagging, VISPROG uses GPT-3 as an implicit knowledge base that can be queried with natural language prompts such as "List the main characters on the TV show Big Bang Theory separated by commas." This generated category list can then be used by a CLIP image classification module that classifies image regions produced by localization and face detection modules. VISPROG's program generator automatically determines whether to use a face detector or an open-vocabulary localizer depending on the context in the natural language instruction. VISPROG also estimates the maximum size of the category list retrieved. For instance, "Tag the logos of the top 5 German car companies" generates a list of 5 categories, while "Tag the logos of German car companies" produces a list of arbitrary length determined by GPT-3, with a cut-off at 20. This allows users to easily control the noise in the classification process by tweaking their instructions.
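A minimal sketch of this retrieval-and-parsing step is shown below; the complete_fn argument stands in for a call to GPT-3 and is an assumption, as is the exact prompt wording.

def retrieve_categories(query, complete_fn, max_items=20):
    # Ask the LLM for a comma-separated list and parse the completion into
    # a category list, truncating to max_items (default cut-off of 20).
    completion = complete_fn(query + " separated by commas:")
    categories = [c.strip() for c in completion.split(",") if c.strip()]
    return categories[:max_items]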
Evaluation. To evaluate VISPROG on this task, we annotate 100 tagging instructions across 46 images that require external knowledge to tag 253 object instances, including personalities across pop culture, politics, sports, and art, as well as a variety of objects (e.g. cars, flags, fruits, appliances, furniture, etc.). For each instruction, we measure both localization and tagging performance via precision (the fraction of predicted boxes that are correct) and recall (the fraction of ground truth objects that are correctly predicted). Tagging metrics require both the predicted bounding box and the associated tag or class label to be correct, while localization ignores the tag. To determine localization correctness, we use an IoU threshold of 0.5. We summarize localization and tagging performance by F1 scores (the harmonic mean of the average precision and recall across instructions).
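Concretely, with per-instruction precision and recall values already computed, the summary F1 described above is a small aggregation of the following form:

def summary_f1(precisions, recalls):
    # Average precision and recall across instructions, then take their harmonic mean.
    avg_p = sum(precisions) / len(precisions)
    avg_r = sum(recalls) / len(recalls)
    return 2 * avg_p * avg_r / (avg_p + avg_r) if (avg_p + avg_r) > 0 else 0.0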
Prompts. We create 14 in-context examples for this task. Note that the instructions for these examples were hallucinated, i.e. no images were associated with these examples.
4.4. Image Editing with Natural Language

Text-to-image generation has made impressive strides over the last few years with models like DALL-E [26], Parti [35], and Stable Diffusion [28]. However, it is still beyond the capability of these models to handle prompts like "Hide the face of Daniel Craig with :p" (de-identification or privacy preservation), or "Create a color pop of Daniel Craig and blur the background" (object highlighting), even though these are relatively simple to achieve programmatically using a combination of face detection, segmentation, and image processing modules. Achieving a sophisticated edit such as "Replace Barack Obama with Barack Obama wearing sunglasses" (object replacement) first requires identifying the object of interest, generating a mask of the object to be replaced, and then invoking an image inpainting model (we use Stable Diffusion) with the original image, a mask specifying the pixels to replace, and a description of the new pixels to generate at that location. VISPROG, when equipped with the necessary modules and example programs, can handle very complex instructions with ease.
programs, can handle very complex instructions with ease. evaluated on the val set (24 in-context examples), we
Evaluation. To test V IS P ROG on the image editing instruc- compare the random strategy consisting of the V IS P ROG’s
tions for de-identification, object highlighting, and object best prompt chosen amongst 5 runs on the validation set
replacement, we collect 107 instructions across 65 images. (each run randomly samples in-context examples from 31
We manually score the predictions for correctness and re- annotated examples) and the majority voting strategy which
port accuracy. Note that we do not penalize visual artifacts takes maximum consensus predictions for each question
Table 1. Prompting strategies evaluated on the GQA testdev set.

Method      Prompting strategy   Runs   Context examples per run   Accuracy
ViLT-VQA    -                    1      -                          47.8
VISPROG     curated              1      20                         50.0
VISPROG     random               1      24                         48.2
VISPROG     voting               5      24                         50.5
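The voting strategy in Tab. 1 can be sketched as follows, assuming each run yields a dictionary mapping question ids to predicted answers (a simplification, not the actual evaluation code):

from collections import Counter

def majority_vote(predictions_per_run):
    # predictions_per_run: one {question_id: answer} dict per random-seed run.
    voted = {}
    for qid in predictions_per_run[0]:
        answers = [run[qid] for run in predictions_per_run]
        voted[qid] = Counter(answers).most_common(1)[0][0]
    return voted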
Figure 9. Instruction tuning using visual rationales. By revealing the reason for failure, VISPROG allows a user to modify the original instruction to improve performance.
Instruction tuning. To be useful, a visual rationale must ultimately allow users to improve the performance of the system on their task. For the knowledge tagging and image editing tasks, we study whether visual rationales can help a user modify or tune the instructions to achieve better performance. Fig. 9 shows that modified instructions: (i) result in a better query for the localization module (e.g. "kitchen appliance" instead of "item"); (ii) provide a more informative query for knowledge retrieval (e.g. "most recent CEO of IBM" instead of "CEO of IBM"); (iii) provide a category name (e.g. "table-merged") for the Select module to restrict the search to segmented regions belonging to the specified category; or (iv) control the number of classification categories for knowledge tagging through the max argument in the List module. Tables 3 and 4 show that instruction tuning results in significant gains for the knowledge tagging and image editing tasks.

Acknowledgments

We thank Kanchan Aggarwal for helping with the annotation process for the image editing and knowledge tagging tasks. We are also grateful to the amazing Hugging Face ecosystem for simplifying the use of state-of-the-art neural models for implementing VISPROG modules.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022. 2
[2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 39–48, 2016. 2, 3
[3] Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, and Alexis Conneau. mslam: Massively multilingual joint pre-training for speech and text. ArXiv, abs/2202.01374, 2022. 3
[4] Gary Bradski. The opencv library. Dr. Dobb's Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000. 2
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. 2, 3
[6] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021. 2
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. 3
[8] Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards general purpose vision systems. ArXiv, abs/2104.00743, 2021. 2
[9] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable neural computation via stack neural module networks. In ECCV, 2018. 2, 3
[10] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. 2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813, 2017. 2, 3
[11] Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6693–6702, 2019. 5
[12] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. Inferring and executing programs for visual reasoning. 2017 IEEE International Conference on Computer Vision (ICCV), pages 3008–3017, 2017. 2, 3
[13] Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, and Aniruddha Kembhavi. Webly supervised concept expansion for general purpose vision models. In ECCV, 2022. 2
[14] Tushar Khot, Kyle Richardson, Daniel Khashabi, and Ashish Sabharwal. Learning to solve complex tasks by talking to agents. ArXiv, abs/2110.08542, 2021. 3
[15] Tushar Khot, H. Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. ArXiv, abs/2210.02406, 2022. 3
[16] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021. 2, 5
[17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. ArXiv, abs/2205.11916, 2022. 3
[18] Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang, Jilin Li, and Feiyue Huang. Dsfd: Dual shot face detector. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5055–5064, 2019. 2
[19] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. ArXiv, abs/2206.08916, 2022. 2
[20] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3190–3199, 2019. 3
[21] Matthias Minderer, Alexey A. Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vision transformers. ArXiv, abs/2205.06230, 2022. 2
[22] Zoe Papakipos and Joanna Bitton. Augly: Data augmentations for robustness. ArXiv, abs/2201.06494, 2022. 12
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 3, 12
[24] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. 3
[25] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683, 2020. 2
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092, 2021. 7
[27] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley D. Edwards, Nicolas Manfred Otto Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. ArXiv, abs/2205.06175, 2022. 2
[28] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick
Esser, and Björn Ommer. High-resolution image synthesis
with latent diffusion models. 2022 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages
10674–10685, 2022. 2, 7
[29] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal,
Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thoma-
son, and Animesh Garg. Progprompt: Generating situ-
ated robot task plans using large language models. ArXiv,
abs/2209.11302, 2022. 3
[30] Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and
Yoav Artzi. A corpus for reasoning about natural language
grounded in photographs. ArXiv, abs/1811.00491, 2019. 6
[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le,
Ed Chi, and Denny Zhou. Self-consistency improves
chain of thought reasoning in language models. ArXiv,
abs/2203.11171, 2022. 7
[32] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought
prompting elicits reasoning in large language models. ArXiv,
abs/2201.11903, 2022. 3, 7
[33] Ronald J Williams. Simple statistical gradient-following al-
gorithms for connectionist reinforcement learning. Machine
learning, 8(3):229–256, 1992. 2, 3
[34] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yu-
mao Lu, Zicheng Liu, and Lijuan Wang. An empirical study
of gpt-3 for few-shot knowledge-based vqa. In AAAI, 2022.
3
[35] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-
jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-
fei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei
Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge,
and Yonghui Wu. Scaling autoregressive models for content-
rich text-to-image generation. ArXiv, abs/2206.10789, 2022.
7
[36] Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choro-
manski, Adrian Wong, Stefan Welker, Federico Tombari,
Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny
Lee, Vincent Vanhoucke, and Pete Florence. Socratic mod-
els: Composing zero-shot multimodal reasoning with lan-
guage. arXiv, 2022. 3
A. Appendix

This appendix includes:
• Task prompts for VISPROG (Sec. A.1)
• Module implementation details (Sec. A.2)
• Many more qualitative results with visual rationales, for both successful and failure cases, available at https://prior.allenai.org/projects/visprog
Select. The module takes a query and a category argument. When the category is provided, the selection is only performed over the regions that have been identified as belonging to that category by a previous module in the program (typically the Seg module). If category is None, the selection is performed over all regions. The query is the text to be used for region-text scoring to perform the selection. We use CLIP-ViT [23] to select the region with the maximum score for the query. When the query contains multiple phrases separated by commas, the highest-scoring region is selected for each phrase.

Figure 12. Image editing prompt. Note that the prompt includes a mapping of emojis to their names in the AugLy [22] library that is used to implement the Emoji module. The third example shows how to provide the category value for the Select module.

Classify. The Classify module takes lists of object regions and categories and tries to assign one of the categories to each region. For simplicity, we assume the images in the tagging task have at most 1 instance of each category. The Classify module operates differently based on whether the category list has 1 or more elements. If the category list
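Returning to the Select module described above, the region-text scoring step could be sketched as follows with the Hugging Face transformers CLIP classes; treating region crops as PIL images scored against a single query is an illustrative assumption, not the exact VISPROG implementation.

import torch
from transformers import CLIPModel, CLIPProcessor

def select_region(region_crops, query,
                  model_name="openai/clip-vit-large-patch14"):
    # Score each candidate region crop against the query text with CLIP and
    # return the index of the highest-scoring region.
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[query], images=region_crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[:, 0]
    return int(scores.argmax())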
Figure 13. Knowledge tagging prompt. Note that the prompt has an additional placeholder to configure the default max value for the List module. While the first example infers max from a natural instruction, the third example demonstrates how a user might minimally augment a natural instruction to provide argument values.
Figure 14. Prompt for the List module. list_max denotes the default maximum list length, and new_query is the placeholder for the new retrieval query.