
Visual Programming: Compositional visual reasoning without training

Tanmay Gupta, Aniruddha Kembhavi
PRIOR @ Allen Institute for AI
https://prior.allenai.org/projects/visprog

arXiv:2211.11559v1 [cs.CV] 18 Nov 2022

Figure 1. VISPROG is a modular and interpretable neuro-symbolic system for compositional visual reasoning. Given a few examples of natural language instructions and the desired high-level programs, VISPROG generates a program for any new instruction using in-context learning in GPT-3 and then executes the program on the input image(s) to obtain the prediction. VISPROG also summarizes the intermediate outputs into an interpretable visual rationale (Fig. 4). We demonstrate VISPROG on tasks that require composing a diverse set of modules for image understanding and manipulation, knowledge retrieval, and arithmetic and logical operations.

Abstract

We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing subroutines, or python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on 4 diverse tasks - compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue to easily and effectively expand the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.
1. Introduction

The pursuit of general purpose AI systems has led to the development of capable end-to-end trainable models [1, 5, 8, 13, 19, 25, 27], many of which aspire to provide a simple natural language interface for a user to interact with the model. The predominant approach to building these systems has been massive-scale unsupervised pretraining followed by supervised multitask training. However, this approach requires a well curated dataset for each task, which makes it challenging to scale to the infinitely long tail of complex tasks we would eventually like these systems to perform. In this work, we explore the use of large language models to tackle the long tail of complex tasks by decomposing these tasks, described in natural language, into simpler steps that may be handled by specialized end-to-end trained models or other programs.

Figure 2. Modules currently supported in VISPROG. Red modules use neural models (OWL-ViT [21], DSFD [18], MaskFormer [6], CLIP [23], ViLT [16], and Stable Diffusion [28]). Blue modules use image processing and other python subroutines. These modules are invoked in programs generated from natural language instructions. Adding new modules to extend VISPROG's capabilities is straightforward (Code 1).
Imagine instructing a vision system to "Tag the 7 main characters on the TV show Big Bang Theory in this image." To perform this task, the system first needs to understand the intent of the instruction and then perform a sequence of steps - detect the faces, retrieve the list of main characters on Big Bang Theory from a knowledge base, classify faces using the list of characters, and tag the image with the recognized characters' faces and names. While different vision and language systems exist to perform each of these steps, executing this task described in natural language is beyond the scope of end-to-end trained systems.

We introduce VISPROG, which inputs visual data (a single image or a set of images) along with a natural language instruction, generates a sequence of steps, a visual program if you will, and then executes these steps to produce the desired output. Each line in a visual program invokes one among a wide range of modules currently supported by the system. Modules may be off-the-shelf computer vision models, language models, image processing subroutines in OpenCV [4], or arithmetic and logical operators. Modules consume inputs that are produced by executing previous lines of code and output intermediate results that can be consumed downstream. In the example above, the visual program generated by VISPROG invokes a face detector [18], GPT-3 [5] as a knowledge retrieval system, and CLIP [23] as an open-vocabulary image classifier to produce the desired output (see Fig. 1).

VISPROG improves upon previous methods for generating and executing programs for vision applications. For the visual question answering (VQA) task, Neural Module Networks (NMN) [2, 9, 10, 12] compose a question-specific, end-to-end trainable network from specialized, differentiable neural modules. These approaches either use brittle, off-the-shelf semantic parsers to deterministically compute the layout of modules, or learn a layout generator through weak answer supervision via REINFORCE [33]. In contrast, VISPROG uses a powerful language model (GPT-3) and a small number of in-context examples to create complex programs without requiring any training¹. Programs created by VISPROG also use a higher level of abstraction than NMNs and invoke trained state-of-the-art models and non-neural python subroutines (Fig. 2). These advantages make VISPROG an easy-to-use, performant, and modular neuro-symbolic system.

VISPROG is also highly interpretable. First, VISPROG produces easy-to-understand programs which a user can verify for logical correctness. Second, by breaking down the prediction into simple steps, VISPROG allows a user to inspect the outputs of intermediate steps to diagnose errors and, if required, intervene in the reasoning process. Altogether, an executed program with intermediate step results (e.g. text, bounding boxes, segmentation masks, generated images, etc.) linked together to depict the flow of information serves as a visual rationale for the prediction.

To demonstrate its flexibility, we use VISPROG for 4 different tasks that share some common skills (e.g. for image parsing) while also requiring some degree of specialized reasoning and visual manipulation capabilities. These tasks are - (i) compositional visual question answering; (ii) zero-shot natural language visual reasoning (NLVR) on image pairs; (iii) factual knowledge object tagging from natural language instructions; and (iv) language-guided image editing. We emphasize that neither the language model nor any of the modules are finetuned in any way. Adapting VISPROG to any task is as simple as providing a few in-context examples consisting of natural language instructions and the corresponding programs. While easy to use, VISPROG shows an impressive gain of 2.7 points over a base VQA model on the compositional VQA task, strong zero-shot accuracy of 62.4% on NLVR without ever training on image pairs, and delightful qualitative and quantitative results on knowledge tagging and image editing tasks.

¹We use "training" to refer to gradient-based learning to differentiate it from in-context learning which only involves a feedforward pass.
Our key contributions include - (i) VISPROG - a system that uses the in-context learning ability of a language model to generate visual programs from natural language instructions for compositional visual tasks (Sec. 3); (ii) demonstrating the flexibility of VISPROG on complex visual tasks such as factual knowledge object tagging and language-guided image editing (Secs. 4.3 and 4.4) that have eluded or seen limited success with a single end-to-end model; and (iii) producing visual rationales for these tasks and showing their utility for error analysis and user-driven instruction tuning to improve VISPROG's performance significantly (Sec. 5.3).

2. Related Work

Neuro-symbolic approaches have seen renewed momentum owing to the incredible understanding, generation, and in-context learning capabilities of large language models (LLMs). We now discuss previous program generation and execution approaches for visual tasks, recent work in using LLMs for vision, and advances in reasoning methods for language tasks.

Program generation and execution for visual tasks. Neural module networks (NMN) [2] pioneered modular and compositional approaches for the visual question answering (VQA) task. NMNs compose neural modules into an end-to-end differentiable network. While early attempts use off-the-shelf parsers [2], recent methods [9, 10, 12] learn the layout generation model jointly with the neural modules using REINFORCE [33] and weak answer supervision. While similar in spirit to NMNs, VISPROG has several advantages over NMNs. First, VISPROG generates high-level programs that invoke trained state-of-the-art neural models and other python functions at intermediate steps, as opposed to generating end-to-end neural networks. This makes it easy to incorporate symbolic, non-differentiable modules. Second, VISPROG leverages the in-context learning ability of LLMs [5] to generate programs by prompting the LLM (GPT-3) with a natural language instruction (or a visual question or a statement to be verified) along with a few examples of similar instructions and their corresponding programs, thereby removing the need to train specialized program generators for each task.

LLMs for visual tasks. LLMs and in-context learning have been applied to visual tasks. PICa [34] uses LLMs for a knowledge-based VQA [20] task. PICa represents the visual information in images as text via captions, objects, and attributes and feeds this textual representation to GPT-3 along with the question and in-context examples to directly generate the answer. Socratic models (SMs) [36] compose pretrained models from different modalities such as language (BERT [7], GPT-2 [24]), vision-language (CLIP [23]), and audio-language (mSLAM [3]) to perform a number of zero-shot tasks, including image captioning, video-to-text retrieval, and robot planning. However, in SMs the composition is pre-determined and fixed for each task. In contrast, VISPROG determines how to compose models for each instance by generating programs based on the instruction, question, or statement. We demonstrate VISPROG's ability to handle complex instructions that involve diverse capabilities (20 modules) and varied input (text, image, and image pairs), intermediate (text, image, bounding boxes, segmentation masks), and output modalities (text and images). Similar to VISPROG, ProgPrompt [29] is a concurrent work that demonstrates the ability of LLMs to generate python-like situated robot action plans from natural language instructions. While ProgPrompt modules (such as "find" or "grab") take strings (typically object names) as input, VISPROG programs are more general. In each step of a VISPROG program, a module can accept multiple arguments including strings, numbers, arithmetic and logical expressions, or arbitrary python objects (such as list() or dict() instances containing bounding boxes or segmentation masks) produced by previous steps.

Reasoning via Prompting in NLP. There is a growing body of literature [14, 17] on using LLMs for language reasoning tasks via prompting. Chain-of-Thought (CoT) prompting [32], where a language model is prompted with in-context examples of inputs, chain-of-thought rationales (a series of intermediate reasoning steps), and outputs, has shown impressive abilities for solving math reasoning problems. While CoT relies on the ability of LLMs to both generate a reasoning path and execute it, approaches similar to VISPROG have been applied to language tasks, where a decomposer prompt [15] is first used to generate a sequence of sub-tasks which are then handled by sub-task handlers.

3. Visual Programming

Over the last few years, the AI community has produced high-performance, task-specific models for many vision and language tasks such as object detection, segmentation, VQA, captioning, and text-to-image generation. While each of these models solves a well-defined but narrow problem, the tasks we usually want to solve in the real world are often broader and loosely defined.

To solve such practical tasks, one has to either collect a new task-specific dataset, which can be expensive, or meticulously compose a program that invokes multiple neural models, image processing subroutines (e.g. image resizing, cropping, filtering, and colorspace conversions), and other computation (e.g. database lookup, or arithmetic and logical operations). Manually creating these programs for the infinitely long tail of complex tasks we encounter daily not only requires programming expertise but is also slow, labor intensive, and ultimately insufficient to cover the space of all tasks. What if we could describe the task in natural language and have an AI system generate and execute the corresponding visual program without any training?
Large language models for visual programming. Large language models such as GPT-3 have shown a remarkable ability to generalize to new samples for a task having seen a handful of input and output demonstrations in-context. For example, prompting GPT-3 with two English-to-French translation examples and a new English phrase

good morning -> bonjour
good day -> bonne journée
good evening ->

produces the French translation "bonsoir". Note that we did not have to finetune GPT-3 to perform the task of translation on the third phrase. VISPROG uses this in-context learning ability of GPT-3 to output visual programs for natural language instructions.

Similar to the English and French translation pairs in the example above, we prompt GPT-3 with pairs of instructions and the desired high-level program. Fig. 3 shows such a prompt for an image editing task. The programs in the in-context examples are manually written and can typically be constructed without an accompanying image. Each line of a VISPROG program, or a program step, consists of the name of a module, the module's input argument names and their values, and an output variable name. VISPROG programs often use output variables from past steps as inputs to future steps. We use descriptive module names (e.g. "Select", "ColorPop", "Replace"), argument names (e.g. "image", "object", "query"), and variable names (e.g. "IMAGE", "OBJ") to allow GPT-3 to understand the input and output type, and the function, of each module. During execution, the output variables may be used to store arbitrary data types. For instance, "OBJ" variables are lists of objects in the image, with a mask, bounding box, and text (e.g. category name) associated with each object.

These in-context examples are fed into GPT-3 along with a new natural language instruction. Without observing the image or its content, VISPROG generates a program (bottom of Fig. 3) that can be executed on the input image(s) to perform the described task.

Figure 3. Program generation in VISPROG.
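As a rough sketch of this generation step (an illustration, not the released implementation), the prompt can be assembled by concatenating the in-context instruction-program pairs with the new instruction and passing the result to a text-completion model; call_llm below is a hypothetical stand-in for the GPT-3 client, and the exact prompt formatting used by VISPROG is the one shown in Fig. 3.

# Minimal sketch of program generation via in-context learning.
# call_llm() is a hypothetical stand-in for a GPT-3 completion call.
from typing import Callable, List, Tuple

def build_prompt(examples: List[Tuple[str, str]], instruction: str) -> str:
    parts = []
    for nl, program in examples:
        parts.append(f"Instruction: {nl}\nProgram:\n{program}\n")
    parts.append(f"Instruction: {instruction}\nProgram:\n")
    return "\n".join(parts)

def generate_program(call_llm: Callable[[str], str],
                     examples: List[Tuple[str, str]],
                     instruction: str) -> str:
    # The LLM only sees text; the image is never part of the prompt.
    return call_llm(build_prompt(examples, instruction))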

from typing import Any, Dict, List

class VisProgModule():
    def __init__(self):
        # load a trained model; move to GPU
        ...

    def html(self, inputs: List, output: Any) -> str:
        # return an html string visualizing step I/O
        ...

    def parse(self, step: str):
        # parse step and return the list of input values
        # and variables, and the output variable name
        ...

    def execute(self, step: str, state: Dict):
        inputs, input_var_names, output_var_name = self.parse(step)

        # get values of input variables from state
        for var_name in input_var_names:
            inputs.append(state[var_name])

        # perform computation using the loaded model
        # (some_computation is a placeholder for the module-specific logic)
        output = some_computation(inputs)

        # update state
        state[output_var_name] = output

        # visual summary of the step computation
        step_html = self.html(inputs, output)
        return output, step_html

Code 1. Implementation of a VISPROG module.

Modules. VISPROG currently supports 20 modules (Fig. 2) for enabling capabilities such as image understanding, image manipulation (including generation), knowledge retrieval, and performing arithmetic and logical operations. In VISPROG, each module is implemented as a Python class (Code 1) that has methods to: (i) parse the line to extract the input argument names and values, and the output variable name; (ii) execute the necessary computation, which may involve trained neural models, and update the program state with the output variable name and value; and (iii) summarize the step's computation visually using html (used later to create a visual rationale). Adding new modules to VISPROG simply requires implementing and registering a module class, while the execution of programs using this module is handled automatically by the VISPROG interpreter, which is described next.
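Before turning to the interpreter, the sketch below illustrates what registering a new module might look like. The Count module name, the hand-rolled argument parsing, and the registry decorator are assumptions made for this illustration; they are not the actual VISPROG code, which follows the richer interface of Code 1.

# Illustrative sketch of adding and registering a module (names, step format,
# and registry are assumptions, not the released implementation).
from typing import Any, Dict

MODULE_REGISTRY: Dict[str, Any] = {}

def register_module(name):
    def wrap(cls):
        MODULE_REGISTRY[name] = cls()
        return cls
    return wrap

@register_module("Count")
class CountModule:
    def execute(self, step: str, state: Dict) -> Any:
        # e.g. step = "ANSWER0=Count(box=BOX0)"
        output_var, call = step.split("=", 1)
        arg_var = call[call.index("box=") + len("box="):].rstrip(")")
        boxes = state[arg_var]      # list of bounding boxes from a prior step
        output = len(boxes)         # the module's actual computation
        state[output_var.strip()] = output
        return output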
Program Execution. The program execution is handled by an interpreter. The interpreter initializes the program state (a dictionary mapping variable names to their values) with the inputs, and steps through the program line-by-line while invoking the correct module with the inputs specified in that line. After executing each step, the program state is updated with the name and value of the step's output.
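A minimal sketch of such an interpreter loop, assuming modules shaped like Code 1 and looked up by the module name appearing in each step, is shown below; it is an illustration of the description above rather than the released interpreter.

# Sketch of the interpreter loop (illustrative). Modules are assumed to follow
# Code 1 and to be indexed by the module name in each program step.
from typing import Any, Dict, List, Tuple

def execute_program(program: str,
                    modules: Dict[str, Any],
                    inputs: Dict[str, Any]) -> Tuple[Any, str]:
    state: Dict[str, Any] = dict(inputs)   # program state: variable name -> value
    rationale_parts: List[str] = []
    output = None
    for step in program.strip().split("\n"):
        module_name = step.split("=", 1)[1].split("(", 1)[0].strip()
        module = modules[module_name]
        # Each module parses its own step, reads its inputs from state,
        # computes its output, and writes the output back into state.
        output, step_html = module.execute(step, state)
        rationale_parts.append(step_html)
    # Stitching the per-step HTML summaries yields the visual rationale.
    return output, "\n".join(rationale_parts)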
Visual Rationale. In addition to performing the necessary computation, each module class also implements a method called html() to visually summarize the inputs and outputs of the module in an HTML snippet. The interpreter simply stitches the HTML summary of all program steps into a visual rationale (Fig. 4) that can be used to analyze the logical correctness of the program as well as inspect the intermediate outputs. The visual rationales also enable users to understand reasons for failure and tweak the natural language instructions minimally to improve performance. See Sec. 5.3 for more details.

Figure 4. Visual rationales generated by VISPROG. These rationales visually summarize the input and output of each computational step in the generated program during inference for an image editing (top) and NLVR (bottom) task.

4. Tasks

VISPROG provides a flexible framework that can be applied to a diverse range of complex visual tasks. We evaluate VISPROG on 4 tasks that require capabilities ranging from spatial reasoning and reasoning about multiple images to knowledge retrieval and image generation and manipulation. Fig. 5 summarizes the inputs, outputs, and modules used for these tasks. We now describe these tasks, their evaluation settings, and the choice of in-context examples.

Figure 5. We evaluate VISPROG on a diverse set of tasks. The tasks span a variety of inputs and outputs and reuse modules (Loc, FaceDet, VQA) whenever possible.

4.1. Compositional Visual Question Answering

VISPROG is compositional by construction, which makes it suitable for the compositional, multi-step visual question answering task: GQA [11]. Modules for the GQA task include those for open vocabulary localization, a VQA module, functions for cropping image regions given bounding box coordinates or spatial prepositions (such as above, left, etc.), a module to count boxes, and a module to evaluate Python expressions. For example, consider the question: "Is the small truck to the left or to the right of the people that are wearing helmets?". VISPROG first localizes "people wearing helmets", crops the region to the left (or right) of these people, checks if there is a "small truck" on that side, and returns "left" if so and "right" otherwise. VISPROG uses the question answering module based on ViLT [16], but instead of simply passing the complex original question to ViLT, VISPROG invokes it for simpler tasks like identifying the contents within an image patch. As a result, our resulting VISPROG for GQA is not only more interpretable than ViLT but also more accurate (Tab. 1). Alternatively, one could completely eliminate the need for a QA model like ViLT and use other systems like CLIP and object detectors, but we leave that for future investigation.
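For instance, a generated program for this question could look like the following sketch. The VISPROG-style step format follows the paper's figures, but the exact module names (e.g. Crop_Left, Eval) and argument signatures here are illustrative assumptions rather than copies of the released prompts.

# Illustrative VISPROG-style program for the question above
# (module names such as Crop_Left and Eval are assumptions for this sketch):
BOX0=Loc(image=IMAGE,object='people wearing helmets')
IMAGE0=Crop_Left(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Is there a small truck?')
FINAL_ANSWER=Eval(expr="'left' if {ANSWER0} == 'yes' else 'right'")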
Evaluation. In order to limit the money spent on generating programs with GPT-3, we create a subset of GQA for evaluation. Each question in GQA is annotated with a question type. To evaluate on a diverse set of question types (∼100 detailed types), we randomly sample up to k samples per question type from the balanced val (k = 5) and testdev (k = 20) sets.
Figure 6. Qualitative results for the image editing (top) and knowledge tagging (bottom) tasks.

Prompts. We manually annotate 31 random questions from the balanced train set with the desired VISPROG programs. Annotating questions with programs is easy and requires writing down the chain of reasoning required to answer that particular question. We provide a smaller subset of in-context examples to GPT-3, randomly sampled from this list, to reduce the cost of answering each GQA question.

4.2. Zero-Shot Reasoning on Image Pairs

VQA models are trained to answer questions about a single image. In practice, one might require a system to answer questions about a collection of images. For example, a user may ask a system to parse their vacation photo album and answer the question: "Which landmark did we visit the day after we saw the Eiffel Tower?". Instead of assembling an expensive dataset and training a multi-image model, we demonstrate the ability of VISPROG to use a single-image VQA system to solve a task involving multiple images without training on multi-image examples.

We showcase this ability on the NLVRv2 [30] benchmark, which involves verifying statements about image pairs. Typically, tackling the NLVRv2 challenge requires training custom architectures that take image pairs as input on NLVRv2's train set. Instead, VISPROG achieves this by decomposing a complex statement into simpler questions about individual images and a python expression involving arithmetic and logical operators and the answers to the image-level questions. The VQA model ViLT-VQA is used to get image-level answers, and the python expression is evaluated to verify the statement.
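For a statement such as "There are two dogs in total across the two images" (a hypothetical example, chosen for illustration), the decomposition could look like the sketch below; the LEFT/RIGHT variable names, the Eval module name, and the Result step are assumptions for this sketch rather than the released prompt format.

# Illustrative VISPROG-style program for NLVR-style statement verification:
ANSWER0=VQA(image=LEFT,question='How many dogs are in the image?')
ANSWER1=VQA(image=RIGHT,question='How many dogs are in the image?')
ANSWER2=Eval(expr='{ANSWER0} + {ANSWER1} == 2')
FINAL_ANSWER=Result(var=ANSWER2)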
Evaluation. We create a small validation set by sampling 250 random samples from the NLVRv2 dev set to guide prompt selection, and test generalization on NLVRv2's full public test set.

Prompts. We sample and annotate VISPROG programs for 16 random statements in the NLVRv2 train set. Since some of these examples are redundant (similar program structure), we also create a curated subset of 12 examples by removing 4 redundant ones.

4.3. Factual Knowledge Object Tagging

We often want to identify people and objects in images whose names are unknown to us. For instance, we might want to identify celebrities, politicians, characters in TV shows, flags of countries, logos of corporations, popular cars and their manufacturers, species of organisms, and so on. Solving this task requires not only localizing people, faces, and objects but also looking up factual knowledge in an external knowledge base to construct a set of categories for classification, such as the names of the characters on a TV show. We refer to this task as Factual Knowledge Object Tagging, or Knowledge Tagging for short.

For solving Knowledge Tagging, VISPROG uses GPT-3 as an implicit knowledge base that can be queried with natural language prompts such as "List the main characters on the TV show Big Bang Theory separated by commas." This generated category list can then be used by a CLIP image classification module that classifies image regions produced by localization and face detection modules. VISPROG's program generator automatically determines whether to use a face detector or an open-vocabulary localizer depending on the context in the natural language instruction. VISPROG also estimates the maximum size of the category list retrieved. For instance, "Tag the logos of the top 5 german car companies" generates a list of 5 categories, while "Tag the logos of german car companies" produces a list of arbitrary length determined by GPT-3 with a cut-off at 20. This allows users to easily control the noise in the classification process by tweaking their instructions.
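Putting these pieces together, a program for the introductory instruction "Tag the 7 main characters on the TV show Big Bang Theory in this image" could resemble the sketch below. FaceDet, List, Select, and Classify appear in the paper's figures and appendix; the Tag and Result steps and the exact argument signatures are assumptions for this illustration.

# Illustrative VISPROG-style program for factual knowledge object tagging:
FACES=FaceDet(image=IMAGE)
LIST0=List(query='main characters on the TV show Big Bang Theory',max=7)
OBJ0=Classify(image=IMAGE,object=FACES,categories=LIST0)
IMAGE0=Tag(image=IMAGE,object=OBJ0)
FINAL_ANSWER=Result(var=IMAGE0)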
Evaluation. To evaluate VISPROG on this task, we annotate 100 tagging instructions across 46 images that require external knowledge to tag 253 object instances, including personalities across pop culture, politics, sports, and art, as well as a variety of objects (e.g. cars, flags, fruits, appliances, furniture, etc.). For each instruction, we measure both localization and tagging performance via precision (fraction of predicted boxes that are correct) and recall (fraction of ground truth objects that are correctly predicted). Tagging metrics require both the predicted bounding box and the associated tag or class label to be correct, while localization ignores the tag. To determine localization correctness, we use an IoU threshold of 0.5. We summarize localization and tagging performance by F1 scores (the harmonic mean of the average precision and recall across instructions).
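Read as code, the metric described above can be sketched as follows: per-instruction precision and recall with greedy IoU matching, averaged across instructions before taking the harmonic mean. This is a plain reimplementation of the description, not the authors' evaluation script.

# Sketch of the tagging/localization metrics (not the authors' script).
# Predictions and ground truth are (box, label) pairs; localization ignores labels.
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def precision_recall(preds, gts, use_labels: bool, thresh: float = 0.5):
    matched, correct = set(), 0
    for box, label in preds:
        for i, (gbox, glabel) in enumerate(gts):
            if i in matched:
                continue
            if iou(box, gbox) >= thresh and (not use_labels or label == glabel):
                matched.add(i)
                correct += 1
                break
    return correct / max(len(preds), 1), correct / max(len(gts), 1)

def f1(avg_precision: float, avg_recall: float) -> float:
    # harmonic mean of precision and recall averaged across instructions
    return 2 * avg_precision * avg_recall / (avg_precision + avg_recall + 1e-9)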
Prompts. We create 14 in-context examples for this task. Note that the instructions for these examples were hallucinated, i.e. no images were associated with these examples.
4.4. Image Editing with Natural Language

Text-to-image generation has made impressive strides over the last few years with models like DALL-E [26], Parti [35], and Stable Diffusion [28]. However, it is still beyond the capability of these models to handle prompts like "Hide the face of Daniel Craig with :p" (de-identification or privacy preservation) or "Create a color pop of Daniel Craig and blur the background" (object highlighting), even though these are relatively simple to achieve programmatically using a combination of face detection, segmentation, and image processing modules. Achieving a sophisticated edit such as "Replace Barack Obama with Barack Obama wearing sunglasses" (object replacement) first requires identifying the object of interest, generating a mask of the object to be replaced, and then invoking an image inpainting model (we use Stable Diffusion) with the original image, a mask specifying the pixels to replace, and a description of the new pixels to generate at that location. VISPROG, when equipped with the necessary modules and example programs, can handle very complex instructions with ease.
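For the object highlighting example above, a generated program could look like the sketch below. Seg, Select, and ColorPop are module names that appear in the paper; BgBlur and Result, and the exact argument signatures, are assumptions made for this illustration.

# Illustrative VISPROG-style program for
# "Create a color pop of Daniel Craig and blur the background":
OBJ0=Seg(image=IMAGE)
OBJ1=Select(image=IMAGE,object=OBJ0,query='Daniel Craig',category=None)
IMAGE0=ColorPop(image=IMAGE,object=OBJ1)
IMAGE1=BgBlur(image=IMAGE0,object=OBJ1)
FINAL_ANSWER=Result(var=IMAGE1)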
Evaluation. To test VISPROG on image editing instructions for de-identification, object highlighting, and object replacement, we collect 107 instructions across 65 images. We manually score the predictions for correctness and report accuracy. Note that we do not penalize visual artifacts for the object replacement sub-task, which uses Stable Diffusion, as long as the generated image is semantically correct.

Prompts. Similar to knowledge tagging, we create 10 in-context examples for this task with no associated images.

5. Experiments and Analysis

Our experiments evaluate the effect of the number of prompts on GQA and NLVR performance (Sec. 5.1), the generalization of VISPROG on the four tasks comparing various prompting strategies (Sec. 5.2), the sources of error for each task (Fig. 8), and the utility of visual rationales for diagnosing errors and improving VISPROG's performance through instruction tuning (Sec. 5.3).

5.1. Effect of prompt size

Fig. 7 shows that validation performance increases progressively with the number of in-context examples used in the prompts for both GQA and NLVR. Each run randomly selects a subset of the annotated in-context examples based on a random seed. We also find that majority voting across the random seeds leads to consistently better performance than the average performance across runs. This is consistent with findings in the Chain-of-Thought [32] reasoning literature for math reasoning problems [31]. On NLVR, the performance of VISPROG saturates with fewer prompts than GQA. We believe this is because NLVRv2 programs require fewer modules and hence fewer demonstrations for using those modules than GQA.

Figure 7. Performance improves with the number of in-context examples on the GQA and NLVRv2 validation sets. The error bars represent the 95% confidence interval across 5 runs. Predictions from the same runs are used for majority voting. (Sec. 5.1)

5.2. Generalization

GQA. In Tab. 1 we evaluate different prompting strategies on the GQA testdev set. For the largest prompt size evaluated on the val set (24 in-context examples), we compare the random strategy, consisting of VISPROG's best prompt chosen amongst 5 runs on the validation set (each run randomly samples in-context examples from 31 annotated examples), and the majority voting strategy, which takes the maximum consensus prediction for each question across 5 runs. While "random" prompts only slightly outperform ViLT-VQA, voting leads to a significant gain of 2.7 points. This is because voting across multiple runs, each with a different set of in-context examples, effectively increases the total number of in-context examples seen for each prediction. We also evaluate a manually curated prompt consisting of 20 examples - 16 from the 31 annotated examples, and 4 additional hallucinated examples meant to provide better coverage of failure cases observed in the validation set. The curated prompt performs just as well as the voting strategy while using 5× less compute, highlighting the promise of prompt engineering.
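The voting strategy itself is straightforward; a sketch of consensus over per-run predictions, assuming each run yields one answer string per question and breaking ties by first occurrence, is shown below.

# Sketch of majority voting across runs (assumption: one answer string per
# question per run; ties broken by first occurrence).
from collections import Counter
from typing import Dict, List

def majority_vote(per_run_answers: List[Dict[str, str]]) -> Dict[str, str]:
    consensus = {}
    for q in per_run_answers[0]:
        votes = Counter(run[q] for run in per_run_answers)
        consensus[q] = votes.most_common(1)[0][0]
    return consensus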
  Method     Prompting strategy   Runs   Context examples per run   Accuracy
  ViLT-VQA   -                    1      -                          47.8
  VISPROG    curated              1      20                         50.0
  VISPROG    random               1      24                         48.2
  VISPROG    voting               5      24                         50.5

Table 1. GQA testdev results. We report performance on a subset of the original GQA testdev set as described in Sec. 4.1.

NLVR. Tab. 2 shows the performance of VISPROG on the NLVRv2 test set and compares the random, voting, and curated prompting strategies as done with GQA. While VISPROG performs the NLVR task zero-shot, i.e. without ever training on image pairs, we report ViLT-NLVR, a ViLT model finetuned on NLVRv2, as an upper bound on performance. While several points behind the upper bound, VISPROG shows strong zero-shot performance using only a single-image VQA model for image understanding and an LLM for reasoning. Note that VISPROG uses ViLT-VQA for its VQA module, which is trained on VQAv2, a single-image question answering task, but not on NLVRv2.

  Method      Prompting strategy   Finetuned   Runs   Context examples per run   Accuracy
  ViLT-NLVR   -                    yes         1      -                          76.3
  VISPROG     curated              no          1      12                         61.8
  VISPROG     random               no          1      16                         61.3
  VISPROG     voting               no          5      16                         62.4

Table 2. NLVRv2 test results. VISPROG performs NLVR zero-shot, i.e. without training any module on image pairs. ViLT-NLVR, a ViLT model finetuned on NLVRv2, serves as an upper bound.

Knowledge Tagging. Tab. 3 shows localization and tagging performance for the Knowledge Tagging task. All instructions for this task not only require open vocabulary localization but also querying a knowledge base to fetch the categories to tag localized objects with. This makes it an impossible task for object detectors alone. With the original instructions, VISPROG achieves an impressive 63.7% F1 score for tagging, which involves both correctly localizing and naming the objects, and an 80.6% F1 score for localization alone. Visual rationales in VISPROG allow further performance gains by modifying the instructions. See Fig. 6 for qualitative examples and Sec. 5.3 for more details on instruction tuning.

                 Tagging                        Localization
  Instructions   precision   recall   F1        precision   recall   F1
  Original       69.0        59.1     63.7      87.2        74.9     80.6
  Modified       77.6        73.9     75.7      87.4        82.5     84.9

Table 3. Knowledge tagging results. The table shows performance on original instructions as well as modified instructions created after inspecting visual rationales to understand instance-specific sources of errors.

Image Editing. Tab. 4 shows the performance on the language-guided image editing task. Fig. 6 shows the wide range of manipulations possible with the current set of modules in VISPROG, including face manipulations, highlighting one or more objects in the image via stylistic effects like color popping and background blur, and changing scene context by replacing key elements in the scene (e.g. desert).

             Original   Modified
  Accuracy   59.8       66.4

Table 4. Image editing results. We manually evaluate each prediction for semantic correctness.

Figure 8. Sources of error in VISPROG.
Figure 9. Instruction tuning using visual rationales. By revealing the reason for failure, VISPROG allows a user to modify the original instruction to improve performance. The panels show:

(a) Original: "Tag the CEO of IBM". Reason for failure: the knowledge query returns one of the previous CEOs of IBM (LIST0=List(query='CEO of IBM',max=1)). Modified: "Tag the most recent CEO of IBM". Reason for success: the knowledge query returns the current CEO of IBM (LIST0=List(query='most recent CEO of IBM',max=1)).

(b) Original: "Tag the Triwizard Tournament Champions". Reason for failure: List restricts the output length to 3 (LIST0=List(query='Triwizard Tournament Champions',max=3)). Modified: "Tag the 4 Triwizard Tournament Champions". Reason for success: List outputs all 4 champions (LIST0=List(query='Triwizard Tournament Champions',max=4)).

(c) Original: "Tag the item that is used to make coffee". Reason for failure: the localization module fails to detect any objects (OBJ0=Loc(image=IMAGE,object='item')). Modified: "Tag the kitchen appliance that is used to make coffee". Reason for success: the localization module detects multiple appliances, which are then filtered by Select (OBJ0=Loc(image=IMAGE,object='kitchen appliance that makes coffee')).

(d) Original: "Replace the coffee table with a glass-top modern coffee table". Reason for failure: the selection module selects an incorrect region (rug) (OBJ1=Select(query='coffee table',category=None)). Modified: "Replace the coffee table (table-merged) with a glass-top modern coffee table". Reason for success: the category restricts the search space (OBJ1=Select(query='coffee table',category='table-merged')).

5.3. Utility of Visual Rationales

Error Analysis. Rationales generated by VISPROG allow a thorough analysis of failure modes, as shown in Fig. 8. For each task, we manually inspect rationales for ∼100 samples to break down the sources of errors. Such analysis provides a clear path towards improving the performance of VISPROG on various tasks. For instance, since incorrect programs are the leading source of errors on GQA, affecting 16% of samples, performance on GQA may be improved by providing more in-context examples similar to the instructions that VISPROG currently fails on. Performance may also be improved by upgrading the models used to implement the high-error modules to more performant ones. For example, replacing the ViLT-VQA model with a better VQA model for NLVR could improve performance by up to 24%. Similarly, improving the models used to implement the "List" and "Select" modules, the major sources of error for the knowledge tagging and image editing tasks, could significantly reduce errors.

Instruction tuning. To be useful, a visual rationale must ultimately allow users to improve the performance of the system on their task. For the knowledge tagging and image editing tasks, we study whether visual rationales can help a user modify or tune the instructions to achieve better performance. Fig. 9 shows that modified instructions: (i) result in a better query for the localization module (e.g. "kitchen appliance" instead of "item"); (ii) provide a more informative query for knowledge retrieval (e.g. "most recent CEO of IBM" instead of "CEO of IBM"); (iii) provide a category name (e.g. "table-merged") for the Select module to restrict the search to segmented regions belonging to the specified category; or (iv) control the number of classification categories for knowledge tagging through the max argument in the List module. Tables 3 and 4 show that instruction tuning results in significant gains for the knowledge tagging and image editing tasks.

6. Conclusion

VISPROG proposes visual programming as a simple and effective way of bringing the reasoning capabilities of LLMs to bear on complex visual tasks. VISPROG demonstrates strong performance while generating highly interpretable visual rationales. Investigating better prompting strategies and exploring new ways of incorporating user feedback to improve the performance of neuro-symbolic systems such as VISPROG is an exciting direction for building the next generation of general-purpose vision systems.

7. Acknowledgement

We thank Kanchan Aggarwal for helping with the annotation process for the image editing and knowledge tagging tasks. We are also grateful to the amazing Hugging Face ecosystem for simplifying the use of state-of-the-art neural models for implementing VISPROG modules.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022. 2
[2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 39–48, 2016. 2, 3
[3] Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, and Alexis Conneau. mSLAM: Massively multilingual joint pre-training for speech and text. ArXiv, abs/2202.01374, 2022. 3
[4] Gary Bradski. The OpenCV library. Dr. Dobb's Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000. 2
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. 2, 3
[6] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021. 2
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. 3
[8] Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards general purpose vision systems. ArXiv, abs/2104.00743, 2021. 2
[9] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable neural computation via stack neural module networks. In ECCV, 2018. 2, 3
[10] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. 2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813, 2017. 2, 3
[11] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6693–6702, 2019. 5
[12] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. Inferring and executing programs for visual reasoning. 2017 IEEE International Conference on Computer Vision (ICCV), pages 3008–3017, 2017. 2, 3
[13] Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, and Aniruddha Kembhavi. Webly supervised concept expansion for general purpose vision models. In ECCV, 2022. 2
[14] Tushar Khot, Kyle Richardson, Daniel Khashabi, and Ashish Sabharwal. Learning to solve complex tasks by talking to agents. ArXiv, abs/2110.08542, 2021. 3
[15] Tushar Khot, H. Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. ArXiv, abs/2210.02406, 2022. 3
[16] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, 2021. 2, 5
[17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. ArXiv, abs/2205.11916, 2022. 3
[18] Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang, Jilin Li, and Feiyue Huang. DSFD: Dual shot face detector. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5055–5064, 2019. 2
[19] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. ArXiv, abs/2206.08916, 2022. 2
[20] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3190–3199, 2019. 3
[21] Matthias Minderer, Alexey A. Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vision transformers. ArXiv, abs/2205.06230, 2022. 2
[22] Zoe Papakipos and Joanna Bitton. AugLy: Data augmentations for robustness. ArXiv, abs/2201.06494, 2022. 12
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 3, 12
[24] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. 3
[25] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683, 2020. 2
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092, 2021. 7
[27] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley D. Edwards, Nicolas Manfred Otto Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. ArXiv, abs/2205.06175, 2022. 2
[28] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick
Esser, and Björn Ommer. High-resolution image synthesis
with latent diffusion models. 2022 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages
10674–10685, 2022. 2, 7
[29] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal,
Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thoma-
son, and Animesh Garg. Progprompt: Generating situ-
ated robot task plans using large language models. ArXiv,
abs/2209.11302, 2022. 3
[30] Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and
Yoav Artzi. A corpus for reasoning about natural language
grounded in photographs. ArXiv, abs/1811.00491, 2019. 6
[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le,
Ed Chi, and Denny Zhou. Self-consistency improves
chain of thought reasoning in language models. ArXiv,
abs/2203.11171, 2022. 7
[32] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought
prompting elicits reasoning in large language models. ArXiv,
abs/2201.11903, 2022. 3, 7
[33] Ronald J Williams. Simple statistical gradient-following al-
gorithms for connectionist reinforcement learning. Machine
learning, 8(3):229–256, 1992. 2, 3
[34] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yu-
mao Lu, Zicheng Liu, and Lijuan Wang. An empirical study
of gpt-3 for few-shot knowledge-based vqa. In AAAI, 2022.
3
[35] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-
jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-
fei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei
Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge,
and Yonghui Wu. Scaling autoregressive models for content-
rich text-to-image generation. ArXiv, abs/2206.10789, 2022.
7
[36] Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choro-
manski, Adrian Wong, Stefan Welker, Federico Tombari,
Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny
Lee, Vincent Vanhoucke, and Pete Florence. Socratic mod-
els: Composing zero-shot multimodal reasoning with lan-
guage. arXiv, 2022. 3
A. Appendix

This appendix includes:
• Task prompts for VISPROG (Sec. A.1)
• Module implementation details (Sec. A.2)
• Many more qualitative results with visual rationales, for both successful and failure cases, can be found at https://prior.allenai.org/projects/visprog.

A.1. Task Prompts

We show the prompt structures for the GQA (Figure 10), NLVR (Figure 11), knowledge tagging (Figure 13), and language-guided image editing (Figure 12) tasks with 3 in-context examples each.

Figure 10. GQA prompt.

Figure 11. NLVR prompt.

Figure 12. Image editing prompt. Note that the prompt includes a mapping of emojis to their names in the AugLy [22] library that is used to implement the Emoji module. The third example shows how to provide the category value for the Select module.

Figure 13. Knowledge tagging prompt. Note that the prompt has an additional placeholder to configure the default max value for the List module. While the first example infers max from a natural instruction, the third example demonstrates how a user might minimally augment a natural instruction to provide argument values.

A.2. Module Details

To help understand the generated programs better, we now provide a few implementation details about some of the modules.

Select. The module takes a query and a category argument. When the category is provided, the selection is only performed over the regions that have been identified as belonging to that category by a previous module in the program (typically the Seg module). If category is None, the selection is performed over all regions. The query is the text to be used for region-text scoring to perform the selection. We use CLIP-ViT [23] to select the region with the maximum score for the query. When the query contains multiple phrases separated by commas, the highest-scoring region is selected for each phrase.
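A rough sketch of this region-text scoring, using the Hugging Face CLIP interface and treating the region cropping as given, is shown below. The specific checkpoint name and the single-query signature are assumptions; this approximates the described behavior rather than reproducing the module's actual code.

# Sketch of CLIP-based region selection for a single query (illustrative).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def select_region(region_crops, query):
    # region_crops: list of PIL.Image crops, one per candidate region
    inputs = processor(text=[query], images=region_crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[:, 0]   # (num_regions,)
    return int(scores.argmax())   # index of the highest-scoring region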
Classify. The Classify module takes lists of object regions and categories and tries to assign one of the categories to each region. For simplicity, we assume the images in the tagging task have at most 1 instance of each category. The Classify module operates differently based on whether the category list has 1 or more elements. If the category list has only 1 element, the category is assigned to the region with the highest CLIP score, similar to the Select module. When more than one category is provided, first, each region is assigned the category with the best score. Due to classification errors, this can lead to multiple regions being assigned the same category. Therefore, for each of the assigned categories (excluding the ones that were not assigned to any region), we perform a de-duplication step that retains only the maximum scoring region for each category.
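Read as code, that assignment-plus-de-duplication rule might look like the sketch below, where scores[i][j] is assumed to be a precomputed CLIP score of region i for category j; this is an illustration of the described rule, not the module itself.

# Sketch of the category assignment and de-duplication rule (illustrative).
from typing import Dict, List

def assign_categories(scores: List[List[float]], categories: List[str]) -> Dict[int, str]:
    # Step 1: each region takes its best-scoring category.
    best = {i: max(range(len(categories)), key=lambda j: row[j])
            for i, row in enumerate(scores)}
    # Step 2: for every assigned category, keep only the maximum-scoring
    # region; the remaining regions lose their tag.
    keep: Dict[int, str] = {}
    for j in set(best.values()):
        regions = [i for i, jj in best.items() if jj == j]
        winner = max(regions, key=lambda i: scores[i][j])
        keep[winner] = categories[j]
    return keep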

List. The List module uses GPT-3 to create a flexible and powerful knowledge retriever. Fig. 14 shows the prompt provided to GPT-3 to retrieve factual knowledge.

Create comma separated lists based on the query.

Query: List at most 3 primary colors separated by commas
List:
red, blue, green

Query: List at most 2 north american states separated by commas
List:
California, Washington

Query: List at most {list_max} {new_query} separated by commas
List:

Figure 14. Prompt for the List module. list_max denotes the default maximum list length and new_query is the placeholder for the new retrieval query.
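Instantiating this template is simple string substitution; a sketch is shown below, where call_llm is a hypothetical stand-in for the GPT-3 completion call and the default cut-off of 20 follows Sec. 4.3.

# Sketch of filling the Fig. 14 template and parsing the reply (illustrative).
LIST_PROMPT_TEMPLATE = (
    "Create comma separated lists based on the query.\n\n"
    "Query: List at most 3 primary colors separated by commas\nList:\nred, blue, green\n\n"
    "Query: List at most 2 north american states separated by commas\nList:\nCalifornia, Washington\n\n"
    "Query: List at most {list_max} {new_query} separated by commas\nList:\n"
)

def retrieve_list(call_llm, new_query: str, list_max: int = 20):
    prompt = LIST_PROMPT_TEMPLATE.format(list_max=list_max, new_query=new_query)
    reply = call_llm(prompt)
    return [item.strip() for item in reply.split(",") if item.strip()]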
