ABSTRACT
The extent to which text-only language models (LMs) learn to represent features
of the non-linguistic world is an open question. Prior work has shown that pre-
trained LMs can be taught to caption images when a vision model’s parameters
are optimized to encode images in the language space. We test a stronger hy-
pothesis: that the conceptual representations learned by frozen text-only models
and vision-only models are similar enough that this can be achieved with a lin-
ear map. We show that the image representations from vision models can be
transferred as continuous prompts to frozen LMs by training only a single lin-
ear projection. Using these to prompt the LM achieves competitive performance
on captioning and visual question answering tasks compared to models that tune
both the image encoder and text decoder (such as the MAGMA model). We com-
pare three image encoders with increasing amounts of linguistic supervision seen
during pretraining: BEIT (no linguistic information), NF-ResNET (lexical cate-
gory information), and CLIP (full natural language descriptions). We find that
all three encoders perform equally well at transferring visual property informa-
tion to the language model (e.g., whether an animal is large or small), but that
image encoders pretrained with linguistic supervision more saliently encode cat-
egory information (e.g., distinguishing hippo vs. elephant) and thus perform sig-
nificantly better on benchmark language-and-vision tasks. Our results indicate
that LMs encode conceptual information structurally similarly to vision-based
models, even those that are solely trained on images. Code is available here:
https://github.com/jmerullo/limber
1 INTRODUCTION
Much recent work in NLP has revolved around studying the limits on representational capacity in-
curred by training on form-only text data, as discussed in Bender & Koller (2020). Tied to this
argument is the idea that without explicit grounding, language models are not inclined to learn con-
ceptual representations of language that reflect the rich conceptual knowledge that humans gain from
interacting with the physical, non-linguistic world. Despite this, there have been remarkable find-
ings in large language models’ abilities to generalize to and reason about non-linguistic phenomena
(Tsimpoukelli et al., 2021; Eichenberg et al., 2021; Li et al., 2021; Patel & Pavlick, 2022). Thus, an
open question in the field is to what extent (if at all) a language model trained on text-only data can
learn aspects of the physical world. In this paper, we test a specific hypothesis about the relationship
[Figure 1 schematic: a frozen image encoder encodes an image as a feature map; a linear projection is tuned to project from image space to text space; the image projections are fed as soft prompts into a generative LM (Image Encoder → Linear Proj. → Text Decoder), e.g. producing "A picture of a dog on a skateboard".]
Figure 1: We train linear projections from image representations into the input space of a language
model to produce captions describing images. We find that LMs can describe the contents of most
image representations, but performance varies based on the type of image encoder used.
between language model and image encoder representations: that these conceptual representations
can be approximately mapped to one another through a linear transformation. To do this, we train
a single linear layer to project from the representation space of images into the language space of a
generative LM without tuning any other model parameters, which we call LiMBeR: Linearly Map-
ping Between Representation spaces. That is, we linearly transform an image representation into
“soft prompts”–vector(s) in the embedding space that do not correspond to discrete language tokens
(Lester et al., 2021). The weights of this linear projection are tuned for an image captioning task
(illustrated in Figure 1). We can then evaluate its performance on vision-language (VL) tasks at test
time by exploring the text the LM generates. Because of the simplicity of the linear transformation,
we would expect that if the conceptual representation spaces of the two models are structured sim-
ilarly, this transfer will be successful and the LM will have little trouble describing the contents of
images.
We use three different image encoders with increasing levels of linguistic supervision in pretraining:
BEIT (Bao et al., 2021), Normalizer Free Resnet50 (NFRN50) (Brock et al., 2021), and CLIP (Rad-
ford et al., 2021) to train different projections into the LM. By linguistic supervision, we refer to the
extent to which the image encoder was exposed to language data during its pretraining, thus influ-
encing the expected representational similarity between it and an LM. While CLIP was pretrained to
align images with full natural language captions in a shared image-text representation space, BEIT
had no exposure to language and was trained by predicting the contents of masked out sections of
images. NFRN50 falls between these extremes, having been pretrained on an image classification task that identifies the subject of an image over the set of classes in ImageNet1k (Russakovsky et al., 2015). Although there is no natural language in this task, the pretraining objective
encourages the model to map visual features along lexical categorical concepts (the image classes)
derived from the WordNet hierarchy (Miller, 1995).
We show that prompting an LM with any of the three image encoders effectively transfers semantic
content in the image that the LM describes with natural language. However, performance also
appears proportional to the strength of the linguistic supervision the image encoder had. While CLIP and NFRN50 perform competitively with approaches that tune the models freely (e.g., Tsimpoukelli et al. (2021); Eichenberg et al. (2021)), BEIT appears to transfer mostly coarse-grained visual properties and
struggles with encouraging the LM to generate exact lexical categories. We interpret this as evidence
that models trained on either language or vision data learn conceptual spaces that are structurally
similar to each other, but that the exact degree of similarity depends on the type of supervision the
image encoder receives. In summary, we show: (1) that visual semantic information can be linearly
mapped to language models in the form of soft prompts without tuning any model parameters.
(2) That this mapping allows generative models to describe images and answer questions about
images at a level that is comparable to what is achieved by multimodal models which tune image
and language representations jointly. And (3) by training our prompting pipeline with different
image encoder backbones, we demonstrate that linguistic supervision in pretraining plays a key role
in concept formation in models and thus, the transferability of visual features from vision to text
spaces.
2 RELATED WORK
Our approach takes inspiration from recent work in adapting pretrained language models for ac-
cepting representations of images as inputs. Particularly, the Frozen and MAGMA models (Tsim-
poukelli et al., 2021; Eichenberg et al., 2021), as well as Sung et al. (2022); Alayrac et al. (2022);
Mokady et al. (2021); Luo et al. (2022); Lin et al. (2021); Zhai et al. (2022), which show that pre-
trained image and text networks can be tuned together on an image captioning task and applied to
downstream vision-language (VL) tasks. These approaches either fine-tune the pretrained models,
or train non-linear MLP projection/fusion networks between modalities, making interpretation of
the representations difficult compared to our approach. Scialom et al. (2020) show a learned linear
transformation is sufficient for BERT to encode image region representations which are then fed to
a text decoder to generate questions about the image, but it is not well understood what abstractions
LMs are able to transfer from a transformation of this type, or if a text decoder can operate on linear
transformations of visual encodings directly.
Pretrained or from-scratch LMs have typically been used for image captioning applications
in which an image representation is fed into the LM as input (Desai & Johnson, 2021; Shen et al.,
2021; Devlin et al., 2015). Gui et al. (2022); Yuan et al. pretrain vision-language models from
scratch using image-caption data. Zeng et al. (2022); Xie et al. (2022); Wang et al. (2022) augment
multimodal performance by feeding text prompts derived from VL models into an LM, in order to
incorporate knowledge learned by LM training. These show LMs can interface with visual inputs
described in text; our work questions whether the visual input can be fed directly into the LM, with-
out bridging through language first. The success of aforementioned models on VL tasks indicates
there is a representational similarity learned by text and image models independently, which we
investigate in this paper.
Our work is also highly related to the idea of model “stitching” (Lenc & Vedaldi, 2015) in which two
different models are attached at a certain layer. LiMBeR can be described as stitching the output of
an image encoder to the input of an LM in the form of soft prompts (Lester et al., 2021). Stitching
offers distinct advantages in evaluating the representational similarity between two models, as de-
scribed in Bansal et al. (2021), over other conventional methods like RSA and CKA (Kriegeskorte
et al., 2008; Kornblith et al., 2019). For example, LiMBeR allows us to show not just that CLIP en-
codings are more similar to text encodings than BEIT representations, but that BEIT representations
are nevertheless able to transfer visual property information to the LM (§5.3).
There has been considerable recent interest in establishing whether LMs model aspects of the non-linguistic world in order to model language. Lu et al. (2021) show that the weights of a pretrained
LM can generalize to tasks with different modalities. Hao et al. (2022) similarly show that LMs
can act as interfaces for multiple modalities. Li et al. (2021) show that models of entities and
situations can be derived from contextual word representations. Patel & Pavlick (2022) show that
very large LMs (GPT-3 scale (Brown et al., 2020)) can learn in-context non-linguistic conceptual
domains depicted in text. Our work differs from these in that we have an LM interface directly with
non-text data without changing model weights and show that, although fundamentally different, the
representation space of a text-only LM shares non-trivial similarities to that of several vision-based
models.
While previous work has shown success in mapping images to language model soft prompts as
a method for multimodal pretraining (e.g., Frozen, Magma; see Section 2), there have been no
attempts to restrict the mechanism behind this mapping and understand how it works. Our basic
approach is to train a single linear layer P to project from the hidden size hI of a pretrained image
encoder into the input space eL of a generative language model for an image captioning task. The
projected inputs do not correspond to discrete language tokens, and can be thought of as soft prompts
(Lester et al., 2021) representing the image. For brevity, we refer to training P as Linearly Mapping
Between Representation spaces (i.e., LiMBeR).¹ Our approach can also be viewed as paring down
the method used in Tsimpoukelli et al. (2021) and Eichenberg et al. (2021), such that the only trained
parameters reside in the projection P . By freezing the image encoder E and LM on either side of
the projection, we can examine the similarities between the representation spaces of the two as a
function of the ability of the LM to describe an image input or perform some task relating to it.
We expect that, if a language model represents visual conceptual information structurally similarly
to that learned by a vision encoder, then a simple linear transformation to the language space is all
that is required to transfer visual features into the language model. Before describing the training
procedure, we will describe the basic components of the model, and the variations we chose.
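To make the setup concrete, a minimal sketch of the trainable projection in PyTorch is given below; the class and variable names (LinearProjection, image_features) are ours for illustration, and the shapes correspond to the CLIP RN50x16 / GPT-J configuration described in the following paragraphs.

```python
import torch
import torch.nn as nn

class LinearProjection(nn.Module):
    """Single linear map P from the image encoder's space (h_I) to the LM's input space (e_L)."""
    def __init__(self, h_i: int, e_l: int):
        super().__init__()
        self.proj = nn.Linear(h_i, e_l)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, k, h_I) -- k feature vectors per image from a frozen encoder E
        # returns soft prompts of shape (batch, k, e_L) to be fed to the frozen LM
        return self.proj(image_features)

# Only P's parameters are trainable; E and the LM stay frozen.
h_i, e_l, k = 3072, 4096, 144                # e.g. CLIP RN50x16 features -> GPT-J input size
projection = LinearProjection(h_i, e_l)
image_features = torch.randn(2, k, h_i)      # stand-in for a frozen encoder's flattened feature map
soft_prompts = projection(image_features)    # prepended to the LM input as its first k "tokens"
```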
Language Model LM & Image Encoders E. We hypothesize that the conceptual representations learned by an LM are equivalent, up to a linear transformation, to the representations from an image encoder E. The language model used is the 6-billion-parameter decoder-only GPT-J model (Wang & Komatsuzaki, 2021). P is trained to project from hI to the LM's input embedding space, which has dimensionality eL = 4096.
We train several models with different E's to determine the compatibility between encodings from E and the LM. We also test how the choice of E influences performance on this task, specifically with regard to the degree of linguistic supervision E saw in pretraining, as described in Section 1.
¹We avoid specifying images or text in our backronym because one could linearly map between any two representation spaces of any modalities (e.g., video-to-text or text-to-text).
From E we extract an image encoding of dimensionality hI representing the image. We then project that encoding to a sequence of k soft prompts of dimensionality eL, which we hereafter refer to as image prompts. k is determined by the architecture of E. For example, for consistency with the MAGMA model, we use the 12×12×3072 feature map before pooling from CLIP, which we flatten to k = 12 × 12 = 144. The encoders we experiment with are (1) CLIP RN50x16 (Radford et al., 2021),
k = 144, hI = 3072. Because CLIP is trained to learn multimodal image-text embeddings, we expect that it will be easier to learn a projection into language space from it than from a vision-only encoder. (2) NFRN50 (Brock et al., 2021), k = 2, hI = 2048. We train three variants using NF-
Resnet50: one pretrained and frozen during caption training (NFRN50), one tuned during caption
training (NFRN50 Tuned; note that the LM is still frozen), and one randomly initialized (NFRN50
Random). The NFRN50 models are pretrained on an image classification task on data that is labeled
according to WordNet hypo/hypernym structure. This signal trains the model to separate object
classes according to these words. For this reason, we consider it to have indirect access to linguistic
supervision. (3) BEIT-Large (Bao et al., 2021), k = 196, hI = 1024. BEIT is pretrained using
a self-supervised masked visual token modeling task and does not have access to any labeled data
which may give the model an inductive bias towards a linguistic structure. We use the 16-pixel patch
version that was pretrained only on ImageNet22k. We additionally test two variants of this model:
BEIT Random, which is randomly initialized, and BEIT FT, which was pretrained on the same task and
then finetuned for image classification on the same dataset. We use this model to show that it is
indeed the linguistic supervision of the pretraining objective which induces better performance in
the captioning task.
Following the MAGMA and Frozen models (Eichenberg et al., 2021; Tsimpoukelli et al., 2021),
we train a projection on an image captioning task so that we can learn to align the representation
spaces of E and the LM. All models are trained with the same basic hyperparameters and settings
as described in the MAGMA paper (see Appendix A for details) on the Conceptual Captions 3M
dataset (CC3M, Sharma et al. (2018)) for 15,000 training steps.
Baselines As baselines, we use NFRN50 Random, NFRN50 Tuned, and train our own instance of
MAGMAbase . Please note that NFRN50 Tuned is a stand-in for the Frozen model: it is architec-
turally the same, but differs in that we use the hyperparameters used to train the MAGMA model.
NFRN50 Random allows us to test the efficacy of LiMBeR when the image encoder backbone has
not learned any useful visual features. The MAGMA we train uses the CLIP RN50x16 image en-
coder (Radford et al., 2021), GPT-J as the LM, and adapters in sequence in the attention blocks with
a downsample factor of 4.
3.2 LIMITATIONS
Due to computational constraints, we did not control for the prompt length (k) for each image
encoder. Tsimpoukelli et al. (2021) experiment with a small range for the value of k for the Frozen
model and show that while there are some differences, k is mostly a factor in hyperparameter tuning
and should not strongly affect the comparison between models. We use much higher values of k for
CLIP and BEIT, and this therefore is not strongly controlled for in our study.
We consider LM runoff another potential confound. In some cases, if the LM recognizes and gen-
erates a relevant word for one concept (e.g., “the beach”), it might continue generating relevant
information due to a strong linguistic prior for that info showing up (e.g., “building a sandcastle”),
giving the illusion it is recognizing every element in an image (even if it never saw “the sandcastle”).
Regardless, the scope of this problem is very limited, and across multiple large datasets our results show that image information is indeed recovered, even if its full and precise extent is impossible to know. We also include a ‘blind’ model in the visual question answering analysis to further control for this.
[Figure 2 panels. Image captioning — left image: CLIP: “a giraffe in the lobby of the building”; NFRN50: “the giraffe in the zoo.”; BEIT: “a peacock in the garden”; NFRN50 Random: “a man and a woman in a field of flowers”. Right image: CLIP: “tennis player in action”; NFRN50: “tennis player at the tennis tournament.”; BEIT: “tennis player during a tennis match.”; NFRN50 Random: “the new logo for the team”. Visual question answering — CLIP: “He is surfing a wave.” / “A tennis racket”.]
Figure 2: Curated examples of captioning and zero-shot VQA illustrating the ability of each image encoder to transfer information to the LM without tuning either model. These examples also illustrate a common failure mode of BEIT prompts: sometimes generating incorrect but conceptually related captions/answers.
We first verify that image representations that are linearly projected into the input space of the LM
carry semantic information about the content of the image that the LM can make sense of. Since we
only tune a single projection between the image encoder and text decoder, the prompt tokens in the
LM are equivalent to the image representation up to that linear transformation. If LMs are learning a
conceptual space that reflects that of the non-linguistic, purely visually grounded space of the image
encoder, the LM should be able to capture the image information and describe it in text.
Data We evaluate on image prompts generated by each image encoder on multiple image captioning
datasets: MSCOCO (Lin et al., 2014) and NoCaps (Agrawal et al., 2019), as well as the VQA2
(Goyal et al., 2017) visual question-answering dataset. Following convention from SimVLM and
MAGMA, we input the prefix “A picture of” after every image to prompt the model. Like in previous
work, we find that this is a favorable prompt which tends to increase performance.
Metrics For image captioning, we report CIDEr-D (Vedantam et al., 2015), CLIPScore, and RefCLIPScore (Hessel et al., 2021). CIDEr-D rewards generating accurate words which are more likely to be visually informative, and CLIPScore evaluates the similarity between an image and a caption without references, which lets us give credit for captions that vary greatly from the ground truth but are similar in semantic content (e.g., describing a pool as a lake). We report additional captioning metrics in Appendix B. For visual question answering, we follow the few-shot procedure used in Eichenberg et al. (2021), in which we prompt the models with the “[image] Q: [q] A:” format. We take the first word of the generation and, as in the MAGMA paper, truncate to the length of the longest ground truth answer. We also use the normalization procedure and accuracy metric described in the VQA repo.²
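As an illustration of this prompting format, the sketch below assembles the text portion of an n-shot VQA prompt; the helper name is ours, and in the actual pipeline each “[image]” is a sequence of projected soft-prompt embeddings prepended to the text rather than a string.

```python
def build_vqa_text(question: str, shots: list[tuple[str, str]]) -> str:
    """Builds the text portion of an n-shot VQA prompt.

    Each in-context shot is a (question, answer) pair; the final question is left
    unanswered so the LM completes the text after "A:". In the full pipeline, every
    shot is additionally preceded by that example's projected image soft prompts.
    """
    parts = [f"Q: {q} A: {a}" for q, a in shots]
    parts.append(f"Q: {question} A:")
    return " ".join(parts)

# Example: 1-shot prompt text (image embeddings omitted here)
print(build_vqa_text("What is the man holding?", [("What sport is this?", "tennis")]))
# -> "Q: What sport is this? A: tennis Q: What is the man holding? A:"
```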
Results Our main results can be seen in Table 1. As evidenced by comparing MAGMA and CLIP,
and NFRN50 tuned and frozen, we find that there is relatively little benefit in training parameters in
either encoder or decoder. Note that the MAGMA model we implemented is identical to the frozen
CLIP model, with the exceptions that MAGMA tunes the image encoder and LM. On captioning
and VQA tasks, performance of the jointly-tuned models (MAGMA, NFRN50 Tuned) is not con-
sistently better, and is often worse, than just training the projection with frozen models. This trend
persists across over 10 automatic captioning metrics, which are described in Appendix B. Our results
indicate that there is in fact a relationship between the linguistic supervision of the pretraining task
²https://github.com/GT-Vision-Lab/VQA
VQA n-shots         0      1      2      4
Blind               20.60  35.11  36.17  36.99
NFRN50 Tuned        27.15  37.47  38.48  39.18
MAGMA (ours)        24.62  39.27  40.58  41.51
MAGMA (reported)    32.7   40.2   42.5   43.8
NFRN50 Random       25.34  36.15  36.79  37.43
BEIT                24.92  34.35  34.70  31.72
NFRN50              27.63  37.51  38.58  39.17
CLIP                33.33  39.93  40.82  40.34
Table 1: Captioning performance and Visual Question Answering (VQA) accuracy for all variations of model architecture and image encoder used. On captioning, we see a consistent increasing trend in performance that correlates with an increase in linguistic supervision. However, BEIT (the only vision-only model) performs far above a randomly initialized NFRN50 model and is on par with the other models on CLIPScore (CLIP-S) and RefCLIPScore (Ref-S) (Hessel et al., 2021). We see that BEIT performs at the level of our random baselines on VQA, suggesting a deficiency in relating visual information to more complex visual-linguistic reasoning tasks.
and performance on transferring to the LM. That is, CLIP outperforms NFRN50, which outperforms
BEIT. To confirm this, we apply LiMBeR to a BEIT model finetuned on image classification (BEIT
FT.), and find that this model improves performance drastically, even outperforming CLIP on No-
Caps, and improving over BEIT on all metrics, including CLIPScore by 9-10 points. This suggests that linguistic supervision in the pretraining task, rather than architecture, is the key factor for successful transfer.
Notably, we find that even vanilla BEIT, which has no linguistic supervision in pretraining, still
transfers well to the LM for captioning, far outperforming random NFRN50 across the board, which
had no pretraining to learn visual features. We do find that BEIT captions use vaguer language,
and/or semantically related-but-incorrect descriptions of objects (Figure 2; more examples in Ap-
pendix B). We see this reflected in the CLIPScores of the captions as well, which reward semantic
similarity rather than precise lexical overlap with a reference caption. BEIT captions score 62 and
63.6 for NoCaps and COCO respectively; on average only 4.5 points behind NFRN50 but 14.3 ahead
of random NFRN50. Perhaps we see the greatest failure of BEIT prompts in the inability to transfer
details that the LM can use to answer questions about images (At 4-shot VQA, BEIT scores 31.72%
while a ‘blind’ LM with no image input scores 36.99%). We hypothesize this is because BEIT rep-
resentations do not encode visual information that corresponds well to lexical categories. In Section
5, we provide evidence in favor of this hypothesis, and investigate the granularity of the details that prompts from each frozen encoder transfer to the LM.
Figure 3: On average, recall of nouns in generated captions follows the standard pattern (CLIP > NFRN50 > BEIT). However, judging by Wu-Palmer similarity, BEIT performs nearly as well as or better than NFRN50 and CLIP on 4/5 of the noun categories. This indicates that although BEIT struggles to transfer the exact correct concept, it is transferring a related one based on visual similarity. On the right we show this effect for individual vehicle words. BEIT may have never learned to distinguish the ‘bus’ concept, but the LM still understands to generate a highly related concept, i.e., another vehicle. Average random Wu-Palmer similarity is consistently around 0.4.
Following that, in Section 5.3 we focus on mistakes the models make: when the LM generates a
bad caption, does it generate a caption that describes entities with similar visual properties? For
example, a caption generated from an image of a “small”, “woodland”, and “furry” animal might
not mention the actual animal depicted (e.g., a squirrel); but does it instead mention a different
but similar furry animal (e.g., a rabbit)? We find that only linguistically informed image encoders
(NFRN50, CLIP) tend to strongly encode concepts aligning to lexical categories, but all pretrained
models including BEIT encode property information approximately equally well, and far better than
a randomly initialized image encoder baseline.
Using the COCO validation set, we count the top 50 nouns, modifiers (e.g., adjectives), and relations
(e.g., verbs, prepositional phrases) that appear in the ground truth captions and calculate how often
they appear in the generated captions that were used to calculate the scores in Table 1.
Metrics We calculate the precision/recall/F1 for each word, broken down along conceptual cate-
gories. To test our hypothesis that BEIT transfers coarser information, we also report the Wu-Palmer
similarity (Wup) (Wu & Palmer, 1994) between the ground truth word and the most similar word in
the generated caption. The Wup score works by calculating the distance between the ground truth
word and the generated word in the WordNet taxonomy, offering a way to measure ‘how close’ a
word was to the correct answer.
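For reference, a small sketch of how such a Wu-Palmer score can be computed with NLTK's WordNet interface; taking the maximum similarity over noun synsets and caption words is our assumption about the bookkeeping, not necessarily the exact procedure used.

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def wup_score(gt_word: str, caption_words: list[str]) -> float:
    """Wu-Palmer similarity between a ground-truth noun and the closest word in a caption."""
    gt_synsets = wn.synsets(gt_word, pos=wn.NOUN)
    best = 0.0
    for word in caption_words:
        for s1 in gt_synsets:
            for s2 in wn.synsets(word, pos=wn.NOUN):
                sim = s1.wup_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
    return best

# e.g. a caption that says "truck" when the ground-truth noun is "bus": high, but below 1.0
print(wup_score("bus", ["a", "truck", "on", "the", "road"]))
```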
Results In Figure 3, we show that BEIT’s recall for nouns in categories like ‘people’, ‘environment’,
‘vehicles’, and ‘objects’ is lower than for NFRN50 or CLIP, but is comparable in terms of Wup similarity in many categories. Unlike NFRN50 and CLIP’s pretraining, BEIT’s pretraining does not
encourage it to learn conceptual differences between two similar looking objects that use different
words. Compared to prompts from a randomly initialized NFRN50, for which very few consistent
patterns emerge, the LM can still extract the broad conceptual meaning behind BEIT prompts, as
evidenced by high Wup similarity (and CLIPScore results in Table 1). We interpret these results as
supporting the hypothesis that BEIT prompts transfer conceptual information from the purely visual
to purely text space, but only in terms of coarse-grained conceptual information corresponding to
visual properties, not lexical categories. Our full analysis, including additional metrics and results
for each individual word from the top 50 nouns, modifiers, and relations can be found in Appendix
B.
5.2 PROBING
To rule out the possibility that BEIT representations are encoding lexical concept information, but
are merely unable to linearly transfer it to the LM due to representational differences, we train linear
probes on several datasets for image classification. We find that BEIT typically does not encode fine-
grained information as well as NFRN50 or CLIP, though it far outperforms the randomly initialized
NFRN50 baseline. We discuss training details and results in Appendix E.
To better understand what BEIT encodes, if not word category information, we further investigate
where errors arise, and how the structures of the embedding spaces for each frozen image encoder
differ. For the sake of this analysis, we constrain the task to generating captions for pictures of
animals. The reason for this narrower scope is that the captions are easier to analyze: the caption
describing a picture of an animal should virtually always mention the name of that animal, and the
word used to describe the animal is mostly unambiguous.
Figure 4: (a) Left: Wu-Palmer similarity for captions in which the models do not mention the correct animal shows that BEIT, NFRN50, and CLIP are all similarly close, meaning that even when they predict the wrong animal, it is on average very taxonomically similar. Right: when a model mistakes one animal for another in the dataset, how similar are the AWA properties of the true animal and the animal it is most often mistaken for? The average number of overlapping properties shows that animals predicted from BEIT prompts are at least as similar to the real animal as those from NFRN50 and CLIP. The median is shown as the solid orange line, while the dashed green line shows the mean. (b) UMAP projections of AWA images: while NFRN50 and CLIP cluster tightly along lexical categories (color coded by animal), BEIT clusters most distinctly along animals that live in water/the ocean; the randomly initialized NFRN50 encodings mostly overlap in one cluster.
Data For this task we use the Animals With Attributes 2 (AWA) dataset (Xian et al., 2019) which
contains 37k total images covering 50 animal classes. Each animal class also comes with annota-
tions for 85 properties describing the animals (e.g., ‘claws’, ‘stripes’, ‘jungle’), which allow us to
analyze if prompts from certain encoders consistently make mistakes along any of these dimensions.
Metrics When an image prompt produces a caption, we can measure the similarity of any animals
mentioned to the WordNet synset of the ground truth animal label. We can also measure similarity
using the annotated properties provided by the AWA dataset. For a given animal (e.g., “squirrel”),
we can look at the other animal in the dataset that it is most often mistaken for (e.g., “rabbit”) and
compare the proportion of properties that they share.
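A minimal sketch of the property-overlap comparison, assuming the AWA annotations are loaded as a binary animal-by-property matrix; normalizing by the true animal's property count is one reasonable choice and may differ from the exact computation used.

```python
import numpy as np

def property_overlap(attr: np.ndarray, idx_true: int, idx_pred: int) -> float:
    """Fraction of the true animal's AWA properties that the predicted animal shares.

    attr is a binary matrix of shape (num_animals, num_properties), e.g. (50, 85).
    """
    true_props = attr[idx_true].astype(bool)
    pred_props = attr[idx_pred].astype(bool)
    shared = np.logical_and(true_props, pred_props).sum()
    return shared / max(true_props.sum(), 1)

# Toy 3-animal, 4-property matrix (stand-in for the real AWA annotations)
attr = np.array([[1, 1, 0, 1],   # "squirrel"
                 [1, 1, 0, 0],   # "rabbit"
                 [0, 0, 1, 1]])  # "dolphin"
print(property_overlap(attr, idx_true=0, idx_pred=1))  # 2/3 of the squirrel's properties shared
```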
Results We generate captions for each image using prompts from each frozen image encoder. We
consider a caption to be ‘correct’ if it contains the name of the animal the image depicts. CLIP and
NFRN50 are correct most often: 59% and 43% of the time respectively. BEIT and the randomly
initialized NFRN50 only achieve 13% and 0.4% accuracy, respectively. This aligns with previous
observations that BEIT struggles with encoding fine-grained lexical level concepts. By looking at
failure cases for each model, we can establish whether each model is predicting the presence of a
similar animal or not. In Figure 4a, we show that when captions generated from each model mistake
one animal for another, the mistaken animals are highly similar to the ground truth animal when
measuring both Wu-Palmer similarity (Averages: BEIT: 0.8, NFRN50: 0.81, CLIP: 0.8) and overlap
of AWA properties (Averages: BEIT: 0.62, NFRN50: 0.68, CLIP: 0.59). Although BEIT prompts
do not transfer the exact animal concept to the LM, the coarse grained perceptual information is
transferred and ‘understood’ by the LM. In Figure 4b we create UMAP projections of the encodings
for each image in AWA and indeed find that NFRN50 and CLIP cluster according to tight lexical
categories (the animal types), while BEIT clusters most tightly by perceptual features, such as habitat,
having flippers, etc.
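For reference, the clustering visualization in Figure 4b can be reproduced along these lines with the umap-learn package; the plotting details and variable names here are ours.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# encodings: (num_images, h_I) frozen image-encoder features for AWA images
# labels:    integer animal class per image, used only for coloring the scatter plot
encodings = np.random.randn(500, 2048)        # stand-in for real NFRN50/BEIT/CLIP features
labels = np.random.randint(0, 50, size=500)   # stand-in for the 50 AWA animal classes

reducer = umap.UMAP(n_components=2, random_state=0)
points = reducer.fit_transform(encodings)     # (num_images, 2)

plt.scatter(points[:, 0], points[:, 1], c=labels, s=4, cmap="tab20")
plt.title("UMAP of image encodings, colored by animal class")
plt.savefig("awa_umap.png")
```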
7 CONCLUSION
In this paper, we test how similar pretrained image and text representations are by training a lin-
ear map to project image representations into the input space of a language model (LM), such that
the LM can accurately describe the contents of the images. We show that models trained through
LiMBeR (Linearly Mapping Between Representation spaces) are competitive on image captioning
and visual question answering benchmarks with similar models like MAGMA that tune both image
and text networks. However, we also find that such transfer is highly dependant on the amount of
linguistic supervision the image encoder backbone had during its pretraining phase. BEIT, which
is a vision-only image encoder underperforms compared to a Resnet model trained on image clas-
sification, which in turn underperforms compared to CLIP, which was pretrained with natural lan-
guage captions. We explore what conceptual information transfers successfully, and find through
analysis of generated text, clustering, and probing that the representational similarity between LMs
and vision-only image representations is mostly restricted to coarse-grained concepts of perceptual
features, while linguistically supervised vision models can transfer lexical concepts. Our findings
indicate that LMs and vision models learn conceptually similar representation spaces, such that a
minimal linear transformation is an adequate approximation for transferring information about an
image. The extent of this representational similarity is not well understood and is an interesting
direction for future work.
8 REPRODUCIBILITY STATEMENT
We are committed to making all of our results reproducible. Code for training all models, as well as the weights of the linear projections trained for LiMBeR, can be found here: https://github.com/jmerullo/limber.
Because we froze the image and text models attached to the projection, the weights can be used to
quickly reproduce our results with the corresponding off-the-shelf pretrained models with no other
tuning necessary. We use the default data splits for all datasets we used and release the random seeds
used for all tasks that require generation from the LM in our codebase as well. For the AWA tasks
that require matching Wordnet synsets, we document the exact animal synsets that we used as the
‘ground truth’ for the animal labels in Appendix D, Table 6.
9 ACKNOWLEDGMENTS
We would like to thank Aaron Traylor and Nihal Nayak for thoughtful discussions and feedback on
this work, as well as StabilityAI for donating compute resources for model training. This research
is supported in part by ODNI and IARPA via the BETTER program (2019-19051600004). The
views and conclusions contained herein are those of the authors and should not be interpreted as
necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the
U.S. Government.
REFERENCES
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra,
Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948–8957,
2019.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language
model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural
representations. Advances in Neural Information Processing Systems, 34:225–236, 2021.
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers.
In International Conference on Learning Representations, 2021.
Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and under-
standing in the age of data. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 5185–5198, Online, July 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.acl-main.463. URL https://aclanthology.org/
2020.acl-main.463.
Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale
image recognition without normalization. In International Conference on Machine Learning, pp.
1059–1071. PMLR, 2021.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations.
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.
11162–11173, 2021.
Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig,
and Margaret Mitchell. Language models for image captioning: The quirks and what works. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the
7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers),
pp. 100–105, 2015.
Constantin Eichenberg, Sid Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank.
Magma - multimodal augmentation of generative models through adapter-based finetuning.
ArXiv, abs/2112.05253, 2021.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V
in VQA matter: Elevating the role of image understanding in Visual Question Answering. In
Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Liangke Gui, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. Training vision-
language transformers from captions alone. arXiv preprint arXiv:2205.09256, 2022.
Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, and
Furu Wei. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336,
2022.
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: a
reference-free evaluation metric for image captioning. In EMNLP, 2021.
Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neu-
ral network representations revisited. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.),
Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceed-
ings of Machine Learning Research, pp. 3519–3529. PMLR, 09–15 Jun 2019. URL https:
//proceedings.mlr.press/v97/kornblith19a.html.
Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis-
connecting the branches of systems neuroscience. Frontiers in systems neuroscience, pp. 4, 2008.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equiv-
ariance and equivalence. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 991–999, 2015.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient
prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Lan-
guage Processing, pp. 3045–3059, Online and Punta Cana, Dominican Republic, November
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL
https://aclanthology.org/2021.emnlp-main.243.
Belinda Z Li, Maxwell Nye, and Jacob Andreas. Implicit representations of meaning in neural lan-
guage models. In Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers), pp. 1813–1827, 2021.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European
conference on computer vision, pp. 740–755. Springer, 2014.
Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, and Lorenzo Torresani.
Vx2text: End-to-end learning of video-based text generation from multimodal inputs. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7005–7015,
2021.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Confer-
ence on Learning Representations, 2018.
Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as univer-
sal computation engines. CoRR, abs/2103.05247, 2021. URL https://arxiv.org/abs/
2103.05247.
Ziyang Luo, Yadong Xi, Rongsheng Zhang, and Jing Ma. A frustratingly simple approach for end-
to-end image captioning, 2022. URL https://arxiv.org/abs/2201.12723.
George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):
39–41, 1995.
Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv
preprint arXiv:2111.09734, 2021.
Cory Paik, Stéphane Aroca-Ouellette, Alessandro Roncone, and Katharina Kann. The world of an
octopus: How reporting bias influences a language model’s perception of color. In Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 823–835, 2021.
Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In Interna-
tional Conference on Learning Representations, 2022. URL https://openreview.net/
forum?id=gJcEM8sxHK.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International Conference on Machine Learning,
pp. 8748–8763. PMLR, 2021.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.
ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, and Patrick Gallinari. What
bert sees: Cross-modal transfer for visual question generation. In Proceedings of the 13th Inter-
national Conference on Natural Language Generation, pp. 327–337, 2020.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
2556–2565, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:
10.18653/v1/P18-1238. URL https://aclanthology.org/P18-1238.
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei
Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? In International
Conference on Learning Representations, 2021.
Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for
vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 5227–5237, 2022.
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill.
Multimodal few-shot learning with frozen language models. In A. Beygelzimer, Y. Dauphin,
P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems,
2021. URL https://openreview.net/forum?id=WtmMyno9Tq2.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image
description evaluation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 4566–4575, 2015.
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language
Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang,
Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are
strong few-shot video-language learners. arXiv preprint arXiv:2205.10747, 2022.
Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd
annual meeting on Association for Computational Linguistics, pp. 133–138, 1994.
Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a com-
prehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 41(9):2251–2265, 2019. doi: 10.1109/TPAMI.2018.2857768.
Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, and Michael Zeng. Visual
clues: Bridging vision and language foundations for image paragraph captioning. arXiv preprint
arXiv:2206.01843, 2022.
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu,
Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer
vision.
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Puro-
hit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models:
Composing zero-shot multimodal reasoning with language. 2022.
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov,
and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022.
A LiMBeR TRAINING DETAILS
We mimic the MAGMA pretraining process for each of our models as outlined in Eichenberg et al.
(2021). As described above, the training task is caption generation. For each training example,
an image and caption pair (x, y) is fed into the model. The image encoder E encodes the image into a sequence i1, ..., ik of dimensionality hI and length k. For the CLIP encoder, for example, we extract a (12, 12, 3072) feature map, which we reshape to (144, 3072) and feed into the projection layer P. The output of P is fed into the language model LM as tokens representing the image. The caption y is tokenized as t1, ..., tm, where m is the variable length of the caption. LM is given the encoded image tokens and, starting with t1, is trained to minimize the negative log probability of token ti at each timestep i conditioned on i1, ..., ik and t1, ..., ti−1.
During training, we minimize the loss with the AdamW (Loshchilov & Hutter, 2018) optimizer per
mini-batch, with the help of ZeRO stage 2 (Rajbhandari et al., 2019). We use a dropout probability
of 0.1, a weight decay of 0, betas = (0.9, 0.95), and gradient clipping = 1.0. All models are trained
for 15,000 training steps across 16 A100 GPUs for approximately 1.75 days. Our effective batch
size was 2048. We use a learning rate of 8 × 10−4 for the projection layer P. For models where we tune E as well, we tune its parameters with a learning rate of 2 × 10−6.
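A condensed sketch of one step of this objective is shown below; only the projection P carries gradients, and the Hugging Face-style inputs_embeds interface and helper names are assumptions for illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def caption_loss(lm, embed_tokens, projection, image_features, caption_ids):
    """Next-token cross-entropy on the caption, conditioned on the projected image prompts.

    lm:             frozen causal LM that accepts precomputed input embeddings (assumed interface)
    embed_tokens:   the LM's frozen token-embedding layer
    projection:     the trainable linear map P
    image_features: (batch, k, h_I) output of the frozen encoder E
    caption_ids:    (batch, m) tokenized caption t_1..t_m
    """
    image_prompts = projection(image_features)                 # (batch, k, e_L)
    caption_embeds = embed_tokens(caption_ids)                 # (batch, m, e_L)
    inputs = torch.cat([image_prompts, caption_embeds], dim=1)
    logits = lm(inputs_embeds=inputs).logits                   # (batch, k + m, vocab)
    k = image_prompts.size(1)
    # The logit at position k-1+i predicts caption token t_i, so drop the first k-1
    # positions and the final position, then score against the caption tokens.
    pred = logits[:, k - 1:-1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))
```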
Table 2: Summary of the image encoders used for pretraining. Prompt length refers to the number of tokens representing the image that are fed into the language model.
B CAPTIONING PERFORMANCE
To get a better idea of the kinds of captions LiMBeR produces, see Figure 6, which includes 15
images that were randomly selected from the COCO validation set (2017) and the generated cap-
tions for all models we test. We include a greater range of captioning metric results in Table 3 for the COCO dataset. Overall, we find the same trends that we report in the main paper:
(1) the greater the amount of linguistic supervision an image encoder has, the better its captioning performance; and (2) unfreezing the image encoder does not seem to lead to consistently significant improvements. We also include the breakdown of the SPICE metric across its associated subcategories, such as relations, attributes, and objects. Of the LiMBeR models, we find that CLIP-based models do the best across the board (12.1 for CLIP vs. 9.28 for NFRN50). Besides the random baseline, BEIT performs the worst overall except in the color category (0.45 vs. 0.42 for NFRN50). We also include heatmaps with recall/precision/F1/Wu-Palmer similarity metrics comparing the captions generated by each model and the top 50 nouns (objects), modifiers, and relations from the ground truth captions of the COCO validation set (Figures 7, 8, 9).
Figure 5: RSA similarity scores between representations of images of animals and the ground truth captions describing them. GPT-J has very low correlation with the image encoder representations.
We also use COCO images and captions to measure the similarity between vision and text encodings
of the same concepts. That is, if there is a structural similarity between image and text encodings,
do typical representational similarity metrics reflect this? For this experiment we want to encode a
large number of images and captions that depict a set of concepts, and compare the relative repre-
sentational similarity between the concepts for the image and text representations. The intuition is
that if image and text models are representing concepts in a similar structure, then two concepts that
are represented similarly in an image model should be similar within text representations as well.
We sample a small subset of images depicting 10 different animals from the COCO dataset, and
encode each image with each of our four image encoders (without a LiMBeR projection), and the
ground truth captions with GPT-J. We use the last hidden state of the last token as the representation
of the caption. We choose animals because it is easy for humans to intuitively compare how similar
two animals are. In total, we collect 939 images and captions for each animal class and compare the
representational similarity of the subset of encodings using the representational similarity analysis
(RSA) technique from Kriegeskorte et al. (2008). The similarity matrix between each representation
for each set is calculated, and then the upper triangular matrix for each similarity matrix is used to
calculate the Pearson correlation. The results of this can be seen in Figure 5. We zero out the diago-
nal for visibility of the other values, since the diagonal is always equal to 1. RSA does not seem to
capture similarity in a way that reflects the transfer performance to the LM. For example, similarity
between the BEIT and randomly initialized NFRN50 models is unusually high given the differences in
performance in our experiments. Further analysis on the geometry of these representation spaces is
required to make stronger statements about how (dis)similar they are. We leave this for future work.
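For completeness, a sketch of the RSA computation described above; we assume cosine similarity for the per-model similarity matrices, which the text does not specify.

```python
import numpy as np
from scipy.stats import pearsonr

def rsa(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    """RSA score between two sets of representations of the same n items.

    reps_a: (n, d_a), reps_b: (n, d_b). Builds an n x n similarity matrix per model
    (cosine similarity assumed here), then correlates their strict upper triangles.
    """
    def sim_matrix(x: np.ndarray) -> np.ndarray:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    iu = np.triu_indices(reps_a.shape[0], k=1)
    return pearsonr(sim_matrix(reps_a)[iu], sim_matrix(reps_b)[iu])[0]

# e.g. compare image encodings with GPT-J caption encodings of the same 100 items
image_reps = np.random.randn(100, 1024)   # stand-ins for real encodings
text_reps = np.random.randn(100, 4096)
print(rsa(image_reps, text_reps))
```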
Model             BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE  CIDEr  SPICE  CLIPScore  RefCLIPScore
NFRN50 Tuned      0.375   0.229   0.134   0.080   0.127   0.316  0.353  0.091  0.697      0.748
MAGMA (Ours)      0.309   0.216   0.145   0.097   0.146   0.341  0.475  0.110  0.753      0.796
MAGMA (Released)  0.432   0.300   0.203   0.137   0.159   0.376  0.521  0.117  0.767      0.794
BEIT Random       0.288   0.132   0.052   0.022   0.068   0.232  0.052  0.022  0.488      0.562
NFRN50 Random     0.261   0.115   0.044   0.018   0.062   0.208  0.048  0.021  0.495      0.571
BEIT              0.319   0.187   0.105   0.060   0.106   0.299  0.223  0.064  0.636      0.697
NFRN50            0.409   0.254   0.149   0.088   0.132   0.334  0.362  0.093  0.689      0.741
BEIT FT.          0.420   0.283   0.182   0.116   0.155   0.367  0.510  0.116  0.742      0.789
CLIP              0.400   0.278   0.187   0.126   0.161   0.376  0.549  0.121  0.762      0.804
Table 3: All of the caption metrics for the models. For all scores, higher is better.
Table 4: F-scores (x100) for each fine-grained category of the SPICE metric, evaluated on the 2017
COCO validation dataset. The top and bottom divide separates models where the image encoder is
either tuned or frozen, respectively. Models that use CLIP as their image encoder show a large jump
in improvement over other models (even compared to tuned ResNet), especially in the Attributes
(e.g. adjectives) and Object (e.g. nouns) categories.
NFRN50 is better at questions that require counting objects in an image. Future work is needed to
determine if this is just noise or a significant trend.
Table 5: Average 4-shot accuracies of models on every question type from the VQA2.0 dataset.
Note that NFRN50, NFRN Random, and NFRN Tuned are renamed to save space. MAGMA refers
to our version of the model.
Table 6: To aid with reproducibility, we report all animal synsets that were used for experiments
that require disambiguating words in captions to animal classes. This allows us to correctly count a
mention of “tigress” in a caption as a mention of the “tiger” animal type without relying on unreliable
string matching techniques.
Table 7: Wu-Palmer Similarity for mistakes. Animals for which a model made fewer than 50 mis-
takes are dashed out.
Table 8: Per-animal average precision (AP) for properties of the animal mentioned in captions, per model, for the Animals with Attributes 2 (AWA2) dataset. BEIT, which tends to do worse at captioning and question answering, consistently predicts animals which share similar properties, and does so considerably better than the randomly initialized NFRN50 and random-animal baselines. This suggests that BEIT representations encode similar animals into broad conceptual categories (e.g., large savanna animals), which are able to linearly transfer to the LM. Without linguistic supervision, BEIT does not naturally distinguish these by the words we use to describe them, as NFRN50 and CLIP do.
Table 9: Accuracy for each model and each animal class in Animals with Attributes 2. A caption is
considered correct if the animal name is mentioned in the caption for the image.
We train probes on the image encodings from each of our image encoders to classify fine-grained
lexical and coarse grained categorical concepts on several datasets: COCO (Lin et al., 2014), and
CC3M (Sharma et al., 2018), and CIFAR-100 (Krizhevsky et al., 2009). The architecture is a single
linear layer which takes an image encoding of dimension hI (see Table 2) and projects to the number
of classes for the classification task. For single label classification, we use a softmax activation on
the logits and train with cross entropy as the loss function. For multilabel classification tasks, we
use a sigmoid activation layer on top of the logits and train with a binary cross entropy loss function.
We consider a certain class predicted if the value of the class after the sigmoid is > 0.5.
Hyperparameters For simplicity, all probes are trained with the same hyperparameters (with a
few exceptions for the CC3M probes): learning rate: 1e-4, optimizer: AdamW (Loshchilov &
Hutter, 2018), betas=(0.9, 0.999), batch size: 48 for CC3M probes; 32 for all others, max epochs:
300 for CC3M probes; 300 for all others.
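A minimal sketch of the multilabel probe described above (a single linear layer trained with sigmoid + binary cross-entropy and a 0.5 decision threshold); the class and variable names are ours.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Single linear layer from a frozen image encoding to per-class logits."""
    def __init__(self, h_i: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(h_i, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

probe = LinearProbe(h_i=1024, num_classes=80)        # e.g. BEIT features -> 80 COCO objects
criterion = nn.BCEWithLogitsLoss()                   # sigmoid + binary cross-entropy in one op
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4, betas=(0.9, 0.999))

features = torch.randn(32, 1024)                     # stand-in for frozen encoder outputs
targets = torch.randint(0, 2, (32, 80)).float()      # multi-hot object labels (stand-in)

optimizer.zero_grad()
loss = criterion(probe(features), targets)
loss.backward()
optimizer.step()

predictions = torch.sigmoid(probe(features)) > 0.5   # a class counts as predicted above 0.5
```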
The COCO dataset contains labels for 80 objects in images, typically used for object detection or
segmentation tasks. Because one image can be labeled for multiple objects, we train the probe as a
multi-label classification task.
In addition to having fine-grained labels for each type, COCO provides labels for each broad cate-
gory that each object belongs to, called the supercategory. In Table 10, we show which supercate-
gory each object label falls under.
We train a multilabel classification probe to classify the object category seen in a given image. We
report F1 for each LiMBeR model in Figure 11. We find that BEIT does a bit worse than NFRN50
and CLIP overall, but is able to classify some categories (images with accessories, vehicles, and
‘outdoor’ objects) well.
Next, we look at the probe results for probes trained to identify individual objects by type. Our
results can be found in Figure 12d. We find the same pattern emerges, and that BEIT does not seem to be significantly closer to the other pretrained models in terms of F1 on the coarse-grained vs. the fine-grained labels, as we might have expected. However, we do show that BEIT encodes lexical concept categories fairly strongly, though more weakly than the other two models, and that the finding that BEIT transfers coarser-grained information is not due to irreconcilable representational differences between BEIT space and the LM space.
We also train the same set of probes on image data from CC3M, but evaluate on the same validation
set from COCO (i.e., the same evaluation as used in Section E.1.2). The purpose of this experiment
is to create a setting for a linear probe that better matches the LiMBeR setup we use in the main
paper. If there are concepts that the probe has no trouble with, but rarely appear in captions, that
could be an indicator that the LM and the image encoder represent that concept very differently in
representation space.
Data To align the CC3M images with the object labels in COCO, we create labels by looking
for exact string matches of the object label words (e.g. “teddy bear”) in CC3M captions. Of the
80 object classes, we cut out any that have fewer than 1000 images. This leaves us with 782,794
training images for the probes across 53 object classes; fewer than the CC3M dataset used to train
LiMBeR, but far more than the previous datasets we used for probing.
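A sketch of this label-construction step; the exact text preprocessing (casing, plural handling) is not specified, so this shows only the exact-substring-match idea with hypothetical helper names.

```python
from collections import defaultdict

def build_labels(captions: dict[str, str], object_words: list[str], min_images: int = 1000):
    """Assigns COCO object labels to CC3M images by exact string match on the caption."""
    image_ids_per_label = defaultdict(set)
    for image_id, caption in captions.items():
        text = caption.lower()
        for obj in object_words:
            if obj in text:                      # exact substring match, e.g. "teddy bear"
                image_ids_per_label[obj].add(image_id)
    # Drop any object class with fewer than `min_images` matched images.
    return {obj: ids for obj, ids in image_ids_per_label.items() if len(ids) >= min_images}
```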
Results Because, in this setting, we train our probes on the same distribution as the LiMBeR
models, we compare the F1 of the probes identifying objects in images to the F1 of those concepts
appearing in the generated captions. Our results can be seen in Figure 13. It appears that if the image
encoder encodes the lexical concept, it generally also transfers to the LM with LiMBeR. Limitations
of this approach are that (1) the BEIT probe appears much worse at the domain shift from CC3M to
COCO (e.g., F1 for animals drops 0.3 compared to when the probes are trained on COCO) and (2)
some words in the label space are often substituted for more common words in generated captions
(e.g. “person” could be generated as “man”, “woman”, etc.). This makes it difficult to recognize
cases where the probe succeeds but the transfer fails. An interesting problem for future work is
better understanding which concepts are encoded in an image encoder’s representations, but do not
transfer well with a linear map to the LM.
CIFAR-100 (Krizhevsky et al., 2009) is a dataset of 60,000 32x32 images balanced across 100 object
classes. Like COCO, the 100 object labels are also annotated for coarse-grained object categories
including ‘aquatic mammals’ and ‘household furniture’. For CIFAR data, we train a classifier which
classifies an image for a single object label. Like with COCO, we train a set of probes for the fine
and coarse labels. We were surprised to find that for CIFAR images, BEIT representations tended
to do better than NFRN50 in terms of average F1 (for the fine-grained probe, BEIT: 0.57, NFRN50:
0.47); a first for any of the experiments we ran. Given the majority of evidence shows NFRN50
encodes lexical category information stronger than BEIT, we hypothesize this is not because of BEIT
encoding lexical concepts more strongly, but due to the small resolution of images: because BEIT
uses visual tokens, it may be more robust to extremely blurry images, which are out of distribution
for both NFRN50 and BEIT.
Figure 6: 15 randomly selected images from the COCO 2017 validation dataset and the generated
captions from all models.
[Figure 6 excerpts (continued; images from the first random seed). Ground truth: “The vegetable are laid out neatly at the table.” — NFRN50 Tuned: “vegetables and fruits for sale at a market in the city”; MAGMA (Ours): “fresh produce.”; MAGMA (Rel.): “a farmers market”; NFRN Random: “a woman and her child in a garden”; BEIT: “the garden”; NFRN50: “the vegetables and fruits available at the farmers market.”; CLIP: “the farm stand with a variety of vegetables”. Ground truth: “A woman standing in front of a box handing a woman a bag of food.” — MAGMA (Rel.): “a woman in a white shirt and black pants handing out food to a man in a white shirt and black pants.”; NFRN Random: “a young girl with a flower in her hair and a smile on her face”. Ground truth: “A man lies on the beach while someone else holds a kite.”]
Figure 7: The top 50 nouns that appear in the ground truth captions of the COCO validation set and how often each model generates them. Panels show precision, recall, F1, and the Wu-Palmer similarity between the ground truth word and the most similar word in the generated captions.
Figure 8: The top 50 modifiers that appear in the ground truth captions of the COCO validation set and how often each model generates them. Panels show precision, recall, and F1.
Figure 9: The top 50 relations that appear in the ground truth captions of the COCO validation set and how often each model generates them. Panels show precision, recall, and F1.
[Figure 10 excerpt. NFRN50: “Yes!”; CLIP: “Yes, he is enjoying the grass.”]
Figure 10: 15 randomly selected images from the VQA2 validation set and the generated answers from all models.
Figure 11: Probes trained on COCO images to classify the supercategories of the objects in the
images
Figure 13: F1 of image encoder probes trained on CC3M and evaluated on COCO. We find that F1
of captions by object category tend to follow those of probe performance. Notably, the BEIT probe is much worse at transferring from CC3M to COCO, and the captioning F1 tends to be consistently higher, which makes it difficult to draw conclusions for this model. Generally, it appears the abil-
ity to encode lexical information into the image representation entails being able to transfer that
information to the LM with a linear map.
Figure 14: Probes trained on CIFAR images to classify the coarse labels of the objects in the images