updated visualBERT modelcard #40057
@@ -14,87 +14,61 @@ rendered properly in your Markdown viewer.
-->
# VisualBERT
<div class="flex flex-wrap space-x-1"> | ||||||||||||||||||||||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||||||||||||||||||||||||
<div style="float: right;"> | ||||||||||||||||||||||||
<div class="flex flex-wrap space-x-1"> | ||||||||||||||||||||||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||||||||||||||||||||||||
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white"> | ||||||||||||||||||||||||
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC | ||||||||||||||||||||||||
"> | ||||||||||||||||||||||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat"> | ||||||||||||||||||||||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||||||||||||||||||||||||
</div> | ||||||||||||||||||||||||
</div> | ||||||||||||||||||||||||
## Overview
The VisualBERT model was proposed in [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://huggingface.co/papers/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
VisualBERT is a neural network trained on a variety of (image, text) pairs.
The abstract from the paper is the following:
*We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an
associated input image with self-attention. We further propose two visually-grounded language model objectives for
pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2,
and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly
simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any
explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between
verbs and image regions corresponding to their arguments.*
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/uclanlp/visualbert).
[VisualBERT](https://huggingface.co/papers/1908.03557) is a vision-and-language model that extends the [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) architecture to understand how text and images relate. It is designed as a simple yet high-performing baseline for a variety of multimodal tasks. Instead of raw pixels, it consumes visual features extracted from object-detector regions together with the text. In an approach called early fusion, these inputs are fed jointly into a single Transformer stack initialized from BERT, where self-attention implicitly aligns words with their corresponding image regions.
## Usage tips
You can find all the original VisualBERT checkpoints under the [UCLA NLP](https://huggingface.co/uclanlp) organization.
1. Most of the checkpoints provided work with the [`VisualBertForPreTraining`] configuration. The other
   checkpoints provided are fine-tuned for downstream tasks - VQA ('visualbert-vqa'), VCR
   ('visualbert-vcr'), and NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on these downstream tasks, it is
   recommended that you use the pretrained checkpoints (see the loading sketch after these tips).
2. For the VCR task, the authors use a fine-tuned detector to generate the visual embeddings for all of the checkpoints.
   We do not provide the detector and its weights as part of the package, but they are available in the research
   projects, and the states can be loaded directly into the detector provided there.
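
As a rough, minimal sketch of the first tip (not part of the original card), the snippet below pairs the pretraining checkpoint with [`VisualBertForPreTraining`] and the downstream checkpoints with their task-specific heads; the checkpoint names are the ones listed above from the [UCLA NLP](https://huggingface.co/uclanlp) organization.

```python
# Minimal sketch: match each checkpoint with the head class for its task.
from transformers import (
    VisualBertForMultipleChoice,
    VisualBertForPreTraining,
    VisualBertForQuestionAnswering,
    VisualBertForVisualReasoning,
)

# Pretraining checkpoint - recommended when you are not targeting a specific downstream task.
pretraining_model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Fine-tuned downstream checkpoints.
vqa_model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")
vcr_model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")
nlvr2_model = VisualBertForVisualReasoning.from_pretrained("uclanlp/visualbert-nlvr2")
```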
> [!TIP]
> Click on the VisualBERT models in the right sidebar for more examples of how to apply VisualBERT to different image and language tasks.
VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice,
visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare
embeddings for image-text pairs. Both the text and visual features are then projected to a latent space with identical
dimension.
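
For a concrete look at those dimensions, the short sketch below (an illustration, not from the original card) reads them off the configuration: `visual_embedding_dim` is the size of the incoming visual features and `hidden_size` is the shared latent dimension both modalities are projected into.

```python
from transformers import VisualBertConfig

# Inspect the dimensions used by a pretraining checkpoint.
config = VisualBertConfig.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
print(config.visual_embedding_dim)  # size of the detector/CNN features passed as visual_embeds (e.g. 2048)
print(config.hidden_size)           # shared latent dimension for text and visual tokens (e.g. 768)
```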
The example below demonstrates how to answer a question based on an image with the [`AutoModel`] class.
To feed images to the model, each image is passed through a pre-trained object detector and the regions and the
bounding boxes are extracted. The authors use the features generated after passing these regions through a pre-trained
CNN like ResNet as visual embeddings. They also add absolute position embeddings, and feed the resulting sequence of
vectors to a standard BERT model. The text input is concatenated in front of the visual embeddings in the embedding
layer, and is expected to be bounded by [CLS] and [SEP] tokens, as in BERT. The segment IDs must also be set
appropriately for the textual and visual parts.
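
As a small illustration of those segment IDs (a sketch, not taken from the original card): the tokenizer already returns `token_type_ids` of 0 for the textual part, while the visual part gets its own `visual_token_type_ids` filled with 1, plus a matching attention mask. The 36 regions and 2048-dim features below are placeholder values standing in for real detector output.

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("What is the man eating?", return_tensors="pt")
print(inputs["token_type_ids"])  # zeros: segment ID of the textual part

# Placeholder for detector output: 36 regions with 2048-dim features each.
visual_embeds = torch.rand(1, 36, 2048)
visual_token_type_ids = torch.ones((1, 36), dtype=torch.long)   # segment ID of the visual part
visual_attention_mask = torch.ones((1, 36), dtype=torch.float)  # attend to every region
```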
<hfoptions id="usage">
<hfoption id="AutoModel">
Review comment: Have a complete example like this:

```python
import torch
import torchvision
from PIL import Image
from transformers import AutoTokenizer, VisualBertForQuestionAnswering
import requests
from io import BytesIO

def get_visual_embeddings_simple(image, device=None):
    # Use a ResNet-50 backbone (without its classification head) to extract image features.
    model = torchvision.models.resnet50(pretrained=True)
    model = torch.nn.Sequential(*list(model.children())[:-1])
    model.to(device)
    model.eval()
    transform = torchvision.transforms.Compose([
        torchvision.transforms.Resize(256),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
    if isinstance(image, str):
        image = Image.open(image).convert('RGB')
    elif isinstance(image, Image.Image):
        image = image.convert('RGB')
    else:
        raise ValueError("Image must be a PIL Image or path to image file")
    image_tensor = transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model(image_tensor)
    # Repeat the pooled feature to form a short sequence of "visual tokens".
    batch_size = features.shape[0]
    feature_dim = features.shape[1]
    visual_seq_length = 10
    visual_embeds = features.squeeze(-1).squeeze(-1).unsqueeze(1).expand(batch_size, visual_seq_length, feature_dim)
    return visual_embeds

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

response = requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
image = Image.open(BytesIO(response.content))
visual_embeds = get_visual_embeddings_simple(image)

inputs = tokenizer("What is shown in this image?", return_tensors="pt")
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": visual_token_type_ids,
    "visual_attention_mask": visual_attention_mask,
})

with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_answer_idx = logits.argmax(-1).item()
print(f"Predicted answer: {predicted_answer_idx}")
```
The [`BertTokenizer`] is used to encode the text. A custom detector/image processor must be used
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:

- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers-research-projects/tree/main/visual_bert) : This notebook contains an example on VisualBERT VQA.
- [Generate Embeddings for VisualBERT (Colab Notebook)](https://colab.research.google.com/drive/1bLGxKdldwqnMVA5x4neY7-l_8fKGWQYI?usp=sharing) : This notebook contains an example on how to generate visual embeddings.

The following example shows how to get the last hidden state using [`VisualBertModel`]:

```python
>>> import torch
>>> from transformers import BertTokenizer, VisualBertModel

>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
>>> # this is a custom function that returns the visual embeddings given the image path
>>> visual_embeds = get_visual_embeddings(image_path)

>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update(
...     {
...         "visual_embeds": visual_embeds,
...         "visual_token_type_ids": visual_token_type_ids,
...         "visual_attention_mask": visual_attention_mask,
...     }
... )
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
```

```py
import torch
from transformers import BertTokenizer, VisualBertModel

model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("What is the man eating?", return_tensors="pt")
visual_embeds = torch.rand(1, 36, 2048)
visual_token_type_ids = torch.ones((1, 36), dtype=torch.long)
visual_attention_mask = torch.ones((1, 36), dtype=torch.float)

inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": visual_token_type_ids,
    "visual_attention_mask": visual_attention_mask,
})

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
print("Last hidden state shape:", last_hidden_state.shape)
```

</hfoption>
</hfoptions>

## Notes
- VisualBERT processes both text and visual inputs, so include visual features alongside text tokens. Use [`BertTokenizer`] for text and ensure images are preprocessed before input.
## VisualBertConfig
Review comment: Only PyTorch is supported, so all those other badges should be removed.