
Commit 4c99e55

NielsRogge and sgugger authored
Improve documentation of some models (huggingface#14695)
* Migrate docs to mdx
* Update TAPAS docs
* Remove lines
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply some more suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add pt/tf switch to code examples
* More improvements
* Improve docstrings
* More improvements

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1 parent 32eb29f commit 4c99e55


14 files changed (+1044 / -1120 lines)


docs/source/model_doc/beit.rst

Lines changed: 9 additions & 2 deletions
@@ -40,8 +40,15 @@ significantly outperforming from-scratch DeiT training (81.8%) with the same set
 Tips:
 
 - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
-  outperform both the original model (ViT) as well as Data-efficient Image Transformers (DeiT) when fine-tuned on
-  ImageNet-1K and CIFAR-100.
+  outperform both the :doc:`original model (ViT) <vit>` as well as :doc:`Data-efficient Image Transformers (DeiT)
+  <deit>` when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
+  fine-tuning on custom data `here
+  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__ (you can just replace
+  :class:`~transformers.ViTFeatureExtractor` by :class:`~transformers.BeitFeatureExtractor` and
+  :class:`~transformers.ViTForImageClassification` by :class:`~transformers.BeitForImageClassification`).
+- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
+  performing masked image modeling. You can find it `here
+  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT>`__.
 - As the BEiT models expect each image to be of the same size (resolution), one can use
   :class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
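For illustration, here is a minimal inference sketch for the tip above about swapping the ViT classes for their BEiT counterparts. The checkpoint name `microsoft/beit-base-patch16-224` and the example image URL are assumptions for this sketch, not taken from the diff:

```python
import requests
from PIL import Image
from transformers import BeitFeatureExtractor, BeitForImageClassification

# Checkpoint name is an assumption; any BEiT checkpoint fine-tuned for image classification works the same way.
checkpoint = "microsoft/beit-base-patch16-224"
feature_extractor = BeitFeatureExtractor.from_pretrained(checkpoint)
model = BeitForImageClassification.from_pretrained(checkpoint)

# Example image (the COCO cats picture commonly used in the docs).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resize (or rescale) and normalize the image, then run a forward pass.
inputs = feature_extractor(images=image, return_tensors="pt")
logits = model(**inputs).logits

# Pick the class with the highest logit.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

The flow is identical to the existing ViT example; only the feature extractor and model classes change.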

docs/source/model_doc/imagegpt.mdx

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ImageGPT

## Overview

The ImageGPT model was proposed in [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt) by Mark
Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like
model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.

The abstract from the paper is the following:

*Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models
can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels,
without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels,
we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and
low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide
ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also
competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0%
top-1 accuracy on a linear probe of our features.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>

<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
[here](https://github.com/openai/image-gpt).

Tips:

- Demo notebooks for ImageGPT can be found
  [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ImageGPT).
- ImageGPT is almost exactly the same as [GPT-2](./model_doc/gpt2), with the exception that a different activation
  function is used (namely "quick gelu"), and the layer normalization layers don't mean-center the inputs. ImageGPT
  also doesn't have tied input and output embeddings.
- As the time and memory requirements of the attention mechanism of Transformers scale quadratically in the sequence
  length, the authors pre-trained ImageGPT on smaller input resolutions, such as 32x32 and 64x64. However, feeding a
  sequence of 32x32x3=3072 tokens from 0..255 into a Transformer is still prohibitively large. Therefore, the authors
  applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
  sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
  embedding matrix. In other words, the vocabulary size of ImageGPT is 512, plus 1 for a special "start of sentence"
  (SOS) token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
  images for the model.
- Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
  performant image features useful for downstream tasks, such as image classification. The authors showed that the
  features in the middle of the network are the most performant, and can be used as-is to train a linear model (such as
  a sklearn logistic regression model). This is also referred to as "linear probing". Features can easily be obtained
  by forwarding the image through the model with `output_hidden_states=True`, and then average-pooling the hidden
  states at whatever layer you like.
- Alternatively, one can further fine-tune the entire model on a downstream dataset, similar to BERT. For this, you can
  use [`ImageGPTForImageClassification`].
- ImageGPT comes in different sizes: there's ImageGPT-small, ImageGPT-medium and ImageGPT-large. The authors also
  trained an XL variant, which they didn't release. The differences in size are summarized in the following table:

| **Model variant** | **Depths** | **Hidden sizes** | **Decoder hidden size** | **Params (M)** | **ImageNet-1k Top 1** |
|---|---|---|---|---|---|
| MiT-b0 | [2, 2, 2, 2] | [32, 64, 160, 256] | 256 | 3.7 | 70.5 |
| MiT-b1 | [2, 2, 2, 2] | [64, 128, 320, 512] | 256 | 14.0 | 78.7 |
| MiT-b2 | [3, 4, 6, 3] | [64, 128, 320, 512] | 768 | 25.4 | 81.6 |
| MiT-b3 | [3, 4, 18, 3] | [64, 128, 320, 512] | 768 | 45.2 | 83.1 |
| MiT-b4 | [3, 8, 27, 3] | [64, 128, 320, 512] | 768 | 62.6 | 83.6 |
| MiT-b5 | [3, 6, 40, 3] | [64, 128, 320, 512] | 768 | 82.0 | 83.8 |

## ImageGPTConfig

[[autodoc]] ImageGPTConfig

## ImageGPTFeatureExtractor

[[autodoc]] ImageGPTFeatureExtractor

- __call__

## ImageGPTModel

[[autodoc]] ImageGPTModel

- forward

## ImageGPTForCausalImageModeling

[[autodoc]] ImageGPTForCausalImageModeling

- forward

## ImageGPTForImageClassification

[[autodoc]] ImageGPTForImageClassification

- forward
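As a rough sketch of the linear-probing recipe described in the tips of this new file (a forward pass with `output_hidden_states=True`, then average-pooling a middle layer), the snippet below is an illustration only; the checkpoint name `openai/imagegpt-small` and the example image URL are assumptions, not taken from the diff:

```python
import torch
import requests
from PIL import Image
from transformers import ImageGPTFeatureExtractor, ImageGPTModel

# Checkpoint name is an assumption; the authors released small, medium and large checkpoints.
checkpoint = "openai/imagegpt-small"
feature_extractor = ImageGPTFeatureExtractor.from_pretrained(checkpoint)
model = ImageGPTModel.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The feature extractor resizes the image and maps each (R,G,B) pixel to one of the 512 color
# clusters, so it returns `input_ids` rather than pixel values.
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# `hidden_states` is a tuple with one tensor per layer (plus the embedding output),
# each of shape (batch_size, sequence_length, hidden_size).
middle_layer = len(outputs.hidden_states) // 2
features = outputs.hidden_states[middle_layer].mean(dim=1)  # average-pool over the sequence

# `features` can now be used to train a linear classifier, e.g. sklearn's LogisticRegression.
print(features.shape)
```

Picking the middle layer follows the paper's observation that mid-network features probe best; any layer index can be substituted.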

docs/source/model_doc/imagegpt.rst

Lines changed: 0 additions & 110 deletions
This file was deleted.

docs/source/model_doc/luke.rst

Lines changed: 3 additions & 0 deletions
@@ -74,6 +74,9 @@ Tips:
   head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
   ``task="entity_span_classification"``. Please refer to the example code of each head models.
 
+A demo notebook on how to fine-tune :class:`~transformers.LukeForEntityPairClassification` for relation
+classification can be found `here <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE>`__.
+
 There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
 the HuggingFace implementation of LUKE. They can be found `here
 <https://github.com/studio-ousia/luke/tree/master/notebooks>`__.
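To complement the added tip, here is a short sketch of relation classification with `LukeForEntityPairClassification`; it assumes the `studio-ousia/luke-large-finetuned-tacred` checkpoint and the entity-span interface of `LukeTokenizer`:

```python
from transformers import LukeTokenizer, LukeForEntityPairClassification

# Checkpoint fine-tuned for relation classification on TACRED (name assumed for this sketch).
checkpoint = "studio-ousia/luke-large-finetuned-tacred"
tokenizer = LukeTokenizer.from_pretrained(checkpoint)
model = LukeForEntityPairClassification.from_pretrained(checkpoint)

text = "Beyoncé lives in Los Angeles."
# Character spans of the head and tail entity mentions ("Beyoncé" and "Los Angeles").
entity_spans = [(0, 7), (17, 28)]

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

# The logits range over the relation labels of the fine-tuned classification head.
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted relation:", model.config.id2label[predicted_class_idx])
```

The tokenizer takes the entity pair as character offsets into the text, which is what makes this head model convenient for relation classification datasets.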

0 commit comments
