<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ImageGPT

## Overview

The ImageGPT model was proposed in [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt) by Mark
Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like
model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.

The abstract from the paper is the following:

*Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models
can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels,
without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels,
we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and
low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide
ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also
competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0%
top-1 accuracy on a linear probe of our features.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>

<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
[here](https://github.com/openai/image-gpt).
Tips:

- Demo notebooks for ImageGPT can be found
  [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ImageGPT).
- ImageGPT is almost exactly the same as [GPT-2](./model_doc/gpt2), with the exception that a different activation
  function is used (namely "quick gelu"), and the layer normalization layers don't mean-center the inputs. ImageGPT
  also doesn't tie its input and output embeddings.
- As the time and memory requirements of the Transformer attention mechanism scale quadratically in the sequence
  length, the authors pre-trained ImageGPT on smaller input resolutions, such as 32x32 and 64x64. However, feeding a
  sequence of 32x32x3 = 3072 values from 0..255 into a Transformer is still prohibitively expensive. Therefore, the
  authors applied k-means clustering to the (R,G,B) pixel values with k=512. This way, each pixel becomes a single
  cluster index, giving a 32*32 = 1024-long sequence of integers in the range 0..511. We thus shrink the sequence
  length at the cost of a bigger embedding matrix. In other words, the vocabulary size of ImageGPT is 512, plus 1 for
  a special "start of sentence" (SOS) token used at the beginning of every sequence. One can use
  [`ImageGPTFeatureExtractor`] to prepare images for the model (see the first sketch after the table below).
- Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
  performant image features useful for downstream tasks, such as image classification. The authors showed that the
  features in the middle of the network are the most performant, and can be used as-is to train a linear model (such
  as a scikit-learn logistic regression model). This is also referred to as "linear probing". Features can easily be
  obtained by forwarding the image through the model with `output_hidden_states=True`, and then average-pooling the
  hidden states at whatever layer you like (see the second sketch after the table below).
- Alternatively, one can further fine-tune the entire model on a downstream dataset, similar to BERT. For this, you
  can use [`ImageGPTForImageClassification`] (see the third sketch after the table below).
- ImageGPT comes in different sizes: there's ImageGPT-small, ImageGPT-medium and ImageGPT-large. The authors also
  trained an XL variant, which they didn't release. The differences in size are summarized in the following table:

| **Model variant** | **Layers** | **Hidden size** | **Params** |
|---|---|---|---|
| ImageGPT-small | 24 | 512 | 76M |
| ImageGPT-medium | 36 | 1024 | 455M |
| ImageGPT-large | 48 | 1536 | 1.4B |
| ImageGPT-XL (not released) | 60 | 3072 | 6.8B |
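
The sketches below illustrate these tips. First, a minimal sketch of preparing an image with
[`ImageGPTFeatureExtractor`], assuming the `openai/imagegpt-small` checkpoint and an example image from COCO (both
are illustrative choices, not requirements):

```python
import requests
from PIL import Image
from transformers import ImageGPTFeatureExtractor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")

# the feature extractor resizes the image to 32x32 and maps each pixel to its nearest
# color cluster, so the image becomes a sequence of 32*32 = 1024 integers in 0..511
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([1, 1024])
```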
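
Next, a sketch of linear probing, continuing from the sketch above: forward the prepared `inputs` through
[`ImageGPTModel`] with `output_hidden_states=True`, then average-pool the hidden states of a middle layer (which
layer works best is a hyperparameter to tune; the midpoint below is just a reasonable default):

```python
import torch
from transformers import ImageGPTModel

model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape (batch_size, 1024, hidden_size);
# average-pooling over the sequence dimension gives one feature vector per image
middle_layer = len(outputs.hidden_states) // 2
features = outputs.hidden_states[middle_layer].mean(dim=1)

# `features` can now be used to fit e.g. a scikit-learn LogisticRegression as the linear probe
```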
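
For fine-tuning, a sketch using [`ImageGPTForImageClassification`]; `num_labels=10` and the label below are
hypothetical stand-ins for your own dataset:

```python
import torch
from transformers import ImageGPTForImageClassification

# a classification head is placed on top of the average-pooled hidden states;
# num_labels=10 is a hypothetical number of classes
model = ImageGPTForImageClassification.from_pretrained("openai/imagegpt-small", num_labels=10)

labels = torch.tensor([3])  # hypothetical ground-truth class index
outputs = model(**inputs, labels=labels)  # `inputs` as prepared in the first sketch
outputs.loss.backward()  # plug this into your own training loop or the Trainer
```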
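
Finally, as mentioned in the overview, the model can generate images unconditionally via
[`ImageGPTForCausalImageModeling`]. This sketch follows the demo notebooks linked above; the sampling settings
(`do_sample=True`, `top_k=40`) are choices taken from those notebooks, not required values:

```python
import torch
from PIL import Image
from transformers import ImageGPTFeatureExtractor, ImageGPTForCausalImageModeling

feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
model = ImageGPTForCausalImageModeling.from_pretrained("openai/imagegpt-small")

# every sequence starts with the SOS token, whose id is vocab_size - 1 (= 512)
context = torch.full((1, 1), model.config.vocab_size - 1, dtype=torch.long)
output = model.generate(
    input_ids=context, max_length=model.config.n_positions + 1, do_sample=True, top_k=40
)

# map the sampled cluster ids (dropping the SOS token) back to RGB values;
# the color clusters live in [-1, 1], so rescale to 0..255 and reshape to 32x32x3
clusters = torch.tensor(feature_extractor.clusters)
pixels = torch.round((clusters[output[0, 1:]] + 1.0) * 127.5)
generated_image = Image.fromarray(pixels.reshape(32, 32, 3).to(torch.uint8).numpy())
```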

## ImageGPTConfig

[[autodoc]] ImageGPTConfig

## ImageGPTFeatureExtractor

[[autodoc]] ImageGPTFeatureExtractor
    - __call__

## ImageGPTModel

[[autodoc]] ImageGPTModel
    - forward

## ImageGPTForCausalImageModeling

[[autodoc]] ImageGPTForCausalImageModeling
    - forward

## ImageGPTForImageClassification

[[autodoc]] ImageGPTForImageClassification
    - forward