@@ -810,15 +810,16 @@ def forward(
>>> # EXAMPLE 2: using the Perceiver to classify images
>>> # - we define an ImagePreprocessor, which can be used to embed images
>>> preprocessor=PerceiverImagePreprocessor(
- config,
- prep_type="conv1x1",
- spatial_downsample=1,
- out_channels=256,
- position_encoding_type="trainable",
- concat_or_add_pos="concat",
- project_pos_dim=256,
- trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2),
- )
+ ... config,
+ ... prep_type="conv1x1",
+ ... spatial_downsample=1,
+ ... out_channels=256,
+ ... position_encoding_type="trainable",
+ ... concat_or_add_pos="concat",
+ ... project_pos_dim=256,
+ ... trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2,
+ ... ),
+ ... )

>>> model = PerceiverModel(
... config,
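For readers following the hunk above, the reformatted doctest corresponds roughly to the standalone sketch below. The 224 x 224 input size, the random pixel values, and the classification-decoder arguments are illustrative assumptions mirroring the surrounding EXAMPLE 2 docstring rather than this hunk alone:

import torch
from transformers import PerceiverConfig, PerceiverModel
from transformers.models.perceiver.modeling_perceiver import (
    PerceiverClassificationDecoder,
    PerceiverImagePreprocessor,
)

config = PerceiverConfig(image_size=224)

# Embed images with a 1x1 convolution and concatenate trainable position encodings.
preprocessor = PerceiverImagePreprocessor(
    config,
    prep_type="conv1x1",
    spatial_downsample=1,
    out_channels=256,
    position_encoding_type="trainable",
    concat_or_add_pos="concat",
    project_pos_dim=256,
    trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2),
)

# Decode the latents into classification logits with a single trainable query.
decoder = PerceiverClassificationDecoder(
    config,
    num_channels=config.d_latents,
    trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
    use_query_residual=True,
)

model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)

pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch
outputs = model(inputs=pixel_values)
print(outputs.logits.shape)  # (batch_size, config.num_labels)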
@@ -1188,10 +1189,11 @@ def forward(
This model uses learned position embeddings. In other words, this model is not given any privileged information about
the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.

- `PerceiverForImageClassificationLearned` uses
- `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv1x1") to
- preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
- decode the latent representation of `~transformers.PerceiverModel` into classification logits.
+ :class:`~transformers.PerceiverForImageClassificationLearned` uses
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="conv1x1"`)
+ to preprocess the input images, and
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+ representation of :class:`~transformers.PerceiverModel` into classification logits.
""",
PERCEIVER_START_DOCSTRING,
)
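A minimal usage sketch of the class documented in this hunk. The checkpoint name `deepmind/vision-perceiver-learned`, the use of `PerceiverFeatureExtractor`, and the sample image URL are assumptions of this sketch, not part of the diff:

import requests
import torch
from PIL import Image
from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

# Assumed checkpoint name; any compatible fine-tuned checkpoint works the same way.
feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs=encoding["pixel_values"])

# Logits over the ImageNet classes; the docstring above reports ~72.7 top-1 accuracy.
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])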
@@ -1326,10 +1328,11 @@ def forward(
This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of
79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).

- `PerceiverForImageClassificationLearned` uses
- `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "pixels") to
- preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
- decode the latent representation of `~transformers.PerceiverModel` into classification logits.
+ :class:`~transformers.PerceiverForImageClassificationLearned` uses
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="pixels"`)
+ to preprocess the input images, and
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+ representation of :class:`~transformers.PerceiverModel` into classification logits.
""",
PERCEIVER_START_DOCSTRING,
)
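A short sketch for the Fourier variant described here. The class name `PerceiverForImageClassificationFourier` and the checkpoint name are assumptions of this sketch (the hunk itself only shows the docstring text):

import torch
from transformers import PerceiverForImageClassificationFourier

# Assumed checkpoint name; trained with fixed 2D Fourier position embeddings.
model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")

# With prep_type="pixels" the raw RGB pixels are fed directly to the encoder,
# with fixed 2D Fourier features concatenated as position information.
pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    logits = model(inputs=pixel_values).logits
print(logits.shape)  # (1, num_labels), ImageNet classes for the pretrained checkpoint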
@@ -1461,10 +1464,11 @@ def forward(
This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy
of 82.1 on ImageNet.

- `PerceiverForImageClassificationLearned` uses
- `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv") to preprocess
- the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the
- latent representation of `~transformers.PerceiverModel` into classification logits.
+ :class:`~transformers.PerceiverForImageClassificationLearned` uses
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="conv"`) to
+ preprocess the input images, and
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+ representation of :class:`~transformers.PerceiverModel` into classification logits.
""",
PERCEIVER_START_DOCSTRING,
)
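For the conv+maxpool variant, a sketch of instantiating a randomly initialised model for training from scratch. Assumptions of this sketch: the decorated class is `PerceiverForImageClassificationConvProcessing`, and `num_labels=10` is an arbitrary example value:

from transformers import PerceiverConfig, PerceiverForImageClassificationConvProcessing

# Randomly initialised model (prep_type="conv" preprocessing is configured inside the class).
config = PerceiverConfig(num_labels=10)
model = PerceiverForImageClassificationConvProcessing(config)
print(sum(p.numel() for p in model.parameters()))  # total parameter count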
@@ -1592,10 +1596,11 @@ def forward(

@add_start_docstrings(
"""
- Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. `PerceiverForOpticalFlow` uses
- `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "patches") to
- preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder` to
- decode the latent representation of `~transformers.PerceiverModel`.
+ Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI.
+ :class:`~transformers.PerceiverForOpticalFlow` uses
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type="patches"`) to
+ preprocess the input images, and :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder`
+ to decode the latent representation of :class:`~transformers.PerceiverModel`.

As input, one concatenates 2 subsequent frames along the channel dimension and extract a 3 x 3 patch around each pixel
(leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position
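The input layout described above can be sketched as follows. The checkpoint name, the 368 x 496 resolution, and the output shape are assumptions of this sketch based on the library's documented usage:

import torch
from transformers import PerceiverForOpticalFlow

# Assumed checkpoint name for this sketch.
model = PerceiverForOpticalFlow.from_pretrained("deepmind/optical-flow-perceiver")

# Two consecutive frames, with a 3 x 3 patch of RGB values extracted around every pixel:
# 3 x 3 x 3 = 27 values per pixel per frame, concatenated to 54 channels inside the preprocessor.
# Shape: (batch, frames, patch values, height, width); 368 x 496 is assumed to be the training resolution.
patches = torch.randn(1, 2, 27, 368, 496)

with torch.no_grad():
    outputs = model(inputs=patches)

# Per-pixel flow predictions, expected shape (batch, height, width, 2).
print(outputs.logits.shape)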
@@ -1717,25 +1722,26 @@ def forward(
"""
Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.

- `PerceiverForMultimodalAutoencoding` uses
- `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3 modalities:
- images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality
- separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same
- number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.
+ :class:`~transformers.PerceiverForMultimodalAutoencoding` uses
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3
+ modalities: images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every
+ modality separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to
+ the same number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver
+ encoder.

- `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
- representation of `~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
+ representation of :class:`~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video
in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do
cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which
- can be provided as additional input to the forward pass of `PerceiverForMultimodalAutoencoding`.
+ can be provided as additional input to the forward pass of :class:`~transformers.PerceiverForMultimodalAutoencoding`.

- `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of the
- different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
- cross-attention is performed with the latent representation of `PerceiverModel`.
+ :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of
+ the different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
+ cross-attention is performed with the latent representation of :class:`~transformers.PerceiverModel`.

- Finally, `transformers.models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor` is used to turn this
- tensor into an actual video. It first splits up the output into the different modalities, and then applies the
+ Finally, :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor` is used to turn
+ this tensor into an actual video. It first splits up the output into the different modalities, and then applies the
respective postprocessor for each modality.

Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the
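The query-subsampling mechanism described in this docstring can be sketched as below. The checkpoint name, the input shapes, the chunking arithmetic, and the `subsampled_output_points` keyword are assumptions of this sketch based on the library's documented usage, not content of this diff:

import numpy as np
import torch
from transformers import PerceiverForMultimodalAutoencoding

# Assumed checkpoint name for this sketch.
model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver")

# Dummy Kinetics-style inputs: 16 video frames, the matching audio, and a masked (all-zero) class label,
# as described in the docstring above. Shapes are assumed to match the model's default config.
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))  # 16 frames * 1920 audio samples per frame
inputs = dict(image=images, audio=audio, label=torch.zeros((1, 700)))

# Autoencode only one chunk of the outputs per forward pass by subsampling the decoder queries.
nchunks = 128
image_chunk_size = int(np.prod((16, 224, 224)) // nchunks)
audio_chunk_size = audio.shape[1] // model.config.samples_per_patch // nchunks
chunk_idx = 0
subsampling = {
    "image": torch.arange(image_chunk_size * chunk_idx, image_chunk_size * (chunk_idx + 1)),
    "audio": torch.arange(audio_chunk_size * chunk_idx, audio_chunk_size * (chunk_idx + 1)),
    "label": None,
}

with torch.no_grad():
    outputs = model(inputs=inputs, subsampled_output_points=subsampling)
# One reconstructed chunk per modality (image, audio) plus the classification logits.
print({modality: tensor.shape for modality, tensor in outputs.logits.items()})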