Commit aece7ba

NielsRogge, sgugger, and LysandreJik authored
Improve Perceiver docs (huggingface#14786)
* Fix docs

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Code quality

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
1 parent 50bc57c commit aece7ba

File tree

1 file changed, +44 -38 lines changed


src/transformers/models/perceiver/modeling_perceiver.py

Lines changed: 44 additions & 38 deletions
@@ -810,15 +810,16 @@ def forward(
         >>> # EXAMPLE 2: using the Perceiver to classify images
         >>> # - we define an ImagePreprocessor, which can be used to embed images
         >>> preprocessor=PerceiverImagePreprocessor(
-            config,
-            prep_type="conv1x1",
-            spatial_downsample=1,
-            out_channels=256,
-            position_encoding_type="trainable",
-            concat_or_add_pos="concat",
-            project_pos_dim=256,
-            trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2),
-        )
+        ...     config,
+        ...     prep_type="conv1x1",
+        ...     spatial_downsample=1,
+        ...     out_channels=256,
+        ...     position_encoding_type="trainable",
+        ...     concat_or_add_pos="concat",
+        ...     project_pos_dim=256,
+        ...     trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2,
+        ...     ),
+        ... )

         >>> model = PerceiverModel(
         ...     config,
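For reference, the example being re-indented here builds an image preprocessor with trainable position encodings; below is a standalone, hedged sketch of the same construction (the 224-pixel config value and the dummy input are illustrative assumptions, not part of this diff):

import torch
from transformers import PerceiverConfig
from transformers.models.perceiver.modeling_perceiver import PerceiverImagePreprocessor

# a config whose image_size matches the dummy input below (assumption: 224 x 224 images)
config = PerceiverConfig(image_size=224)
preprocessor = PerceiverImagePreprocessor(
    config,
    prep_type="conv1x1",
    spatial_downsample=1,
    out_channels=256,
    position_encoding_type="trainable",
    concat_or_add_pos="concat",
    project_pos_dim=256,
    trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2),
)
# the preprocessor returns a tuple; its first element is the embedded image
embedded = preprocessor(torch.randn(1, 3, 224, 224))[0]
print(embedded.shape)  # expected (1, 224 * 224, 512): 256 conv1x1 channels concatenated with 256 position dims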
@@ -1188,10 +1189,11 @@ def forward(
     This model uses learned position embeddings. In other words, this model is not given any privileged information about
     the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.

-    `PerceiverForImageClassificationLearned` uses
-    `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv1x1") to
-    preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
-    decode the latent representation of `~transformers.PerceiverModel` into classification logits.
+    :class:`~transformers.PerceiverForImageClassificationLearned` uses
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="conv1x1"`)
+    to preprocess the input images, and
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+    representation of :class:`~transformers.PerceiverModel` into classification logits.
     """,
     PERCEIVER_START_DOCSTRING,
 )
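Since this docstring describes the end-to-end classification path, a minimal inference sketch may help; the checkpoint name and feature-extractor pairing below are assumptions based on the released Perceiver checkpoints, not part of this diff:

import requests
import torch
from PIL import Image
from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# assumed checkpoint: deepmind/vision-perceiver-learned
feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    logits = model(inputs=pixel_values).logits  # Perceiver models take `inputs`, not `pixel_values`
print(model.config.id2label[logits.argmax(-1).item()])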
@@ -1326,10 +1328,11 @@ def forward(
     This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of
     79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).

-    `PerceiverForImageClassificationLearned` uses
-    `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "pixels") to
-    preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
-    decode the latent representation of `~transformers.PerceiverModel` into classification logits.
+    :class:`~transformers.PerceiverForImageClassificationLearned` uses
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="pixels"`)
+    to preprocess the input images, and
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+    representation of :class:`~transformers.PerceiverModel` into classification logits.
     """,
     PERCEIVER_START_DOCSTRING,
 )
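"Fixed 2D Fourier position embeddings" means each pixel position is encoded with sine/cosine features over a set of frequency bands; a generic, hedged sketch of the idea (this is not the library's internal helper, and the linear band spacing is an assumption):

import torch

def fourier_features_2d(height, width, num_bands, max_freq):
    # positions normalized to [-1, 1] along each spatial axis
    ys = torch.linspace(-1.0, 1.0, height)
    xs = torch.linspace(-1.0, 1.0, width)
    pos = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2)
    # frequency bands from 1 up to half the maximum frequency
    freqs = torch.linspace(1.0, max_freq / 2.0, num_bands)
    scaled = pos[..., None] * freqs * torch.pi  # (H, W, 2, num_bands)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(2)  # (H, W, 4 * num_bands)
    return torch.cat([pos, enc], dim=-1)  # raw positions concatenated with their Fourier features

print(fourier_features_2d(224, 224, num_bands=64, max_freq=224).shape)  # torch.Size([224, 224, 258])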
@@ -1461,10 +1464,11 @@ def forward(
     This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy
     of 82.1 on ImageNet.

-    `PerceiverForImageClassificationLearned` uses
-    `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv") to preprocess
-    the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the
-    latent representation of `~transformers.PerceiverModel` into classification logits.
+    :class:`~transformers.PerceiverForImageClassificationLearned` uses
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="conv"`) to
+    preprocess the input images, and
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+    representation of :class:`~transformers.PerceiverModel` into classification logits.
     """,
     PERCEIVER_START_DOCSTRING,
 )
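The "2D conv+maxpool preprocessing network" mentioned above is a convolutional stem that downsamples the image before the Perceiver encoder sees it; a generic sketch with illustrative layer sizes (the exact stem in the library may differ):

import torch
from torch import nn

# illustrative stem: the strided conv and the maxpool each halve the resolution,
# so a 224 x 224 image becomes a 56 x 56 grid of 64-channel features
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 56, 56])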
@@ -1592,10 +1596,11 @@ def forward(

 @add_start_docstrings(
     """
-    Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. `PerceiverForOpticalFlow` uses
-    `transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "patches") to
-    preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder` to
-    decode the latent representation of `~transformers.PerceiverModel`.
+    Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI.
+    :class:`~transformers.PerceiverForOpticalFlow` uses
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type="patches"`) to
+    preprocess the input images, and :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder`
+    to decode the latent representation of :class:`~transformers.PerceiverModel`.

     As input, one concatenates 2 subsequent frames along the channel dimension and extract a 3 x 3 patch around each pixel
     (leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position
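The patch arithmetic in the paragraph above (3 x 3 x 3 x 2 = 54 values per pixel) can be reproduced with a plain unfold; a hedged sketch on dummy frames (the 368 x 496 resolution is an illustrative assumption):

import torch

# two consecutive RGB frames: (batch, channels, height, width)
frame1 = torch.randn(1, 3, 368, 496)
frame2 = torch.randn(1, 3, 368, 496)

# concatenate along the channel dimension: 3 channels x 2 frames = 6 channels
frames = torch.cat([frame1, frame2], dim=1)  # (1, 6, 368, 496)

# extract a 3 x 3 patch around each pixel: 6 * 3 * 3 = 54 values per pixel
patches = torch.nn.functional.unfold(frames, kernel_size=3, padding=1)
print(patches.shape)  # torch.Size([1, 54, 182528]), one 54-value column per pixel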
@@ -1717,25 +1722,26 @@ def forward(
     """
     Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.

-    `PerceiverForMultimodalAutoencoding` uses
-    `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3 modalities:
-    images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality
-    separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same
-    number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.
+    :class:`~transformers.PerceiverForMultimodalAutoencoding` uses
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3
+    modalities: images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every
+    modality separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to
+    the same number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver
+    encoder.

-    `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
-    representation of `~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
+    representation of :class:`~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
     queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video
     in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do
     cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which
-    can be provided as additional input to the forward pass of `PerceiverForMultimodalAutoencoding`.
+    can be provided as additional input to the forward pass of :class:`~transformers.PerceiverForMultimodalAutoencoding`.

-    `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of the
-    different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
-    cross-attention is performed with the latent representation of `PerceiverModel`.
+    :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of
+    the different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
+    cross-attention is performed with the latent representation of :class:`~transformers.PerceiverModel`.

-    Finally, `transformers.models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor` is used to turn this
-    tensor into an actual video. It first splits up the output into the different modalities, and then applies the
+    Finally, :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor` is used to turn
+    this tensor into an actual video. It first splits up the output into the different modalities, and then applies the
     respective postprocessor for each modality.

     Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the
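To make the subsampling mechanism concrete, here is a hedged sketch of chunked decoding (the checkpoint name, tensor shapes, and chunk arithmetic are assumptions modelled on the released multimodal checkpoint, not part of this diff):

import torch
from transformers import PerceiverForMultimodalAutoencoding

# assumed checkpoint: deepmind/multimodal-perceiver
model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver")

# dummy inputs: 16 video frames, raw audio, and an all-zeros (i.e. masked) label tensor
images = torch.randn(1, 16, 3, 224, 224)
audio = torch.randn(1, 30720, 1)
inputs = {"image": images, "audio": audio, "label": torch.zeros(1, 700)}

# decode only the first of 128 chunks of output points per forward pass
nchunks = 128
image_chunk_size = 16 * 224 * 224 // nchunks
audio_chunk_size = audio.shape[1] // model.config.samples_per_patch // nchunks
subsampling = {
    "image": torch.arange(0, image_chunk_size),
    "audio": torch.arange(0, audio_chunk_size),
    "label": None,  # the label queries are cheap, so they are not subsampled
}
with torch.no_grad():
    outputs = model(inputs=inputs, subsampled_output_points=subsampling)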
