Visvesvaraya Technological University
A Seminar Report On
“AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE”
Submitted in partial fulfilment of the requirement for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
NIKHIL G R 1SJ17CS050
S J C INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CHIKKABALLAPUR-562101
2020-2021
||Jai Sri Gurudev||
Sri Adichunchanagiri Shikshana Trust®
CERTIFICATE
This is to certify that the Seminar entitled “AN IMAGE IS WORTH 16X16 WORDS
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE” is a bonafide work
carried out by Nikhil G R (1SJ17CS050) in partial fulfillment for the award of Bachelor
of Engineering in Computer Science and Engineering in Eighth Semester of the
Visvesvaraya Technological University, Belagavi during the year 2021. It is certified
that all corrections/suggestions indicated for internal assessment have been incorporated
in the report. The seminar report has been approved as it satisfies the academic
requirements with respect to eighth semester prescribed for the above said degree.
ACKNOWLEDGEMENT
With reverential pranam, we express our sincere gratitude and salutations to the feet of his
holiness Byravaikya Padmabhushana Sri Sri Sri Dr. Balagangadharanatha Maha Swamiji, and
his holiness Jagadguru Sri Sri Sri Dr. Nirmalanandanatha Swamiji of Sri Adichunchanagiri
Mutt for their unlimited blessings. First and foremost, I wish to express my deep and sincere
gratitude to our institution, Sri Jagadguru Chandrashekaranatha Swamiji Institute of
Technology, for providing me an opportunity to complete my seminar successfully.
I extend deep sense of sincere gratitude to Dr. Ravi Kumar K M, Principal, S J C
Institute of Technology, Chickballapur, for providing an opportunity to complete the
Seminar.
I extend special in-depth, heartfelt, and sincere gratitude to Dr. Anitha T N, Head of the
Department, Computer Science and Engineering, S J C Institute of Technology,
Chickballapur, for her constant support and valuable guidance for the Seminar.
I convey my sincere thanks to my guide Dr. Anitha T N, Professor & HOD, Department
of Computer Science and Engineering, S J C Institute of Technology, for her constant
support, valuable guidance and suggestions for the Seminar.
I also feel immense pleasure to express deep and profound gratitude to Seminar
Coordinator Prof. Pradeep Kumar G M, Assistant Professor, Department of Computer
Science and Engineering, S J C Institute of Technology, for his guidance and suggestions
for the Seminar.
Finally, I would like to thank all faculty members of Department of Computer Science and
Engineering, S J C Institute of Technology, Chickballapur for their support.
NIKHIL G R (1SJ17CS050)
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural language
processing tasks, its applications to computer vision remain limited. In vision, attention is either
applied in conjunction with convolutional networks, or used to replace certain components of
convolutional networks while keeping their overall structure in place. We show that this reliance
on CNNs is not necessary and a pure transformer applied directly to sequences of image patches
can perform very well on image classification tasks. When pre-trained on large amounts of data and
transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100,
VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art
convolutional networks while requiring substantially fewer computational resources to train.
CONTENTS
Acknowledgement
Abstract
Contents
List of Figures
1 INTRODUCTION
1.1 Overview
1.2 Problem Statement
2 LITERATURE SURVEY
3 METHODOLOGY
3.1 Method
3.2 Fine-tuning and Higher Resolution
4 WORKING PRINCIPLE
4.1 How it Works
4.2 Positional Embeddings
5 SCOPE AND APPLICATIONS
5.1 Scope
5.2 Applications
6 CONCLUSION
7 REFERENCES
LIST OF FIGURES
CHAPTER-1
INTRODUCTION
1.1 Overview
The Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and
transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset
or the in-house JFT-300M dataset, ViT approaches or beats the state of the art on multiple image
recognition benchmarks. In particular, the best model reaches an accuracy of 88.55% on ImageNet,
90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have
since become the state of the art method in many NLP tasks. Large Transformer-based
models are often pre-trained on large corpora and then fine-tuned for the task at hand:
BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the
GPT line of work uses language modeling as its pre-training task (Radford et al., 2018;
2019; Brown et al., 2020).
Naive application of self-attention to images would require that each pixel attends to
every other pixel. With quadratic cost in the number of pixels, this does not scale to
realistic input sizes. Thus, to apply Transformers in the context of image processing,
several approximations have been tried in the past.
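To make this quadratic cost concrete, here is a small back-of-the-envelope calculation in Python (my own illustration, assuming a 224x224 input and the 16x16 patch size used by ViT): the attention matrix has sequence_length squared entries, so attending over raw pixels is roughly 65,000 times more expensive than attending over patches.

# Self-attention cost grows quadratically with the number of tokens,
# because every token attends to every other token.
H, W, P = 224, 224, 16                # assumed input resolution and patch size

pixel_tokens = H * W                  # one token per pixel
patch_tokens = (H // P) * (W // P)    # N = HW / P^2 tokens when using 16x16 patches

print(f"pixel tokens: {pixel_tokens:,} -> attention entries: {pixel_tokens ** 2:,}")
print(f"patch tokens: {patch_tokens:,} -> attention entries: {patch_tokens ** 2:,}")
# pixel tokens: 50,176 -> attention entries: 2,517,630,976
# patch tokens: 196 -> attention entries: 38,416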
Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of
size 2×2 from the input image and applies full self-attention on top. This model is very
similar to ViT, but the ViT work goes further to demonstrate that large-scale pre-training
makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs.
There has also been a lot of interest in combining convolutional neural networks (CNNs)
with forms of self-attention, e.g. by augmenting feature maps for image classification
(Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g.
for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al.,
2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object
discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et
al., 2019; Li et al., 2020).
CHAPTER-2
LITERATURE SURVEY
In this paper, the authors observe that in the Transformer model, self-attention combines
information from attended embeddings into the representation of the focal embedding in the next
layer. Thus, across layers of the Transformer, information originating from different tokens gets
increasingly mixed. This makes attention weights unreliable as explanation probes. The paper
therefore considers the problem of quantifying the flow of information through self-attention, and
proposes two post-hoc methods for approximating the attention to input tokens given attention
weights, attention rollout and attention flow, to be used when attention weights are taken as the
relative relevance of the input tokens.
Advantage: Self Attention is used.
Disadvantage: Accuracy of the model is low.
In this paper, the authors ask whether recent progress on the ImageNet classification benchmark
continues to represent meaningful generalization, or whether the community has started to overfit
to the idiosyncrasies of its labeling procedure. To answer this, they developed a significantly
more robust procedure for collecting human annotations of the ImageNet validation set. Using
these new labels, they re-assessed the accuracy of recently proposed ImageNet classifiers and
found their gains to be substantially smaller than those reported on the original labels.
Furthermore, they found the original ImageNet labels to no longer be the best predictors of this
independently collected set, indicating that their usefulness in evaluating vision models may be
nearing an end. Nevertheless, they found the new annotation procedure to have largely remedied
the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future
research in visual recognition.
Advantage: Quality Assurance.
Disadvantage: Validation of data.
In this paper, the author has presented a new method that views object detection as a direct set
prediction problem. The approach streamlines the detection pipeline, effectively removing the
need for many hand-designed components like a non-maximum suppression procedure or anchor
generation that explicitly encode our prior knowledge about the task. The main ingredients of
the new framework, called Detection Transformer or DETR, are a set-based global loss that
forces unique predictions via bipartite matching, and a transformer encoder-decoder
architecture. Given a fixed small set of learned object queries, DETR reasons about the relations
of the objects and the global image context to directly output the final set of predictions in
parallel. The new model is conceptually simple and does not require a specialized library, unlike
many other modern detectors. DETR demonstrates accuracy and run-time performance on par
with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO
object detection dataset.
Advantage: Accuracy is High.
Disadvantage: Difficult to train the model.
CHAPTER-3
METHODOLOGY
3.1 Method
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures
and their efficient implementations can be used almost out of the box. The standard Transformer
receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image
x ∈ R^(H×W×C) into a sequence of flattened 2D patches x_p ∈ R^(N×(P^2·C)),
where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the
resolution of each image patch, and N = HW/P^2 is the resulting number of patches, which also
serves as the effective input sequence length for the Transformer. The Transformer uses a constant
latent vector size D through all of its layers, so we flatten the patches and map them to D dimensions
with a trainable linear projection. We refer to the output of this projection as the patch
embeddings.
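As a minimal sketch of this patchify-and-project step (assuming PyTorch; the class and variable names are mine, not the authors' code), an image of shape (C, H, W) is cut into N = HW/P^2 flattened patches and each patch is mapped to D dimensions by one trainable linear layer:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Reshape an image into N = HW/P^2 flattened P x P patches and project them to D dims."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.p = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.p
        # (B, C, H/p, p, W/p, p) -> (B, H/p, W/p, p, p, C) -> (B, N, p*p*C)
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
        return self.proj(x)                      # (B, N, D) patch embeddings

# e.g. a 224x224 RGB image gives 14*14 = 196 patch embeddings of size 768
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])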
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of
embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0)
serves as the image representation y. Both during pre-training and fine-tuning, a classification
head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer
at pre-training time and by a single linear layer at fine-tuning time.
Position embeddings are added to the patch embeddings to retain positional information. We use
standard learnable 1D position embeddings, since we have not observed significant performance
gains from using more advanced 2D-aware position embeddings. The resulting sequence of
embedding vectors serves as input to the encoder.
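Continuing the sketch (PyTorch; dimensions are assumed for illustration): a learnable [class] token is prepended to the patch embeddings, and learnable 1D position embeddings are added before the sequence enters the encoder.

import torch
import torch.nn as nn

class TokenPreparation(nn.Module):
    """Prepend a learnable [class] token and add learnable 1D position embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # one position embedding per token, including the [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_embeddings):                 # (B, N, D)
        B = patch_embeddings.shape[0]
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, D)
        z0 = torch.cat([cls, patch_embeddings], dim=1)   # (B, N + 1, D)
        return z0 + self.pos_embed                       # input sequence to the encoder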
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded
self attention and MLP blocks. Layernorm (LN) is applied before every block, and residual
connections after every block (Wang et al., 2019; Baevski & Auli, 2019).
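A hedged sketch of one such encoder layer follows (PyTorch; the MLP expansion factor of 4 and the GELU non-linearity are assumptions matching the standard ViT configurations, not details given in this report):

import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: LN -> multi-head self-attention -> residual,
    then LN -> MLP -> residual (pre-norm layout)."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # residual after attention
        z = z + self.mlp(self.norm2(z))                    # residual after MLP
        return z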
Inductive Bias
We note that Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance
are baked into each layer throughout the whole model. In ViT, only MLP layers are local and
translationally equivariant, while the self-attention layers are global. The two-dimensional
neighborhood structure is used very sparingly: in the beginning of the model by cutting the
image into patches and at fine-tuning time for adjusting the position embeddings for images of
different resolution. Other than that, the position embeddings at initialization time carry no
information about the 2D positions of the patches and all spatial relations between the patches
have to be learned from scratch.
Hybrid Architecture
As an alternative to raw image patches, the input sequence can be formed from feature
maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E is
applied to patches extracted from a CNN feature map. As a special case, the patches can have
spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial
dimensions of the feature map and projecting to the Transformer dimension. The classification
input embedding and position embeddings are added.
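As a rough sketch of this hybrid variant (torchvision's ResNet-50 is used here purely as an illustrative backbone choice, not necessarily the exact backbone from the paper): the CNN feature map is flattened spatially, so each 1x1 spatial position becomes one token, and projected to the Transformer width D.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbedding(nn.Module):
    """Use a CNN feature map instead of raw patches; 1x1 'patches' of the map
    are flattened and linearly projected to the Transformer width D."""
    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet50()
        # keep everything up to (and including) the last convolutional stage
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, x):                         # x: (B, 3, H, W)
        fmap = self.cnn(x)                        # (B, 2048, h, w)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, h*w, 2048)
        return self.proj(tokens)                  # (B, h*w, D)

# e.g. a 224x224 image yields a 7x7 feature map -> 49 tokens of size 768
print(HybridEmbedding()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 49, 768])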
3.2 Fine-tuning and Higher Resolution
Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For
this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward
layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher
resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding
images of higher resolution, we keep the patch size the same, which results in a larger effective
sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory
constraints), however, the pre-trained position embeddings may no longer be meaningful. We
therefore perform 2D interpolation of the pre-trained position embeddings, according to their
location in the original image.
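A sketch of this 2D interpolation of pre-trained position embeddings (PyTorch; the grid sizes and names are illustrative assumptions): the patch position embeddings are reshaped back to their original 2D grid, resized bicubically to the new grid, and flattened again, while the [class] token's position embedding is left untouched.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, 1 + old_grid**2, D) learned at pre-training resolution.
    Returns (1, 1 + new_grid**2, D) for fine-tuning at a higher resolution."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]        # keep [class] slot as-is
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. pre-trained at 224 (14x14 patches), fine-tuned at 384 (24x24 patches)
pe = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pe, 14, 24).shape)  # torch.Size([1, 577, 768])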
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and
the hybrid. To understand the data requirements of each model, we pre-train on datasets of
varying size and evaluate many benchmark tasks. When considering the computational cost of
pre-training the model, ViT performs very favourably, attaining state of the art on most
recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment
using self-supervision, and show that self-supervised ViT holds promise for the future.
CHAPTER-4
WORKING PRINCIPLE
4.1 How it Works
Transformers lack the inductive biases of Convolutional Neural Networks (CNNs), such as
translation invariance and a locally restricted receptive field. Invariance means that we can
recognize an entity (i.e. object) in an image, even when its appearance or position
varies. Translation in computer vision implies that each image pixel has been moved by a fixed
amount in a particular direction. Convolution is a linear local operator. We see only the neighbor
values as indicated by the kernel. On the other hand, the transformer is by design permutation
invariant. The bad news is that it cannot process grid-structured data directly; it needs a
sequence of tokens. The image is therefore converted into a sequence of patches, and the total
architecture is called the Vision Transformer (ViT). The steps are:
Split an image into patches.
Flatten the patches.
Produce lower-dimensional linear embeddings from the flattened patches.
Add positional embeddings.
Feed the sequence as an input to a standard transformer encoder.
Pretrain the model with image labels (fully supervised on a huge dataset).
Finetune on the downstream dataset for image classification.
Image patches are basically the sequence tokens (like words). In fact, the encoder block is
identical to the original transformer proposed by Vaswani et al. (2017).
The only thing that changes is the number of those blocks. To this end, and to further prove that
with more data they can train larger ViT variants, three models were proposed: ViT-Base (12
layers, hidden size D = 768, MLP size 3072, 12 attention heads, about 86M parameters), ViT-Large
(24 layers, D = 1024, MLP size 4096, 16 heads, about 307M parameters), and ViT-Huge (32 layers,
D = 1280, MLP size 5120, 16 heads, about 632M parameters).
Heads refer to multi-head attention, while the MLP size is the hidden dimension of the feedforward
block in each encoder layer. MLP stands for multi-layer perceptron; in practice it is a stack of
linear layers with a non-linearity in between.
Hidden size D is the embedding size, which is kept fixed throughout the layers.
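Putting the pieces together, below is a compact end-to-end sketch of a ViT-Base-sized classifier (PyTorch; it leans on torch.nn.TransformerEncoder for brevity, so it is an approximation of the described architecture rather than the authors' implementation):

import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """ViT-Base-like classifier: patchify -> linear embed -> [class] token +
    position embeddings -> 12 pre-norm encoder layers -> linear head."""
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 dim=768, depth=12, heads=12, mlp_dim=3072, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.p = patch_size
        self.embed = nn.Linear(channels * patch_size ** 2, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # (B, C, H, W)
        B, C, H, W = x.shape
        p = self.p
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, -1, C * p * p)
        z = self.embed(x)                                    # (B, N, D)
        z = torch.cat([self.cls_token.expand(B, -1, -1), z], dim=1) + self.pos_embed
        z = self.encoder(z)
        return self.head(z[:, 0])                            # classify from the [class] token

logits = SimpleViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])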
Even though many positional embedding schemes were applied, no significant difference was
found. This is probably due to the fact that the transformer encoder operates on a patch-level.
Learning embeddings that capture the order relationships between patches (spatial information)
is not so crucial. It is relatively easier to understand the relationships between patches of P x P
than of a full image Height x Width.
Hence, after the low-dimensional linear projection, a trainable position embedding is added to
the patch representations. It is interesting to see what these position embeddings look like after
training: patches that are spatially close end up with similar position embeddings, so the learned
embeddings recover the 2D topology of the image.
We don’t need successive convolutional layers to reach pixels that are 128 positions away anymore.
With convolutions without dilation, the receptive field grows only linearly with depth. Using
self-attention, we have interaction between all pixel (patch) representations already in the 1st
layer, between pairs of such representations in the 2nd layer, and so on.
Fig 4.1.5: Right: receptive-field growth, generated using the Fomoro AI calculator; Left:
attention-distance plot from Dosovitskiy et al., 2020.
Based on the diagram on the left from ViT, one can argue that:
There are indeed heads that attend to the whole patch already in the early layers. One can
justify the performance gain based on this early access to pixel interactions. It seems more
critical for the early layers to have access to the whole patch (global info). In other words,
the heads that belong to the upper left part of the image may be the core reason for
superior performance.
Interestingly, the attention distance increases with network depth similar to the receptive
field of local operations.
There are also attention heads with consistently small attention distances in the low
layers. On the right, a 24-layer network with standard 3x3 convolutions has a receptive field
of less than 50 pixels. We would need approximately 50 conv layers to reach a receptive field
of about 100, without dilation or pooling layers (a short calculation illustrating this follows
after this list).
To enforce this idea of highly localized attention heads, the authors experimented with
hybrid models that apply a ResNet before the Transformer. They found less highly
localized heads, as expected. Along with filter visualization, it suggests that it may serve
a similar function as early convolutional layers in CNNs.
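To back the receptive-field numbers above with a small calculation (my own sketch, using the standard formula RF = L*(k-1) + 1 for L stacked k x k convolutions with stride 1 and no dilation):

# Receptive field of L stacked 3x3 convolutions (stride 1, no dilation):
# each extra layer widens the field by 2 pixels.
def receptive_field(num_layers, kernel_size=3):
    return num_layers * (kernel_size - 1) + 1

print(receptive_field(24))   # 49  -> a 24-layer stack sees fewer than 50 pixels
print(receptive_field(50))   # 101 -> ~50 layers are needed for a ~100-pixel field
# A self-attention layer, by contrast, lets every token interact with every
# other token from the very first layer.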
Attention distance was computed as the average distance between the query pixel and
the rest of the patch, multiplied by the attention weight. They used 128 example images and
averaged their results.
An example: if a pixel is 20 pixels away and the attention weight is 0.5, the contributed distance is 10.
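A small sketch of this metric (NumPy; the attention-map shape, the patch-grid units, and the averaging scheme are my reading of the description above, not the exact code used in the paper):

import numpy as np

def mean_attention_distance(attn, grid):
    """attn: (num_tokens, num_tokens) attention weights over a grid x grid layout
    (rows sum to 1). Returns the attention-weighted average spatial distance."""
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"),
                      axis=-1).reshape(-1, 2)                   # (grid*grid, 2) positions
    # pairwise Euclidean distances between query and key positions
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # weight each distance by the attention paid to it, then average over queries
    return float((attn * dist).sum(axis=1).mean())

# uniform attention over a 4x4 grid of tokens: prints the average pairwise distance
uniform = np.full((16, 16), 1 / 16)
print(mean_attention_distance(uniform, grid=4))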
Finally, the model attends to image regions that are semantically relevant for classification, as
the attention-map visualizations in the original paper illustrate.
CHAPTER-5
SCOPE AND APPLICATIONS
5.1 Scope
It is common to train large versions of these models and fine-tune them for different
tasks, so they are useful even when the data is scarce.
REFERENCES
[1] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language
models are few-shot learners. arXiv, 2020.
[2] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In ACL,
2020.
[3] Alexei Baevski and Michael Auli. Adaptive input representations for neural language
modeling. In ICLR, 2019.
[4] I. Bello, B. Zoph, Q. Le, A. Vaswani, and J. Shlens. Attention augmented convolutio na l
networks. In ICCV, 2019.
[5] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by
maximizing mutual information across views. In NeurIPS, 2019.
[6] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image
recognition. In ICCV, 2019.
[7] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT:
A Simple and Performant Baseline for Vision and Language. arXiv, 2019.
[8] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining Task-Agnostic
Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS, 2019.
[9] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. Technical Report, 2019.
[10] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and
Jon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[11] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert:
A joint model for video and language representation learning. In ICCV, 2019.
[12] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test
resolution discrepancy. In NeurIPS, 2019.
[13] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S.
Chao. Learning deep transformer models for machine translation. In ACL, 2019.
[14] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video
models. In ICLR, 2019.
[15] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme,
Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al.
A large-scale study of representation learning with the visual task adaptation benchmark. arXiv
preprint arXiv:1910.04867, 2019.