Visvesvaraya Technological University
A Seminar Report On
“AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE”
Submitted in partial fulfilment of the requirement for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
NIKHIL G R 1SJ17CS050
S J C INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CHIKKABALLAPUR-562101
2020-2021
||Jai Sri Gurudev||
Sri Adichunchanagiri Shikshana Trust®
CERTIFICATE
This is to certify that the Seminar entitled “AN IMAGE IS WORTH 16X16 WORDS
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE” is a bonafide work
carried out by Nikhil G R (1SJ17CS050) in partial fulfillment for the award of Bachelor
of Engineering in Computer Science and Engineering in Eighth Semester of the
Visvesvaraya Technological University, Belagavi during the year 2021. It is certified
that all corrections/suggestions indicated for internal assessment have been incorporated
in the report. The seminar report has been approved as it satisfies the academic
requirements with respect to eighth semester prescribed for the above said degree.
ACKNOWLEDGEMENT
With reverential pranam, we express our sincere gratitude and salutations to the feet of his
holiness Byravaikya Padmabhushana Sri Sri Sri Dr. Balagangadharanatha Maha Swamiji, and
his holiness Jagadguru Sri Sri Sri Dr. Nirmalanandanatha Swamiji of Sri Adichunchanagiri
Mutt for their unlimited blessings. First and foremost, I wish to express my deep and sincere
gratitude to our institution, Sri Jagadguru Chandrashekaranatha Swamiji Institute of
Technology, for providing me an opportunity to complete my seminar successfully.
I extend deep sense of sincere gratitude to Dr. Ravi Kumar K M, Principal, S J C
Institute of Technology, Chickballapur, for providing an opportunity to complete the
Seminar.
I extend special in-depth, heartfelt, and sincere gratitude to Dr. Anitha T N, Head of the
Department, Computer Science and Engineering, S J C Institute of Technology,
Chickballapur, for her constant support and valuable guidance for the Seminar.
I convey my sincere thanks to my guide Dr. Anitha T N, Professor & HOD, Department
of Computer Science and Engineering, S J C Institute of Technology, for her constant
support, valuable guidance and suggestions for the Seminar.
I also feel immense pleasure to express deep and profound gratitude to Seminar
Coordinator Prof. Pradeep Kumar G M, Assistant Professor, Department of Computer
Science and Engineering, S J C Institute of Technology, for his guidance and suggestions
for the Seminar.
Finally, I would like to thank all faculty members of Department of Computer Science and
Engineering, S J C Institute of Technology, Chickballapur for their support.
NIKHIL G R (1SJ17CS050)
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural language
processing tasks, its applications to computer vision remain limited. In vision, attention is either
applied in conjunction with convolutional networks, or used to replace certain components of
convolutional networks while keeping their overall structure in place. We show that this reliance
on CNNs is not necessary and a pure transformer applied directly to sequences of image patches
can perform very well on image classification tasks. When pre-trained on large amounts of data and
transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100,
VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art
convolutional networks while requiring substantially fewer computational resources to train.
CONTENTS
Acknowledgement
Abstract
Contents
List of Figures
1 INTRODUCTION
1.1 Overview
1.2 Problem Statement
2 LITERATURE SURVEY
3 METHODOLOGY
3.1 Method
3.2 Fine-tuning and Higher Resolution
4 WORKING PRINCIPLE
4.1 How it Works
4.2 Positional Embeddings
5 SCOPE AND APPLICATIONS
5.1 Scope
5.2 Applications
6 CONCLUSION
7 REFERENCES
LIST OF FIGURES
CHAPTER-1
INTRODUCTION
1.1 Overview
The Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and
transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset
or the in-house JFT-300M dataset, ViT approaches or beats the state of the art on multiple image
recognition benchmarks. In particular, the best model reaches an accuracy of 88.55% on ImageNet,
90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have
since become the state of the art method in many NLP tasks. Large Transformer-based
models are often pre-trained on large corpora and then fine-tuned for the task at hand:
BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the
GPT line of work uses language modeling as its pre-training task (Radford et al., 2018;
2019; Brown et al., 2020).
Naive application of self-attention to images would require that each pixel attends to
every other pixel. With quadratic cost in the number of pixels, this does not scale to
realistic input sizes. Thus, to apply Transformers in the context of image processing,
several approximations have been tried in the past.
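To make this quadratic cost concrete, here is a small back-of-the-envelope calculation in Python (my own illustration, assuming a 224x224 input and the 16x16 patch size used by ViT): the attention matrix has sequence_length squared entries, so attending over raw pixels is roughly 65,000 times more expensive than attending over patches.

# Self-attention cost grows quadratically with the number of tokens,
# because every token attends to every other token.
H, W, P = 224, 224, 16                # assumed input resolution and patch size

pixel_tokens = H * W                  # one token per pixel
patch_tokens = (H // P) * (W // P)    # N = HW / P^2 tokens when using 16x16 patches

print(f"pixel tokens: {pixel_tokens:,} -> attention entries: {pixel_tokens ** 2:,}")
print(f"patch tokens: {patch_tokens:,} -> attention entries: {patch_tokens ** 2:,}")
# pixel tokens: 50,176 -> attention entries: 2,517,630,976
# patch tokens: 196 -> attention entries: 38,416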
Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of
size 2×2 from the input image and applies full self-attention on top. This model is very
similar to ViT, but the ViT work goes further to demonstrate that large-scale pre-training
makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs.
There has also been a lot of interest in combining convolutional neural networks (CNNs)
with forms of self-attention, e.g. by augmenting feature maps for image classification
(Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g.
for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al.,
2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object
discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et
al., 2019; Li et al., 2020).
CHAPTER-2
LITERATURE SURVEY
In this paper, the authors observe that in the Transformer model, self-attention combines
information from attended embeddings into the representation of the focal embedding in the next
layer. Thus, across layers of the Transformer, information originating from different tokens gets
increasingly mixed. This makes attention weights unreliable as explanation probes. The paper
therefore considers the problem of quantifying the flow of information through self-attention, and
proposes two post-hoc methods for approximating the attention to input tokens given attention
weights, attention rollout and attention flow, to be used when attention weights are taken as the
relative relevance of the input tokens.
Advantage: Self Attention is used.
Disadvantage: Accuracy of the model is low.
In this paper, the authors ask whether recent progress on the ImageNet classification benchmark
continues to represent meaningful generalization, or whether the community has started to overfit
to the idiosyncrasies of its labeling procedure. To answer this, they developed a significantly
more robust procedure for collecting human annotations of the ImageNet validation set. Using
these new labels, they re-assessed the accuracy of recently proposed ImageNet classifiers and
found their gains to be substantially smaller than those reported on the original labels.
Furthermore, they found the original ImageNet labels to no longer be the best predictors of this
independently collected set, indicating that their usefulness in evaluating vision models may be
nearing an end. Nevertheless, they found the new annotation procedure to have largely remedied
the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future
research in visual recognition.
Advantage: Quality Assurance.
Disadvantage: Validation of data.
In this paper, the author has presented a new method that views object detection as a direct set
prediction problem. The approach streamlines the detection pipeline, effectively removing the
need for many hand-designed components like a non-maximum suppression procedure or anchor
generation that explicitly encode our prior knowledge about the task. The main ingredients of
the new framework, called Detection Transformer or DETR, are a set-based global loss that
forces unique predictions via bipartite matching, and a transformer encoder-decoder
architecture. Given a fixed small set of learned object queries, DETR reasons about the relations
of the objects and the global image context to directly output the final set of predictions in
parallel. The new model is conceptually simple and does not require a specialized library, unlike
many other modern detectors. DETR demonstrates accuracy and run-time performance on par
with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO
object detection dataset.
Advantage: Accuracy is High.
Disadvantage: Difficult to train the model.
CHAPTER-3
METHODOLOGY
3.1 Method
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures
and their efficient implementations can be used almost out of the box. The standard Transformer
receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image
x ∈ R^(H×W×C) into a sequence of flattened 2D patches x_p ∈ R^(N×(P^2·C)),
where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the
resolution of each image patch, and N = HW/P^2 is the resulting number of patches, which also
serves as the effective input sequence length for the Transformer. The Transformer uses a constant
latent vector size D through all of its layers, so we flatten the patches and map them to D dimensions
with a trainable linear projection. We refer to the output of this projection as the patch
embeddings.
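As a minimal sketch of this patchify-and-project step (assuming PyTorch; the class and variable names are mine, not the authors' code), an image of shape (C, H, W) is cut into N = HW/P^2 flattened patches and each patch is mapped to D dimensions by one trainable linear layer:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Reshape an image into N = HW/P^2 flattened P x P patches and project them to D dims."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.p = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.p
        # (B, C, H/p, p, W/p, p) -> (B, H/p, W/p, p, p, C) -> (B, N, p*p*C)
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
        return self.proj(x)                      # (B, N, D) patch embeddings

# e.g. a 224x224 RGB image gives 14*14 = 196 patch embeddings of size 768
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])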
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of
embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0)
serves as the image representation y. Both during pre-training and fine-tuning, a classification
head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer
at pre-training time and by a single linear layer at fine-tuning time.
Position embeddings are added to the patch embeddings to retain positional information. We use
standard learnable 1D position embeddings, since we have not observed significant performance
gains from using more advanced 2D-aware position embeddings. The resulting sequence of
embedding vectors serves as input to the encoder.
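Continuing the sketch (PyTorch; dimensions are assumed for illustration): a learnable [class] token is prepended to the patch embeddings, and learnable 1D position embeddings are added before the sequence enters the encoder.

import torch
import torch.nn as nn

class TokenPreparation(nn.Module):
    """Prepend a learnable [class] token and add learnable 1D position embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # one position embedding per token, including the [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_embeddings):                 # (B, N, D)
        B = patch_embeddings.shape[0]
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, D)
        z0 = torch.cat([cls, patch_embeddings], dim=1)   # (B, N + 1, D)
        return z0 + self.pos_embed                       # input sequence to the encoder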
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded
self attention and MLP blocks. Layernorm (LN) is applied before every block, and residual
connections after every block (Wang et al., 2019; Baevski & Auli, 2019).
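A hedged sketch of one such encoder layer follows (PyTorch; the MLP expansion factor of 4 and the GELU non-linearity are assumptions matching the standard ViT configurations, not details given in this report):

import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: LN -> multi-head self-attention -> residual,
    then LN -> MLP -> residual (pre-norm layout)."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # residual after attention
        z = z + self.mlp(self.norm2(z))                    # residual after MLP
        return z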
Inductive Bias
We note that Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance
are baked into each layer throughout the whole model. In ViT, only MLP layers are local and
translationally equivariant, while the self-attention layers are global. The two-dimensional
neighborhood structure is used very sparingly: in the beginning of the model by cutting the
image into patches and at fine-tuning time for adjusting the position embeddings for images of
different resolution. Other than that, the position embeddings at initialization time carry no
information about the 2D positions of the patches and all spatial relations between the patches
have to be learned from scratch.
Hybrid Architecture
As an alternative to raw image patches, the input sequence can be formed from feature
maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E is
applied to patches extracted from a CNN feature map. As a special case, the patches can have
spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial
dimensions of the feature map and projecting to the Transformer dimension. The classification
input embedding and position embeddings are added.
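As a rough sketch of this hybrid variant (torchvision's ResNet-50 is used here purely as an illustrative backbone choice, not necessarily the exact backbone from the paper): the CNN feature map is flattened spatially, so each 1x1 spatial position becomes one token, and projected to the Transformer width D.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbedding(nn.Module):
    """Use a CNN feature map instead of raw patches; 1x1 'patches' of the map
    are flattened and linearly projected to the Transformer width D."""
    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet50()
        # keep everything up to (and including) the last convolutional stage
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, x):                         # x: (B, 3, H, W)
        fmap = self.cnn(x)                        # (B, 2048, h, w)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, h*w, 2048)
        return self.proj(tokens)                  # (B, h*w, D)

# e.g. a 224x224 image yields a 7x7 feature map -> 49 tokens of size 768
print(HybridEmbedding()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 49, 768])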
3.2 Fine-tuning and Higher Resolution
Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For
this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward
layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher
resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding
images of higher resolution, we keep the patch size the same, which results in a larger effective
sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory
constraints), however, the pre-trained position embeddings may no longer be meaningful. We
therefore perform 2D interpolation of the pre-trained position embeddings, according to their
location in the original image.
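A sketch of this 2D interpolation of pre-trained position embeddings (PyTorch; the grid sizes and names are illustrative assumptions): the patch position embeddings are reshaped back to their original 2D grid, resized bicubically to the new grid, and flattened again, while the [class] token's position embedding is left untouched.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, 1 + old_grid**2, D) learned at pre-training resolution.
    Returns (1, 1 + new_grid**2, D) for fine-tuning at a higher resolution."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]        # keep [class] slot as-is
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. pre-trained at 224 (14x14 patches), fine-tuned at 384 (24x24 patches)
pe = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pe, 14, 24).shape)  # torch.Size([1, 577, 768])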
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and
the hybrid. To understand the data requirements of each model, we pre-train on datasets of
varying size and evaluate many benchmark tasks. When considering the computational cost of
pre-training the model, ViT performs very favourably, attaining state of the art on most
recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment
using self-supervision, and show that self-supervised ViT holds promise for the future.
CHAPTER-4
WORKING PRINCIPLE
4.1 How it Works
Transformers lack the inductive biases of Convolutional Neural Networks (CNNs), such as
translation invariance and a locally restricted receptive field. Invariance means that we can
recognize an entity (i.e. object) in an image, even when its appearance or position
varies. Translation in computer vision implies that each image pixel has been moved by a fixed
amount in a particular direction. Convolution is a linear local operator. We see only the neighbor
values as indicated by the kernel. On the other hand, the transformer is by design permutation
invariant. The bad news is that it cannot process grid-structured data directly; it needs a
sequence of tokens. The image is therefore converted into a sequence of patches, and the total
architecture is called the Vision Transformer (ViT). The steps are:
Split an image into patches.
Flatten the patches.
Produce lower-dimensional linear embeddings from the flattened patches.
Add positional embeddings.
Feed the sequence as an input to a standard transformer encoder.
Pretrain the model with image labels (fully supervised on a huge dataset).
Finetune on the downstream dataset for image classification.
Image patches are basically the sequence tokens (like words). In fact, the encoder block is
identical to the original transformer proposed by Vaswani et al. (2017).
The only thing that changes is the number of those blocks. To this end, and to further prove that
with more data they can train larger ViT variants, three models were proposed: ViT-Base (12
layers, hidden size D = 768, MLP size 3072, 12 attention heads, about 86M parameters), ViT-Large
(24 layers, D = 1024, MLP size 4096, 16 heads, about 307M parameters), and ViT-Huge (32 layers,
D = 1280, MLP size 5120, 16 heads, about 632M parameters).
Heads refer to multi-head attention, while the MLP size is the hidden dimension of the feedforward
block in each encoder layer. MLP stands for multi-layer perceptron; in practice it is a stack of
linear layers with a non-linearity in between.
Hidden size D is the embedding size, which is kept fixed throughout the layers.
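Putting the pieces together, below is a compact end-to-end sketch of a ViT-Base-sized classifier (PyTorch; it leans on torch.nn.TransformerEncoder for brevity, so it is an approximation of the described architecture rather than the authors' implementation):

import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """ViT-Base-like classifier: patchify -> linear embed -> [class] token +
    position embeddings -> 12 pre-norm encoder layers -> linear head."""
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 dim=768, depth=12, heads=12, mlp_dim=3072, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.p = patch_size
        self.embed = nn.Linear(channels * patch_size ** 2, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # (B, C, H, W)
        B, C, H, W = x.shape
        p = self.p
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, -1, C * p * p)
        z = self.embed(x)                                    # (B, N, D)
        z = torch.cat([self.cls_token.expand(B, -1, -1), z], dim=1) + self.pos_embed
        z = self.encoder(z)
        return self.head(z[:, 0])                            # classify from the [class] token

logits = SimpleViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])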
Even though many positional embedding schemes were applied, no significant difference was
found. This is probably due to the fact that the transformer encoder operates on a patch-level.
Learning embeddings that capture the order relationships between patches (spatial information)
is not so crucial. It is relatively easier to understand the relationships between patches of P x P
than of a full image Height x Width.
Hence, after the low-dimensional linear projection, a trainable position embedding is added to
the patch representations. It is interesting to see what these position embeddings look like after
training: patches that are spatially close end up with similar position embeddings, so the learned
embeddings recover the 2D topology of the image.
We don’t need successive convolutional layers to reach pixels that are 128 positions away anymore.
With convolutions without dilation, the receptive field grows only linearly with depth. Using
self-attention, we have interaction between all pixel (patch) representations already in the 1st
layer, between pairs of such representations in the 2nd layer, and so on.
Fig 4.1.5: Right: receptive-field growth, generated using the Fomoro AI calculator; Left:
attention-distance plot from Dosovitskiy et al., 2020.
Based on the diagram on the left from ViT, one can argue that:
There are indeed heads that attend to the whole patch already in the early layers. One can
justify the performance gain based on this early access to pixel interactions. It seems more
critical for the early layers to have access to the whole patch (global info). In other words,
the heads that belong to the upper left part of the image may be the core reason for
superior performance.
Interestingly, the attention distance increases with network depth similar to the receptive
field of local operations.
There are also attention heads with consistently small attention distances in the low
layers. On the right, a 24-layer network with standard 3x3 convolutions has a receptive field
of less than 50 pixels. We would need approximately 50 conv layers to reach a receptive field
of about 100, without dilation or pooling layers (a short calculation illustrating this follows
after this list).
To enforce this idea of highly localized attention heads, the authors experimented with
hybrid models that apply a ResNet before the Transformer. They found less highly
localized heads, as expected. Along with filter visualization, it suggests that it may serve
a similar function as early convolutional layers in CNNs.
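To back the receptive-field numbers above with a small calculation (my own sketch, using the standard formula RF = L*(k-1) + 1 for L stacked k x k convolutions with stride 1 and no dilation):

# Receptive field of L stacked 3x3 convolutions (stride 1, no dilation):
# each extra layer widens the field by 2 pixels.
def receptive_field(num_layers, kernel_size=3):
    return num_layers * (kernel_size - 1) + 1

print(receptive_field(24))   # 49  -> a 24-layer stack sees fewer than 50 pixels
print(receptive_field(50))   # 101 -> ~50 layers are needed for a ~100-pixel field
# A self-attention layer, by contrast, lets every token interact with every
# other token from the very first layer.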
Attention distance was computed as the average distance between the query pixel and
the rest of the patch, multiplied by the attention weight. They used 128 example images and
averaged their results.
An example: if a pixel is 20 pixels away and the attention weight is 0.5, the contributed distance is 10.
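A small sketch of this metric (NumPy; the attention-map shape, the patch-grid units, and the averaging scheme are my reading of the description above, not the exact code used in the paper):

import numpy as np

def mean_attention_distance(attn, grid):
    """attn: (num_tokens, num_tokens) attention weights over a grid x grid layout
    (rows sum to 1). Returns the attention-weighted average spatial distance."""
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"),
                      axis=-1).reshape(-1, 2)                   # (grid*grid, 2) positions
    # pairwise Euclidean distances between query and key positions
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # weight each distance by the attention paid to it, then average over queries
    return float((attn * dist).sum(axis=1).mean())

# uniform attention over a 4x4 grid of tokens: prints the average pairwise distance
uniform = np.full((16, 16), 1 / 16)
print(mean_attention_distance(uniform, grid=4))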
Finally, the model attends to image regions that are semantically relevant for classification, as
the attention-map visualizations in the original paper illustrate.
CHAPTER-5
SCOPE AND APPLICATIONS
5.1 Scope
It is common to train large versions of these models and fine-tune them for different
tasks, so they are useful even when the data is scarce.
REFERENCES
[1] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language
models are few-shot learners. arXiv, 2020.
[2] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In ACL,
2020.
[3] Alexei Baevski and Michael Auli. Adaptive input representations for neural language
modeling. In ICLR, 2019.
[4] I. Bello, B. Zoph, Q. Le, A. Vaswani, and J. Shlens. Attention augmented convolutio na l
networks. In ICCV, 2019.
[5] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by
maximizing mutual information across views. In NeurIPS, 2019.
[6] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image
recognition. In ICCV, 2019.
[7] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT:
A Simple and Performant Baseline for Vision and Language. arXiv, 2019.
[8] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining Task-Agnostic
Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS, 2019.
[9] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. Technical Report, 2019.
[10] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and
Jon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[11] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert:
A joint model for video and language representation learning. In ICCV, 2019.
[12] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test
resolution discrepancy. In NeurIPS, 2019.
[13] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S.
Chao. Learning deep transformer models for machine translation. In ACL, 2019.
[14] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video
models. In ICLR, 2019.
[15] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme,
Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al.
A large-scale study of representation learning with the visual task adaptation benchmark. arXiv
preprint arXiv:1910.04867, 2019.