ResT V2: Simpler, Faster and Stronger

Zhang, Qing-Long; Yang, Yu-Bin

arXiv:2204.07366 (cs)

[Submitted on 15 Apr 2022 (v1), last revised 27 Sep 2022 (this version, v3)]

Abstract: This paper proposes ResTv2, a simpler, faster, and stronger multi-scale vision Transformer for visual recognition. ResTv2 simplifies the EMSA structure of ResTv1 (i.e., it eliminates the multi-head interaction part) and employs an upsample operation to reconstruct the medium- and high-frequency information lost in the downsampling operation. In addition, we explore different techniques to better apply ResTv2 backbones to downstream tasks. We found that although combining EMSAv2 and window attention can greatly reduce the theoretical matrix multiply FLOPs, it may significantly decrease the computation density, thus causing lower actual speed. We comprehensively validate ResTv2 on ImageNet classification, COCO detection, and ADE20K semantic segmentation. Experimental results show that the proposed ResTv2 can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResTv2 as a solid backbone. The code and models will be made publicly available at \url{this https URL}
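
The trade-off between theoretical matrix multiply FLOPs and computation density mentioned in the abstract can be illustrated with a rough count. The following is a minimal sketch, not the authors' code: it assumes that the downsampling shrinks the key/value feature map by a spatial `stride`, that window attention splits the map into non-overlapping `window` x `window` tiles, and that only the two attention matrix products are counted. All function names, parameter names, and the example sizes are hypothetical.

```python
def attention_matmul_macs(num_queries, num_keys, dim):
    """Multiply-accumulates for Q @ K^T plus attn @ V.

    Softmax, projections, and the up/downsampling layers are ignored.
    """
    return 2 * num_queries * num_keys * dim


def global_macs(height, width, dim):
    """Plain global self-attention: every token attends to every token."""
    n = height * width
    return attention_matmul_macs(n, n, dim)


def downsampled_macs(height, width, dim, stride):
    """Global attention with keys/values downsampled by `stride` (assumed)."""
    n = height * width
    m = (height // stride) * (width // stride)
    return attention_matmul_macs(n, m, dim)


def windowed_downsampled_macs(height, width, dim, stride, window):
    """Downsampled attention applied inside non-overlapping windows (assumed)."""
    num_windows = (height // window) * (width // window)
    queries_per_window = window * window
    keys_per_window = (window // stride) ** 2
    per_window = attention_matmul_macs(queries_per_window, keys_per_window, dim)
    return num_windows * per_window


if __name__ == "__main__":
    # Hypothetical sizes chosen only so that everything divides evenly.
    h = w = 64
    dim, stride, window = 96, 4, 8
    print("global:               ", global_macs(h, w, dim))
    print("downsampled K/V:      ", downsampled_macs(h, w, dim, stride))
    print("windowed + downsampled:", windowed_downsampled_macs(h, w, dim, stride, window))
```

Under these assumptions the windowed variant needs far fewer multiply-accumulates in total, but each window performs only a 64 x 4 attention product, so the work is split into many very small matrix multiplies. That is one plausible reading of why the abstract reports lower computation density and lower actual speed despite the reduced theoretical FLOPs.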

Comments:	ResTv2, a simpler, faster, and stronger multi-scale vision Transformer for visual recognition
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.07366 [cs.CV]
	(or arXiv:2204.07366v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.07366

Submission history

From: Qing-Long Zhang
[v1] Fri, 15 Apr 2022 07:57:40 UTC (544 KB)
[v2] Tue, 10 May 2022 13:12:53 UTC (544 KB)
[v3] Tue, 27 Sep 2022 07:01:18 UTC (548 KB)
