Abstract
Vision Transformers (ViTs) have garnered significant attention for their superior performance in visual recognition. However, they face two practical challenges: high computational cost and vulnerability to adversarial attacks. To overcome these issues, we propose a novel automatic search framework for adversarially robust and GPU-friendly sparse Vision Transformers. Our approach uses a complexity-aware search to assign a connection pattern to each transformer layer, together with an information bottleneck-driven N:M pruning metric that determines which weights to prune in the sparse layers. Experimental results demonstrate that our method reduces parameters by 45.52% to 48.49% with minimal impact on accuracy and adversarial robustness, making it a practical solution for deploying ViTs in resource-constrained and security-critical scenarios.
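For readers unfamiliar with the N:M sparsity pattern the abstract refers to, the snippet below is a minimal PyTorch sketch of applying a 2:4 mask to a ViT linear layer. It is illustrative only: the `nm_prune_mask` helper is hypothetical, and the plain magnitude score stands in for the paper's information-bottleneck-driven metric, which (like the complexity-aware layer-wise search) is not reproduced here.

```python
# Illustrative sketch of GPU-friendly N:M structured pruning (e.g., 2:4) on a
# transformer linear layer. The importance score here is plain weight magnitude,
# used only as a placeholder for the information-bottleneck-driven metric.
import torch


def nm_prune_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a binary mask keeping the n highest-scoring weights in every
    group of m consecutive weights along the input dimension."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"
    # Placeholder importance score: |w|. The paper derives this score from an
    # information bottleneck objective instead (not reproduced here).
    score = weight.abs().reshape(out_features, in_features // m, m)
    # Indices of the top-n scores inside each group of m weights.
    topk = score.topk(n, dim=-1).indices
    mask = torch.zeros_like(score)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(out_features, in_features)


if __name__ == "__main__":
    layer = torch.nn.Linear(768, 3072)  # hypothetical ViT MLP projection
    mask = nm_prune_mask(layer.weight.data, n=2, m=4)
    layer.weight.data *= mask           # enforce 2:4 sparsity in place
    print(f"kept {mask.mean().item():.0%} of weights in every 4-weight group")
```

With n=2 and m=4 the mask keeps exactly half of the weights, arranged in the 2:4 pattern that sparse tensor cores can exploit for speedups.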
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant No. 62272459. We would like to thank Dr. Kai Wang for his help with technical discussions and paper writing, and the anonymous reviewers for their valuable comments and suggestions.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Su, Y. et al. (2025). RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers. In: Ide, I., et al. MultiMedia Modeling. MMM 2025. Lecture Notes in Computer Science, vol 15522. Springer, Singapore. https://doi.org/10.1007/978-981-96-2064-7_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-2063-0
Online ISBN: 978-981-96-2064-7
eBook Packages: Computer Science, Computer Science (R0)