RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers

  • Conference paper

MultiMedia Modeling (MMM 2025)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15522)

Abstract

Vision Transformers (ViTs) have garnered significant attention for their superior performance in visual recognition. However, they face two practical challenges: high computational cost and vulnerability to adversarial attacks. To overcome these issues, we propose a novel automatic search framework for adversarially robust and GPU-friendly sparse vision transformers. Our approach uses a complexity-aware search to assign a different connection pattern to each transformer layer. Additionally, an information bottleneck-driven N:M pruning metric is used to determine which weights to prune in the sparse layers. Experimental results demonstrate that our method reduces parameters by 45.52% to 48.49%, with minimal impact on accuracy and adversarial robustness, making it a practical solution for deploying ViTs in resource-constrained and security-critical scenarios.
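
The abstract does not detail the search procedure or the information-bottleneck metric, but the GPU-friendly N:M sparsity pattern it relies on is straightforward to illustrate. The following is a minimal sketch, not the authors' implementation: it builds a 2:4 pruning mask for a weight matrix using plain weight magnitude as a placeholder importance score (the paper instead derives this score from an information bottleneck objective), keeping the two highest-scoring weights in every group of four along the input dimension. The 2:4 pattern and the 768x3072 example layer are illustrative choices only.

import torch

def nm_prune_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n highest-scoring weights in every group of
    m consecutive weights along the input dimension (N:M structured sparsity)."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"

    score = weight.abs()                                  # placeholder importance score
    groups = score.reshape(out_features, in_features // m, m)
    keep = groups.topk(n, dim=-1).indices                 # indices of weights to keep per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)                          # 1 = keep, 0 = prune
    return mask.reshape(out_features, in_features)

# Example: prune a ViT MLP projection to 2:4 sparsity (50% of weights zeroed).
w = torch.randn(768, 3072)
sparse_w = w * nm_prune_mask(w)

Because every group of four weights retains exactly two nonzeros, the resulting pattern matches the structured-sparsity format that recent GPUs can accelerate directly, which is what makes N:M pruning "GPU-friendly" in contrast to unstructured magnitude pruning.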

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 62272459). We would like to thank Dr. Kai Wang for his help in technical discussions and paper writing. We also wish to thank the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Rui Hou.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Su, Y. et al. (2025). RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers. In: Ide, I., et al. MultiMedia Modeling. MMM 2025. Lecture Notes in Computer Science, vol 15522. Springer, Singapore. https://doi.org/10.1007/978-981-96-2064-7_23

  • DOI: https://doi.org/10.1007/978-981-96-2064-7_23

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-2063-0

  • Online ISBN: 978-981-96-2064-7

  • eBook Packages: Computer Science, Computer Science (R0)
