
This is the official PyTorch implementation of the paper: Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration


Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

Yuhang Han1* , Xuyang Liu2*, Zihan Zhang3, Pengxiang Ding4, Donglin Wang4,
Honggang Chen2, Qingsen Yan1, Siteng Huang5✉

1Northwestern Polytechnical University, 2Sichuan University,
3Johns Hopkins University, 4Westlake University, 5Zhejiang University


🔥 News

  • 2025.01.10 🤗🤗 We release our latest work GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution MLLMs. Code is available!
  • 2024.11.17 🤗🤗 We release our work FiCoCo which proposes a unified paradigm to demystify the popular works and guide the future designs of training-free token reduction for MLLMs.

👀 Overview


TLDR: This study introduces a unified "filter-correlate-compress" paradigm to streamline training-free token reduction in Multimodal Large Language Models (MLLMs), achieving up to 82.4% FLOPs reduction with minimal performance impact and outperforming existing methods across 10 benchmarks.
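To make the three-stage paradigm concrete, here is a minimal illustrative sketch in PyTorch. The scoring and merging choices (attention-received importance, cosine-similarity matching, pairwise averaging) are assumptions for illustration only, not the repository's actual implementation; see the paper for the exact formulation.

```python
import torch

def reduce_tokens(tokens: torch.Tensor, attn: torch.Tensor, r: int) -> torch.Tensor:
    """Illustrative filter-correlate-compress step.

    tokens: (N, D) token features; attn: (N, N) attention map; r: tokens to remove.
    Returns (N - r, D) compressed token features.
    """
    # Filter: score each token (here: mean attention it receives) and
    # mark the r lowest-scoring tokens for removal.
    scores = attn.mean(dim=0)                                   # (N,)
    discard = scores.topk(r, largest=False).indices             # (r,)
    discard_set = set(discard.tolist())
    keep = torch.tensor([i for i in range(tokens.size(0)) if i not in discard_set])

    # Correlate: match each discarded token to its most similar kept token
    # via cosine similarity.
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    sim = feats @ feats.T                                       # (N, N)
    targets = sim[discard][:, keep].argmax(dim=-1)              # index into `keep`

    # Compress: merge each discarded token into its matched kept token
    # (here by simple averaging).
    merged = tokens[keep].clone()
    for d, t in zip(discard.tolist(), targets.tolist()):
        merged[t] = (merged[t] + tokens[d]) / 2
    return merged
```

Real implementations typically fold this into the attention block of each transformer layer so the similarity and importance scores come for free from quantities already computed.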

🛠 Preparation

1. Clone this repository.

   ```bash
   git clone https://github.com/kawhiiiileo/FiCoCo.git
   cd FiCoCo
   ```

2. Set up the environment.

   ```bash
   conda create -n FiCoCo python=3.10 -y
   conda activate FiCoCo
   pip install -e .
   ```

3. Download the multimodal benchmarks, following the detailed instructions in LLaVA-Evaluation.

4. Download LLaVA and put it under ./liuhaotian/llava-v1.5-7b.

🚀 Run and Evaluation

To configure FiCoCo, update the corresponding settings in your code or configuration file. For example:

```yaml
merge_visual: true          # Enable FiCoCo-V (token compression in the visual encoder)
AT: true                    # Enable FiCoCo-L (token compression within the LLM)
r: 42                       # Number of tokens compressed per layer
control_encoding_layer: 11  # Start compression from the 12th transformer layer (layer index 11)
```
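As a back-of-the-envelope check of what `r=42` and `control_encoding_layer=11` imply, the sketch below tracks the visual token count across encoder layers, assuming LLaVA-1.5's 576 visual tokens and a 24-layer visual encoder (these defaults are assumptions for illustration; consult the paper for the exact schedule).

```python
def tokens_per_layer(n_tokens: int = 576, n_layers: int = 24,
                     r: int = 42, start_layer: int = 11) -> list[int]:
    """Token count surviving each layer when r tokens are compressed
    per layer, starting from `start_layer` (0-indexed)."""
    counts = []
    for layer in range(n_layers):
        if layer >= start_layer:
            n_tokens = max(n_tokens - r, 0)  # never go below zero
        counts.append(n_tokens)
    return counts

print(tokens_per_layer()[-1])  # tokens surviving the final layer → 30
```

Under these assumptions, 13 layers each drop 42 tokens, leaving 30 of the original 576 visual tokens by the final encoder layer.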

Example for evaluating SQA results (r=42, control_encoding_layer=11, merge_visual=True):

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```

To calculate the theoretical computational efficiency reported above, we recommend the methodology of LLM-Viewer. We deeply appreciate their outstanding contribution to this field.
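For a quick sanity check of such savings without a full profiler, a common textbook approximation models per-layer transformer FLOPs as 24·n·d² for the linear projections plus 4·n²·d for attention (n tokens, model width d). The helper below uses that approximation; it is a rough estimate, not the LLM-Viewer methodology, and the default width of 4096 (Vicuna-7B-scale) is an assumption.

```python
def layer_flops(n_tokens: int, d_model: int) -> float:
    """Approximate forward FLOPs of one transformer layer:
    24*n*d^2 (QKV/output/MLP projections) + 4*n^2*d (attention scores + values)."""
    return 24 * n_tokens * d_model**2 + 4 * n_tokens**2 * d_model

def flops_reduction(n_before: int, n_after: int, d_model: int = 4096) -> float:
    """Fractional per-layer FLOPs saved when the token count drops
    from n_before to n_after."""
    return 1 - layer_flops(n_after, d_model) / layer_flops(n_before, d_model)
```

Because the projection term is linear in n, per-layer savings track the token-reduction ratio closely, with extra gains from the quadratic attention term.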

🚀 Exploring Without CLS Token

Considering that some MLLM visual encoders do not use a [CLS] token, we propose a feasible alternative; detailed results can be found in the paper.

📌 Citation

If you use FiCoCo in your research, please cite our work by using the following BibTeX entry:

```bibtex
@misc{han2025filtercorrelatecompresstrainingfree,
      title={Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration}, 
      author={Yuhang Han and Xuyang Liu and Zihan Zhang and Pengxiang Ding and Donglin Wang and Honggang Chen and Qingsen Yan and Siteng Huang},
      year={2025},
      eprint={2411.17686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17686}, 
}
```

👍 Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA, ToMe and Open-LLaVA-NeXT.

📧 Contact

For any question about our paper or code, please email yuhangh984@gmail.com or liuxuyang@stu.scu.edu.cn.
