Yuhang Han1* ,
Xuyang Liu2*,
Zihan Zhang3,
Pengxiang Ding4,
Donglin Wang4,
Honggang Chen2,
Qingsen Yan1,
Siteng Huang5✉
1Northwestern Polytechnical University, 2Sichuan University,
3Johns Hopkins University, 4Westlake University, 5Zhejiang University
- **2025.01.10** 🤗 We release our latest work GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution MLLMs. Code is available!
- **2024.11.17** 🤗 We release our work FiCoCo, which proposes a unified paradigm to demystify popular works and guide future designs of training-free token reduction for MLLMs.
**TL;DR:** This study introduces a unified "filter-correlate-compress" paradigm to streamline training-free token reduction in Multimodal Large Language Models (MLLMs), achieving up to 82.4% FLOPs reduction with minimal performance impact and outperforming existing methods across 10 benchmarks.
- Clone this repository.

  ```shell
  git clone https://github.com/kawhiiiileo/FiCoCo.git
  cd FiCoCo
  ```

- Environment Setup and Preparation

  ```shell
  conda create -n FiCoCo python=3.10 -y
  conda activate FiCoCo
  pip install -e .
  ```

- Download Multimodal Benchmark

  Please follow the detailed instructions in LLaVA-Evaluation.

- Download LLaVA and put it under `./liuhaotian/llava-v1.5-7b`.
To configure FiCoCo with these parameters, update the corresponding settings in your code or configuration file. For example:

```yaml
merge_visual: true          # Enable FiCoCo-V for visual token compression
AT: true                    # Enable FiCoCo-L for visual token compression
r: 42                       # Compress 42 tokens per layer
control_encoding_layer: 11  # Start compression from the 12th transformer layer
```
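To see how `r` and `control_encoding_layer` shape the token budget, here is a small sketch of the resulting per-layer token counts. The defaults (576 visual tokens, 24 encoder layers) match CLIP ViT-L/14 as used in LLaVA-1.5; they are assumptions here and should be adjusted for other backbones.

```python
def remaining_tokens(num_tokens=576, r=42, start_layer=11, num_layers=24):
    """Count the visual tokens surviving each layer when r tokens are
    compressed per layer from start_layer (0-indexed) onward.
    576 tokens / 24 layers correspond to CLIP ViT-L/14 in LLaVA-1.5."""
    counts = []
    for layer in range(num_layers):
        if layer >= start_layer:
            num_tokens = max(num_tokens - r, 0)
        counts.append(num_tokens)
    return counts

counts = remaining_tokens()
print(counts[-1])  # 30 tokens remain after the final layer
```

With `r=42` and `control_encoding_layer=11`, compression runs over the last 13 layers, shrinking 576 tokens down to 30.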
Example for evaluating SQA results (`r=42`, `control_encoding_layer=11`, `merge_visual=True`):

```shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```
To calculate the theoretical computational efficiency shown above, we recommend the methodology of LLM-Viewer. We deeply appreciate their outstanding contribution to this field.
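For a quick intuition of why token reduction cuts FLOPs, here is a back-of-envelope estimate of one decoder layer's forward cost as a function of sequence length. This is a generic transformer approximation, not LLM-Viewer's exact cost model, and the token counts in the usage lines are illustrative assumptions.

```python
def layer_flops(n, d=4096, ffn_mult=4):
    """Rough forward-pass FLOPs for one transformer decoder layer:
    QKV/output projections (8*n*d^2), attention scores and values
    (4*n^2*d), and the two FFN matmuls (4*ffn_mult*n*d^2).
    A back-of-envelope approximation only."""
    proj = 8 * n * d * d
    attn = 4 * n * n * d
    ffn = 4 * ffn_mult * n * d * d
    return proj + attn + ffn

full = layer_flops(576 + 64)    # e.g. 576 visual + 64 text tokens (assumed)
reduced = layer_flops(64 + 64)  # visual tokens compressed to 64 (hypothetical)
print(f"FLOPs reduction: {1 - reduced / full:.1%}")
```

Because both the projection and FFN terms scale linearly in `n` (and attention quadratically), cutting most visual tokens removes the bulk of per-layer compute.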
Considering that some MLLM visual encoders do not involve a [CLS] token, we propose a feasible alternative. Specific results and further details can be found in the paper.
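As a generic illustration of scoring tokens without a [CLS] anchor, one common heuristic is to score each token by the average attention it receives from all other tokens. This sketch is an assumption for illustration only; FiCoCo's actual [CLS]-free criterion is described in the paper.

```python
import numpy as np

def saliency_without_cls(attn):
    """Score each token by the mean attention it receives from all
    queries, as a stand-in for [CLS]-to-patch attention.
    attn: (num_heads, num_tokens, num_tokens) attention weights.
    Generic heuristic for illustration; see the paper for FiCoCo's
    actual [CLS]-free criterion."""
    # Average over heads, then over query positions -> attention received
    return attn.mean(axis=0).mean(axis=0)

# Toy attention map: 12 heads over 576 visual tokens (assumed sizes)
rng = np.random.default_rng(0)
logits = rng.standard_normal((12, 576, 576))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over keys

scores = saliency_without_cls(attn)
keep = np.argsort(scores)[42:]  # drop the 42 least-attended tokens
```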
If you use FiCoCo in your research, please cite our work by using the following BibTeX entry:
```bibtex
@misc{han2025filtercorrelatecompresstrainingfree,
      title={Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration},
      author={Yuhang Han and Xuyang Liu and Zihan Zhang and Pengxiang Ding and Donglin Wang and Honggang Chen and Qingsen Yan and Siteng Huang},
      year={2025},
      eprint={2411.17686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17686},
}
```
We extend our gratitude to the open-source efforts of LLaVA, ToMe, and Open-LLaVA-NeXT.

For any questions about our paper or code, please email yuhangh984@gmail.com or liuxuyang@stu.scu.edu.cn.