New DynamicSlidingWindowLayer & associated Cache #40039

Open: Cyrilvallez wants to merge 28 commits into main from dynamic-sliding-hybrid

Cyrilvallez (Member) commented Aug 8, 2025

What does this PR do?

As per the title: this adds a DynamicSlidingWindowLayer and its associated Cache, to avoid wasting memory for models that use a sliding window. I don't want to reintroduce static hybrid caches by default, because of all the pitfalls of automatic compilation, but I also don't want to waste that memory, so this is definitely the way to go.

The only change needed is to pass the config to DynamicCache, so that it can parse sliding_window/layer_types. If no config is passed, the behavior is exactly the same as before.
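To make the idea concrete, here is a minimal pure-Python sketch of how a cache could pick a layer type from the config. It is not the code in this PR: `ToyFullLayer`, `ToySlidingLayer` and `ToyDynamicCache` are hypothetical names, the config is a plain dict, the `"sliding_attention"`/`"full_attention"` strings are assumed values for `layer_types`, and plain lists stand in for key/value tensors. It also ignores details such as what gets returned to attention during a prefill longer than the window.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyFullLayer:
    """Hypothetical full-attention layer: the stored keys grow without bound."""
    keys: list = field(default_factory=list)

    def update(self, new_keys: list) -> list:
        self.keys.extend(new_keys)
        return self.keys


@dataclass
class ToySlidingLayer:
    """Hypothetical sliding layer: only the most recent `window` entries are kept."""
    window: int
    keys: list = field(default_factory=list)

    def update(self, new_keys: list) -> list:
        self.keys.extend(new_keys)
        # Drop everything that has fallen out of the window.
        self.keys = self.keys[-self.window:]
        return self.keys


class ToyDynamicCache:
    def __init__(self, num_layers: int, config: Optional[dict] = None):
        # Without a config, every layer is a full layer (same behavior as before).
        layer_types = ["full_attention"] * num_layers
        window = None
        if config is not None:
            window = config.get("sliding_window")
            # Assumption: if no layer_types are given but a window is set,
            # treat every layer as sliding.
            layer_types = config.get("layer_types") or (
                ["sliding_attention"] * num_layers if window else layer_types
            )
        self.layers = [
            ToySlidingLayer(window) if t == "sliding_attention" and window else ToyFullLayer()
            for t in layer_types
        ]

    def update(self, new_keys: list, layer_idx: int) -> list:
        return self.layers[layer_idx].update(new_keys)


# One sliding layer and one full layer, with a tiny window of 4 for readability.
cache = ToyDynamicCache(
    2, {"sliding_window": 4, "layer_types": ["sliding_attention", "full_attention"]}
)
for token in range(10):
    for layer_idx in range(2):
        cache.update([token], layer_idx)
print([len(layer.keys) for layer in cache.layers])
```

With no config, the same loop leaves both layers holding all 10 entries, which mirrors the "exactly the same as before" case.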

See the following figures for an illustration:

  • top: Mistral 7B, all layers are sliding, so the cache stops growing after reaching the window size of 4096
  • bottom: Gemma 2 9B, 1 out of 2 layers is sliding, so the cache grows "sublinearly" after reaching the window size of 4096 (only the non-sliding layers keep growing)
[Two screenshots, 2025-08-11: top, Mistral 7B; bottom, Gemma 2 9B]
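The shape of the two curves can be reproduced with a bit of arithmetic. The sketch below counts cached positions summed over layers, where a sliding layer is capped at the window and a full layer is not. The helper `cached_positions` is hypothetical, and the layer counts (32 layers, all sliding; 42 layers, half sliding) are illustrative assumptions, not values read from this PR. Only the window of 4096 comes from the description above.

```python
def cached_positions(seq_len: int, num_layers: int, num_sliding: int, window: int = 4096) -> int:
    """Total cached positions over all layers (hypothetical helper).

    Sliding layers hold at most `window` positions; full layers hold `seq_len`.
    """
    num_full = num_layers - num_sliding
    return num_full * seq_len + num_sliding * min(seq_len, window)


for seq_len in (2048, 4096, 8192, 16384):
    # Assumed layer counts, for illustration only.
    all_sliding = cached_positions(seq_len, num_layers=32, num_sliding=32)
    half_sliding = cached_positions(seq_len, num_layers=42, num_sliding=21)
    print(f"seq_len={seq_len:>6}  all sliding: {all_sliding:>7}  half sliding: {half_sliding:>7}")
```

Past the window, the all-sliding total is flat, while the half-sliding total keeps growing at half the slope it had before the window was reached.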

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Cyrilvallez force-pushed the dynamic-sliding-hybrid branch from 534a6a4 to 41d55aa on August 11, 2025 08:31
Cyrilvallez changed the title from "New DynamicSlidingWindow layer & caches" to "New DynamicSlidingWindow layer & cache" on Aug 11, 2025
Cyrilvallez changed the title from "New DynamicSlidingWindow layer & cache" to "New DynamicSlidingWindowLayer & associated Cache" on Aug 11, 2025
Cyrilvallez (Member, Author) commented Aug 11, 2025

All good now: slow tests on mistral, gemma2 and t5gemma all give results similar to main. The only difference is a slight fa2 issue that surfaced on a slow test for mistral, but it's unrelated and solved by #40002.

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: arcee, aria, bitnet, cohere, cohere2, csm, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5, exaone4, fsmt, gemma2
