Feat: add Flux 1 Lite 8B (Freepik) support #474


Merged · 3 commits merged into leejet:master on Nov 23, 2024
Conversation

stduhpf (Contributor) commented Nov 20, 2024

Diffusion model weights: https://huggingface.co/Freepik/flux.1-lite-8B-alpha/blob/main/flux.1-lite-8B-alpha.safetensors.

It's basically Flux 1 Dev with most of the double blocks removed. Flux 1 Lite Q4_k is smaller than Flux 1 Dev Q3_k, while delivering better image quality (in my subjective opinion). It's also about 25% faster during image generation.

.\build\bin\Release\sd.exe --diffusion-model ..\ComfyUI\models\unet\flux.1-lite-8B-alpha-q4_k.gguf --vae ..\ComfyUI\models\vae\ae.q8_0.gguf --clip_l ..\ComfyUI\models\clip\clip_l.q8_0.gguf --t5xxl ..\ComfyUI\models\clip\t5xxl_q4_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -t 24 --vae-tiling --color -v -W 1024 -H 1024
[output image]

| Flux 1 Dev Q3_k (5,114,284,416 bytes) | Flux 1 Lite Q4_k (4,819,297,120 bytes) |
|---|---|
| [dev-q3_k image] | [lite-q4_k image] |
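For reference, a sketch of how a quantized .gguf like the one above can be produced with sd.cpp's convert mode (the -M convert / -o / --type flags are assumed from the project README; verify against sd.exe --help, and adjust paths to your setup):

.\build\bin\Release\sd.exe -M convert -m ..\ComfyUI\models\unet\flux.1-lite-8B-alpha.safetensors -o ..\ComfyUI\models\unet\flux.1-lite-8B-alpha-q4_k.gguf --type q4_k -v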

Green-Sky (Contributor) commented Nov 21, 2024

Tested this on top of my flash attention PR, and the results are good 🚀.

CUDA (RTX 2070 Mobile / 8 GB VRAM)

| quant | dims | FA (🟢 on / 🔴 off) | compute buffer size | speed |
|---|---|---|---|---|
| q3_k | 512x512 | 🔴 | 398.50 MB (VRAM) | 1.67 s/it |
| q3_k | 512x512 | 🟢 | 248.50 MB (VRAM) | 1.39 s/it |
| q3_k | 1024x512 | 🔴 | 942.75 MB (VRAM) | 3.32 s/it |
| q3_k | 1024x512 | 🟢 | 456.75 MB (VRAM) | 2.62 s/it |
| q3_k | 768x768 | 🔴 | 1105.07 MB (VRAM) | 3.91 s/it |
| q3_k | 768x768 | 🟢 | 505.07 MB (VRAM) | 3.01 s/it |
| q3_k | 1024x1024 | 🔴 | 2577.25 MB (VRAM) | 8.06 s/it |
| q3_k | 1024x1024 | 🟢 | 843.25 MB (VRAM) | 5.79 s/it |
| q4_k | 512x512 | 🔴 | 398.50 MB (VRAM) | 1.56 s/it |
| q4_k | 512x512 | 🟢 | 248.50 MB (VRAM) | 1.30 s/it |
| q4_k | 1024x512 | 🔴 | 942.75 MB (VRAM) | 3.10 s/it |
| q4_k | 1024x512 | 🟢 | 456.75 MB (VRAM) | 2.43 s/it |
| q4_k | 768x768 | 🔴 | 1105.07 MB (VRAM) | 3.64 s/it |
| q4_k | 768x768 | 🟢 | 505.07 MB (VRAM) | 2.79 s/it |
| q4_k | 1024x1024 | 🔴 | OOM | |
| q4_k | 1024x1024 | 🟢 | 843.25 MB (VRAM) | 5.20 s/it |

Good stuff. The compute buffer sizes are, unsurprisingly, the same as for dev/schnell, and the speed is faster.

Green-Sky (Contributor)

With this model, the flash attention PR, and --vae-tiling, you can do some obscene stuff like q4_k at 2048x1024 on my 8 GiB of VRAM (example command below).
[output image]
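A sketch of what such a run could look like, reusing the invocation from the PR description (the --diffusion-fa flag name is an assumption based on the flash attention PR; check sd.exe --help on that branch before relying on it):

.\build\bin\Release\sd.exe --diffusion-model ..\ComfyUI\models\unet\flux.1-lite-8B-alpha-q4_k.gguf --vae ..\ComfyUI\models\vae\ae.q8_0.gguf --clip_l ..\ComfyUI\models\clip\clip_l.q8_0.gguf --t5xxl ..\ComfyUI\models\clip\t5xxl_q4_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler --diffusion-fa --vae-tiling -W 2048 -H 1024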

stduhpf (Contributor, Author) commented Nov 21, 2024

Sadly, Flash Attention is still not supported on the Vulkan backend. I might try merging in ggml-org/llama.cpp#10206 later to see how it goes.

leejet (Owner) commented Nov 23, 2024

Thank you for your contribution.

leejet merged commit 6ea8122 into leejet:master on Nov 23, 2024 (9 checks passed).
stduhpf deleted the freepik branch on Nov 23, 2024.