
Commit 8124588

FSSRepo and leejet authored
feat: ggml-alloc integration and gpu acceleration (leejet#75)
* set ggml url to FSSRepo/ggml
* ggml-alloc integration
* offload all functions to gpu
* gguf format + native converter
* merge custom vae to a model
* full offload to gpu
* improve pretty progress

Co-authored-by: leejet <leejet714@gmail.com>
1 parent c874063 commit 8124588

29 files changed: +120851 -2831 lines changed
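The headline item, ggml-alloc integration, refers to ggml's graph allocator: a compute graph is first walked in "measure" mode to find the peak memory it needs, then a single buffer of that size is allocated and the graph is rebuilt against it. The sketch below is a minimal, self-contained illustration of that pattern; it is not code from this commit, `build_graph` is a hypothetical stand-in for the real UNet/VAE/CLIP graph builders, and the `ggml_allocr_*` and graph helper names follow the ggml revisions of this period, which may differ slightly from the fork pinned here.

```cpp
// Minimal sketch of the ggml-alloc "measure then allocate" pattern (illustrative
// only, not taken from this commit). build_graph() is a hypothetical stand-in
// for the real UNet/VAE/CLIP graph builders; exact helper names can differ
// slightly between ggml revisions.
#include "ggml.h"
#include "ggml-alloc.h"
#include <cstdint>
#include <vector>

static const size_t tensor_alignment = 32; // assumption: typical alignment passed to ggml-alloc

static struct ggml_cgraph * build_graph(struct ggml_context * ctx) {
    // tiny placeholder graph: out = a * b (matrix product)
    struct ggml_tensor * a   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * b   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * out = ggml_mul_mat(ctx, a, b);
    struct ggml_cgraph * gf  = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    return gf;
}

int main() {
    // metadata-only context: no_alloc = true, so tensors get no data storage here
    struct ggml_init_params params = { /*mem_size*/ 16u * 1024 * 1024, /*mem_buffer*/ nullptr, /*no_alloc*/ true };

    // pass 1: measure how much compute-buffer memory the graph needs
    struct ggml_context * ctx_measure = ggml_init(params);
    struct ggml_allocr *  allocr      = ggml_allocr_new_measure(tensor_alignment);
    size_t compute_size = ggml_allocr_alloc_graph(allocr, build_graph(ctx_measure)) + tensor_alignment;
    ggml_allocr_free(allocr);
    ggml_free(ctx_measure);

    // pass 2: rebuild the graph and place every intermediate tensor in one buffer
    std::vector<uint8_t> compute_buffer(compute_size);
    struct ggml_context * ctx = ggml_init(params);
    allocr = ggml_allocr_new(compute_buffer.data(), compute_buffer.size(), tensor_alignment);
    struct ggml_cgraph * gf = build_graph(ctx);
    ggml_allocr_alloc_graph(allocr, gf);

    // ...compute gf with the chosen backend; call ggml_allocr_reset(allocr)
    // before building the graph for the next diffusion step...

    ggml_allocr_free(allocr);
    ggml_free(ctx);
    return 0;
}
```

The same measured-buffer idea also applies when the compute buffer is owned by a GPU backend, which is what makes a full offload practical.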

.gitignore

Lines changed: 7 additions & 1 deletion
@@ -1,6 +1,12 @@
 build*/
 test/
-
+.vscode/
 .cache/
 *.swp
 .vscode/
+*.bat
+*.bin
+*.exe
+*.gguf
+output.png
+models/*

.gitmodules

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 [submodule "ggml"]
-    path = ggml
-    url = https://github.com/leejet/ggml.git
+    path = ggml
+    url = https://github.com/FSSRepo/ggml.git

CMakeLists.txt

Lines changed: 15 additions & 0 deletions
@@ -24,10 +24,24 @@ endif()
 # general
 #option(SD_BUILD_TESTS "sd: build tests" ${SD_STANDALONE})
 option(SD_BUILD_EXAMPLES "sd: build examples" ${SD_STANDALONE})
+option(SD_CUBLAS "sd: cuda backend" OFF)
+option(SD_FLASH_ATTN "sd: use flash attention for x4 less memory usage" OFF)
 option(BUILD_SHARED_LIBS "sd: build shared libs" OFF)
 #option(SD_BUILD_SERVER "sd: build server example" ON)
 
+if(SD_CUBLAS)
+    message("Use CUBLAS as backend stable-diffusion")
+    set(GGML_CUBLAS ON)
+    add_definitions(-DSD_USE_CUBLAS)
+endif()
+
+if(SD_FLASH_ATTN)
+    message("Use Flash Attention for memory optimization")
+    add_definitions(-DSD_USE_FLASH_ATTENTION)
+endif()
+
 
+set(CMAKE_POLICY_DEFAULT_CMP0077 NEW)
 # deps
 add_subdirectory(ggml)
 

@@ -38,6 +52,7 @@ target_link_libraries(${SD_LIB} PUBLIC ggml)
 target_include_directories(${SD_LIB} PUBLIC .)
 target_compile_features(${SD_LIB} PUBLIC cxx_std_11)
 
+add_subdirectory(common)
 
 if (SD_BUILD_EXAMPLES)
     add_subdirectory(examples)
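The two new options end up as the `SD_USE_CUBLAS` and `SD_USE_FLASH_ATTENTION` compile definitions, which the C++ sources can then test with the preprocessor. The fragment below only illustrates that wiring; the guard names come from the CMake diff above, while the backend call is the assumed `ggml_init_cublas()` entry point of ggml's CUDA backend in this era, not code copied from the commit.

```cpp
// Illustration of how the -DSD_USE_CUBLAS / -DSD_USE_FLASH_ATTENTION definitions
// added above are typically consumed in the sources (placeholder code, not from
// this commit).
#ifdef SD_USE_CUBLAS
#include "ggml-cuda.h" // assumption: CUDA backend header shipped with the pinned ggml fork
#endif

// pick the compute backend once at startup
void sd_init_backend() {
#ifdef SD_USE_CUBLAS
    ggml_init_cublas(); // assumed ggml CUDA backend init; only compiled when SD_CUBLAS=ON
#endif
}

// decide at compile time whether the memory-saving attention path is available
bool sd_flash_attention_enabled() {
#ifdef SD_USE_FLASH_ATTENTION
    return true;  // use ggml's flash-attention path (CPU only at this point)
#else
    return false; // fall back to the standard attention implementation
#endif
}
```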

README.md

Lines changed: 42 additions & 23 deletions
@@ -9,17 +9,20 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 ## Features
 
 - Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
+- Super lightweight and without external dependencies.
 - 16-bit, 32-bit float support
 - 4-bit, 5-bit and 8-bit integer quantization support
 - Accelerated memory-efficient CPU inference
-- Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image
+- Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image, enabling Flash Attention just requires ~1.8GB.
 - AVX, AVX2 and AVX512 support for x86 architectures
 - SD1.x and SD2.x support
+- Full CUDA backend for GPU acceleration, for now just for float16 and float32 models. There are some issues with quantized models and CUDA; it will be fixed in the future.
+- Flash Attention for memory usage optimization (only cpu for now).
 - Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
 - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
-- Latent Consistency Models support(LCM/LCM-LoRA)
+- Latent Consistency Models support (LCM/LCM-LoRA)
 - Sampling method
     - `Euler A`
     - `Euler`
@@ -40,10 +43,11 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 ### TODO
 
 - [ ] More sampling methods
-- [ ] GPU support
 - [ ] Make inference faster
     - The current implementation of ggml_conv_2d is slow and has high memory usage
 - [ ] Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
+- [ ] Implement BPE Tokenizer
+- [ ] Add [TAESD](https://github.com/madebyollin/taesd) for faster VAE decoding
 - [ ] k-quants support
 
 ## Usage
@@ -77,24 +81,20 @@ git submodule update
 # curl -L -O https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/v2-1_768-nonema-pruned.safetensors
 ```
 
-- convert weights to ggml model format
+- convert weights to gguf model format
 
 ```shell
-cd models
-pip install -r requirements.txt
-# (optional) python convert_diffusers_to_original_stable_diffusion.py --model_path [path to diffusers weights] --checkpoint_path [path to weights]
-python convert.py [path to weights] --out_type [output precision]
-# For example, python convert.py sd-v1-4.ckpt --out_type f16
+./bin/convert sd-v1-4.ckpt -t f16
 ```
 
 ### Quantization
 
-You can specify the output model format using the --out_type parameter
+You can specify the output model format using the `--type` or `-t` parameter
 
 - `f16` for 16-bit floating-point
 - `f32` for 32-bit floating-point
-- `q8_0` for 8-bit integer quantization
-- `q5_0` or `q5_1` for 5-bit integer quantization
+- `q8_0` for 8-bit integer quantization
+- `q5_0` or `q5_1` for 5-bit integer quantization
 - `q4_0` or `q4_1` for 4-bit integer quantization
 
 ### Build
@@ -115,6 +115,24 @@ cmake .. -DGGML_OPENBLAS=ON
 cmake --build . --config Release
 ```
 
+##### Using CUBLAS
+
+This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads). Recommended to have at least 4 GB of VRAM.
+
+```
+cmake .. -DSD_CUBLAS=ON
+cmake --build . --config Release
+```
+
+### Using Flash Attention
+
+Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.
+
+```
+cmake .. -DSD_FLASH_ATTN=ON
+cmake --build . --config Release
+```
+
 ### Run
 
 ```
@@ -141,14 +159,15 @@ arguments:
   --steps STEPS                      number of sample steps (default: 20)
   --rng {std_default, cuda}          RNG (default: cuda)
   -s SEED, --seed SEED               RNG seed (default: 42, use random seed for < 0)
+  -b, --batch-count COUNT            number of images to generate.
   --schedule {discrete, karras}      Denoiser sigma schedule (default: discrete)
   -v, --verbose                      print extra info
 ```
 
 #### txt2img example
 
 ```
-./bin/sd -m ../models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat"
+./bin/sd -m ../sd-v1-4-f16.gguf -p "a lovely cat"
 ```
 
 Using formats of different precisions will yield results of varying quality.
@@ -163,7 +182,7 @@ Using formats of different precisions will yield results of varying quality.
 
 
 ```
-./bin/sd --mode img2img -m ../models/sd-v1-4-ggml-model-f16.bin -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
+./bin/sd --mode img2img -m ../models/sd-v1-4-f16.gguf -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
 ```
 
 <p align="center">
@@ -172,12 +191,11 @@ Using formats of different precisions will yield results of varying quality.
 
 #### with LoRA
 
-- convert lora weights to ggml model format
+- convert lora weights to gguf model format
 
 ```shell
-cd models
-python convert.py [path to weights] --lora
-# For example, python convert.py marblesh.safetensors
+bin/convert [lora path] -t f16
+# For example, bin/convert marblesh.safetensors -t f16
 ```
 
 - You can specify the directory where the lora weights are stored via `--lora-model-dir`. If not specified, the default is the current working directory.
@@ -187,10 +205,10 @@ Using formats of different precisions will yield results of varying quality.
 Here's a simple example:
 
 ```
-./bin/sd -m ../models/v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models
+./bin/sd -m ../models/v1-5-pruned-emaonly-f16.gguf -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models
 ```
 
-`../models/marblesh-ggml-lora.bin` will be applied to the model
+`../models/marblesh.gguf` will be applied to the model
 
 #### LCM/LCM-LoRA
 
@@ -201,7 +219,7 @@ Here's a simple example:
 Here's a simple example:
 
 ```
-./bin/sd -m ../models/v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1
+./bin/sd -m ../models/v1-5-pruned-emaonly-f16.gguf -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1
 ```
 
 | without LCM-LoRA (--cfg-scale 7) | with LCM-LoRA (--cfg-scale 1) |
@@ -222,15 +240,16 @@ docker build -t sd .
 ```shell
 docker run -v /path/to/models:/models -v /path/to/output/:/output sd [args...]
 # For example
-# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat" -v -o /output/output.png
+# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4-f16.gguf -p "a lovely cat" -v -o /output/output.png
 ```
 
 ## Memory/Disk Requirements
 
 | precision | f32 | f16 |q8_0 |q5_0 |q5_1 |q4_0 |q4_1 |
 | ---- | ---- |---- |---- |---- |---- |---- |---- |
 | **Disk** | 2.7G | 2.0G | 1.7G | 1.6G | 1.6G | 1.5G | 1.5G |
-| **Memory**(txt2img - 512 x 512) | ~2.8G | ~2.3G | ~2.1G | ~2.0G | ~2.0G | ~2.0G | ~2.0G |
+| **Memory** (txt2img - 512 x 512) | ~2.8G | ~2.3G | ~2.1G | ~2.0G | ~2.0G | ~2.0G | ~2.0G |
+| **Memory** (txt2img - 512 x 512) *with Flash Attention* | ~2.4G | ~1.9G | ~1.6G | ~1.5G | ~1.5G | ~1.5G | ~1.5G |
 
 ## Contributors
 
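Since the README now points at `.gguf` files produced by the new `./bin/convert` tool, a quick way to sanity-check a converted model is to read its header back with the gguf API that ships with ggml. The snippet below is a hedged sketch of that; the file name is just an example, and it assumes the converter writes standard gguf containers readable by `gguf_init_from_file`, which is not shown in this diff.

```cpp
// Sketch: inspect the metadata and tensor list of a converted .gguf model
// (illustrative only; assumes a standard gguf container as handled by ggml's
// gguf API, not code from this commit).
#include "ggml.h" // the gguf_* declarations live in ggml.h in this ggml era
#include <cstdio>

int main() {
    struct ggml_context * meta = nullptr;
    struct gguf_init_params params = { /*no_alloc*/ true, /*ctx*/ &meta };

    // example path; use whatever ./bin/convert produced
    struct gguf_context * gctx = gguf_init_from_file("sd-v1-4-f16.gguf", params);
    if (gctx == nullptr) {
        std::fprintf(stderr, "failed to open gguf file\n");
        return 1;
    }

    // key/value metadata written by the converter
    for (int i = 0; i < gguf_get_n_kv(gctx); ++i) {
        std::printf("kv %d: %s\n", i, gguf_get_key(gctx, i));
    }
    // names of the tensors stored in the file
    for (int i = 0; i < gguf_get_n_tensors(gctx); ++i) {
        std::printf("tensor %d: %s\n", i, gguf_get_tensor_name(gctx, i));
    }

    gguf_free(gctx);
    if (meta) {
        ggml_free(meta);
    }
    return 0;
}
```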

common/CMakeLists.txt

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+set(TARGET common)
+
+# json.hpp library from: https://github.com/nlohmann/json
+
+add_library(${TARGET} OBJECT common.cpp common.h stb_image.h stb_image_write.h json.hpp)
+
+target_include_directories(${TARGET} PUBLIC .)
+target_link_libraries(${TARGET} PRIVATE stable-diffusion ${CMAKE_THREAD_LIBS_INIT})
+target_compile_features(${TARGET} PUBLIC cxx_std_11)
+
+# ZIP Library from: https://github.com/kuba--/zip
+
+set(Z_TARGET zip)
+add_library(${Z_TARGET} OBJECT zip.c zip.h miniz.h)
+target_include_directories(${Z_TARGET} PUBLIC .)
