feat: ggml-alloc integration and gpu acceleration (leejet#75)
* set ggml url to FSSRepo/ggml
* ggml-alloc integration
* offload all functions to gpu
* gguf format + native converter
* merge custom vae to a model
* full offload to gpu
* improve pretty progress
---------
Co-authored-by: leejet <leejet714@gmail.com>

README.md: 42 additions & 23 deletions

## Features

- Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
- Super lightweight and without external dependencies
- 16-bit, 32-bit float support
- 4-bit, 5-bit and 8-bit integer quantization support
- Accelerated memory-efficient CPU inference
  - Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image; with Flash Attention enabled, only ~1.8GB is required
- AVX, AVX2 and AVX512 support for x86 architectures
- SD1.x and SD2.x support
- Full CUDA backend for GPU acceleration; for now only float16 and float32 models are supported. There are still some issues with quantized models on CUDA, which will be fixed in the future.
- Flash Attention for memory usage optimization (CPU only for now)
- Original `txt2img` and `img2img` mode
- Negative prompt
- [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all features, only token weighting for now)
- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
- Latent Consistency Models support (LCM/LCM-LoRA)
- Sampling method
  - `Euler A`
  - `Euler`

### TODO

- [ ] More sampling methods
- [ ] Make inference faster
  - The current implementation of ggml_conv_2d is slow and has high memory usage
- [ ] Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
- [ ] Implement BPE Tokenizer
- [ ] Add [TAESD](https://github.com/madebyollin/taesd) for faster VAE decoding

```
# (optional) python convert_diffusers_to_original_stable_diffusion.py --model_path [path to diffusers weights] --checkpoint_path [path to weights]
./bin/convert sd-v1-4.ckpt -t f16
```

### Quantization

You can specify the output model format using the `--type` or `-t` parameter.

- `f16` for 16-bit floating-point
- `f32` for 32-bit floating-point
- `q8_0` for 8-bit integer quantization
- `q5_0` or `q5_1` for 5-bit integer quantization
- `q4_0` or `q4_1` for 4-bit integer quantization
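
For example, assuming the same `sd-v1-4.ckpt` checkpoint shown above, an 8-bit quantized model can be produced like this:

```
./bin/convert sd-v1-4.ckpt -t q8_0
```

Keep in mind that, per the feature list above, quantized models currently still have some issues with the CUDA backend.
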
### Build

##### Using OpenBLAS

```
cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release
```

##### Using CUBLAS

This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads). Having at least 4 GB of VRAM is recommended.

```
cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release
```
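
With a CUBLAS build the command line does not change; nothing in this README adds a GPU-specific runtime flag, so (assuming the CUDA backend is picked up automatically by a build configured with `-DSD_CUBLAS=ON`) a run looks the same as on the CPU, with the work offloaded to the GPU:

```
# sketch: same CLI as the CPU build; weights and computation are offloaded to the GPU
./bin/sd -m ../sd-v1-4-f16.gguf -p "a lovely cat"
```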

### Using Flash Attention

Enabling Flash Attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.

```
cmake .. -DSD_FLASH_ATTN=ON
cmake --build . --config Release
```
### Run

```
arguments:
  --steps STEPS                      number of sample steps (default: 20)
  --rng {std_default, cuda}          RNG (default: cuda)
  -s SEED, --seed SEED               RNG seed (default: 42, use random seed for < 0)
  -b, --batch-count COUNT            number of images to generate.
```

For example:

```
./bin/sd -m ../sd-v1-4-f16.gguf -p "a lovely cat"
```
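
A quick sketch combining the options listed above, using the f16 gguf model produced by the converter earlier:

```
# 4 images, 30 sampling steps, fixed seed for reproducibility
./bin/sd -m ../sd-v1-4-f16.gguf -p "a lovely cat" --steps 30 -s 1234 -b 4
```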
Using formats of different precisions will yield results of varying quality.

```
./bin/sd --mode img2img -m ../models/sd-v1-4-f16.gguf -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
```
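
The `--strength` value controls how far img2img moves away from the input image; assuming the usual 0.0–1.0 range, a higher value keeps less of `output.png` and follows the prompt more:

```
# sketch: a stronger edit that preserves less of the input image
./bin/sd --mode img2img -m ../models/sd-v1-4-f16.gguf -p "cat with blue eyes" -i ./output.png -o ./img2img_output_strong.png --strength 0.75
```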
#### with LoRA

- convert LoRA weights to gguf model format

```shell
bin/convert [lora path] -t f16
# For example, bin/convert marblesh.safetensors -t f16
```

- You can specify the directory where the lora weights are stored via `--lora-model-dir`. If not specified, the default is the current working directory.
Here's a simple example:

```
./bin/sd -m ../models/v1-5-pruned-emaonly-f16.gguf -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models
```

`../models/marblesh.gguf` will be applied to the model.
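
The number after the LoRA name in the prompt is its weight. Assuming the same `<lora:name:multiplier>` syntax as stable-diffusion-webui (linked above), the LoRA can also be applied at reduced strength, for example:

```
# sketch: apply the marblesh LoRA at half strength
./bin/sd -m ../models/v1-5-pruned-emaonly-f16.gguf -p "a lovely cat<lora:marblesh:0.5>" --lora-model-dir ../models
```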
0 commit comments