Give the user the option to override where model weights are stored #232
Conversation
For a tensor with zero elements ggml_nbytes() was returning uint64_t::max, and this was causing graph allocation failure.
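As an illustration of the failure mode (a sketch of the idea, not the actual patch): ggml_nbytes() derives the size from terms of the form (ne[i] - 1)*nb[i], so a zero-length dimension underflows to a huge unsigned value. Checking for an empty tensor first makes the size come out as 0; the ggml_nbytes_safe wrapper name below is made up for illustration, and the real fix presumably adjusts ggml_nbytes() itself.

```cpp
// Sketch only: guard against empty tensors before using the stride-based
// size computation, so a tensor with a zero-length dimension reports 0 bytes
// instead of a value close to uint64_t::max.
#include "ggml.h"

static size_t ggml_nbytes_safe(const struct ggml_tensor * tensor) {
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        if (tensor->ne[i] == 0) {
            return 0; // empty tensor: nothing to allocate
        }
    }
    return ggml_nbytes(tensor);
}
```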
Here are some results using
Dumping some timing info for TG: in a run with 5 tg128 evaluations I get
Here is the op timing breakdown for 5 x tg128 runs (CPU and GPU).
Is the cost call overhead or throughput?
Also, how do you generate these op timing breakdowns?
I don't know. I haven't gone into the back-end code to break it down, but I suspect most of it is synchronization inefficiency, as there isn't much data to be sent back and forth when doing TG.

I set
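One generic way to collect this kind of per-op breakdown (not necessarily what was used here) is llama.cpp's scheduler eval callback, assuming this fork still exposes cb_eval in llama_context_params as mainline does. Mainline's scheduler synchronizes the backend before handing each node's result to the callback, so accumulating wall time between callbacks, keyed by op name, gives a rough breakdown; the forced per-node synchronization perturbs GPU timings, so treat the numbers as indicative only. timing_cb and g_ms_per_op below are illustrative names.

```cpp
// Illustrative sketch: accumulate rough per-op wall time via the scheduler
// eval callback. Observing every node forces a backend sync per node, which
// changes the timing being measured; use only to get a qualitative picture.
#include "ggml.h"
#include "llama.h"

#include <chrono>
#include <map>
#include <string>

static std::map<std::string, double>         g_ms_per_op;
static std::chrono::steady_clock::time_point g_t_prev = std::chrono::steady_clock::now();

static bool timing_cb(struct ggml_tensor * t, bool ask, void * /*user_data*/) {
    if (ask) {
        return true; // yes, observe every node
    }
    const auto now = std::chrono::steady_clock::now();
    g_ms_per_op[ggml_op_name(t->op)] +=
        std::chrono::duration<double, std::milli>(now - g_t_prev).count();
    g_t_prev = now;
    return true; // continue computing the graph
}

// Wire it up when creating the context and dump g_ms_per_op after a run:
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = timing_cb;
//   cparams.cb_eval_user_data = nullptr;
```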
For TG the cost of copying data back and forth is negligible. Here is a rough breakdown of the 16% overhead:

For PP the copying of data back and forth is more significant. I tested with a context of 1024 and I see about 11% spent in
Update: I made a mistake in the above. I was using a model file that did not have the additional tensors required for MLA. But

So, ~20% slower than standard attention. CUDA does not like MLA. I need to investigate why.
I have observed the same phenomenon as you: after a single inference completes, there is a lot of D2H copy work. At present I also use multiple parallel processes to "bypass" this problem, just like the solution you mentioned. I am not sure: if we don't need to cache the results, can we simply abandon this part of the work? I would like to hear your opinion. PS: I am actually a rookie who has only been exposed to the llama.cpp source code for a week.
It seems this PR amounts to most of the "secret sauce" of KTransformers.
We add a command line option to override where model weights are stored using regular expressions. This makes it possible to keep the MoE experts on the CPU and to offload only the attention and the non-repeating layers to the GPU. The PR is inspired by ggml-org/llama.cpp#11397, but ik_llama.cpp has now diverged so much from mainline that I had to redo most of it.

Unfortunately I cannot test with DeepSeekV3/R1, but here is what I get for DeepSeek-Lite (a very similar MoE architecture) using

The argument to the new -ot or --override-tensor option is

In the above example we first ask for all model layers to be offloaded to the GPU (-ngl 100), but then override all model tensors that match the regular expression \.ffn_.*_exps\. so that they are kept on the CPU (and also not offloaded to the GPU to perform operations on them).

The PR is still a bit rough around the edges (not much error handling, mmap gets disabled for tensors with a buffer type override, etc.), but I am throwing it out there to get feedback. I would love to hear from someone with a GPU that has enough VRAM to fit all DeepSeekV3/R1 model weights on the GPU except the experts.
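To give a feel for the mechanism, here is a minimal sketch of a regex-based buffer-type override. It is not the PR's actual code: the "<regex>=<buffer type>" argument format is assumed from the mainline PR referenced above, and buft_override, parse_overrides, and pick_buft are illustrative names. In a real loader the chosen buffer type would determine where each tensor is allocated instead of being printed.

```cpp
// Minimal sketch (not the PR's code): parse "-ot <regex>=<buffer type>" style
// overrides and decide, per tensor name, whether the tensor stays on the CPU
// or follows the default GPU placement implied by -ngl.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

struct buft_override {
    std::regex  pattern; // matched against the tensor name
    std::string buft;    // e.g. "CPU" or "CUDA0"
};

// Parse a comma-separated list of "<regex>=<buffer type>" items.
static std::vector<buft_override> parse_overrides(const std::string & arg) {
    std::vector<buft_override> result;
    size_t start = 0;
    while (start < arg.size()) {
        size_t comma = arg.find(',', start);
        if (comma == std::string::npos) comma = arg.size();
        const std::string item = arg.substr(start, comma - start);
        const size_t eq = item.rfind('=');
        if (eq != std::string::npos) {
            result.push_back({std::regex(item.substr(0, eq)), item.substr(eq + 1)});
        }
        start = comma + 1;
    }
    return result;
}

// The first matching override wins; otherwise use the default placement.
static std::string pick_buft(const std::string & tensor_name,
                             const std::vector<buft_override> & overrides,
                             const std::string & default_buft) {
    for (const auto & ov : overrides) {
        if (std::regex_search(tensor_name, ov.pattern)) {
            return ov.buft;
        }
    }
    return default_buft;
}

int main() {
    // Keep all MoE expert tensors on the CPU, everything else on the GPU.
    const auto overrides = parse_overrides("\\.ffn_.*_exps\\.=CPU");
    for (const char * name : {"blk.3.ffn_up_exps.weight", "blk.3.attn_q.weight"}) {
        std::printf("%-28s -> %s\n", name, pick_buft(name, overrides, "CUDA0").c_str());
    }
    return 0;
}
```

With that argument format, an invocation matching the description above would combine -ngl 100 with something like -ot "\.ffn_.*_exps\.=CPU", although the exact syntax accepted by this PR may differ.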