llama : reuse compute graphs #14482
Conversation
Force-pushed 2f577c5 to 30b4d4e (compare)
Force-pushed f61b0f7 to d9e1781 (compare)
Force-pushed fc4fdf6 to 76681e3 (compare)
This should be ready for review. Currently, there is some small gain for Metal. It would be interesting to try to reuse the Metal command buffers to speed this up even further on the backend side.
```cpp
res &= self_kq_mask->ne[0] == mctx->get_n_kv();
res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);

res &= mctx->get_supports_set_rows(); // TODO: tmp
```
If `update()` is implemented for the recurrent cache, I think it could work even without adapting it to `ggml_set_rows`, because the `head` offset tends to be the same for similar consecutive ubatches in `find_slot`.

That might not work as well once multiple recurrent state cells per sequence are implemented (because they won't get re-used as much), but at that point it should be possible to use `ggml_set_rows`.
Yes, it has to work. As long as the check for the `head` is correctly added to the `update()` of the respective inputs, it should be good.
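For illustration, a minimal sketch of what such a check could look like; the struct and field names below are hypothetical, not the actual llama.cpp types. The idea is only that the input's `update()` compares the current `head` offset (and state size) against the values baked into the previously built graph.

```cpp
#include <cstdint>

// Hypothetical recurrent-state graph input (illustrative only). The graph
// built for the previous ubatch hard-codes the head offset chosen by
// find_slot, so reuse is only safe if the new ubatch resolves to the same
// offset and the same number of state cells.
struct rs_input_sketch {
    uint32_t prev_head = 0; // head offset used when the graph was built
    uint32_t prev_n_rs = 0; // number of recurrent-state cells used

    // returns true if the previously built graph is still valid
    bool update(uint32_t cur_head, uint32_t cur_n_rs) {
        bool res = true;
        res &= cur_head == prev_head;
        res &= cur_n_rs == prev_n_rs;
        return res;
    }
};
```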
An update on this - it seems that …

Edit: prototyped this in #14570. Does not seem worth pursuing, as the gains are microscopic.
I tested this on CUDA with various models I had lying around, and I see a perf regression on a larger model; not sure if I'm doing something wrong. I cherry-picked this PR on top of #14551 and compared with commit1 = 14551 and commit2 = 14551 + this PR. Also interesting is a 2x(?) speed-up on qwen2vl 3B.
To be sure, I ran it again.
But before benchmarking, I think you should make sure that the generated results when graph reuse is enabled are coherent (see the …).
And also, from the results that you posted, it seems that there is a lot of variability on your system for this …
@ggerganov I re-ran with r=100, and tg 64 and 128. I see quite a bit of variability at tg128, but tg64 is pretty tight (<1% variability). Also confirming the …
Force-pushed 7a9e3f4 to 600e69f (compare)
```diff
@@ -961,6 +963,7 @@ void llama_kv_cache_unified::set_input_kq_mask(ggml_tensor * dst, const llama_ub
     // xxxxx-----
     // xxxxx-----
     // To visualize the mask, see https://github.com/ggml-org/llama.cpp/pull/12615
+    // TODO: optimize this section
```
Note for later: there is opportunity for optimizing the KQ mask initialization here. For large `n_kv` this section becomes measurable.
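To make the cost concrete, here is a small illustrative sketch (not the actual llama.cpp code) of a causal KQ mask fill: one float is written per (token, KV cell) pair, so the work grows as `n_tokens * n_kv`, which is why this section becomes measurable for large KV caches.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative mask fill: cells the token may attend to get 0.0f, the rest
// get -INFINITY. The nested loop is O(n_tokens * n_kv).
static void fill_kq_mask_sketch(std::vector<float> & mask,
                                const std::vector<int32_t> & tok_pos, // positions of the ubatch tokens
                                const std::vector<int32_t> & kv_pos,  // positions stored in the KV cells
                                bool causal) {
    const size_t n_tokens = tok_pos.size();
    const size_t n_kv     = kv_pos.size();

    mask.assign(n_tokens*n_kv, -INFINITY);

    for (size_t i = 0; i < n_tokens; ++i) {
        for (size_t j = 0; j < n_kv; ++j) {
            if (!causal || kv_pos[j] <= tok_pos[i]) {
                mask[i*n_kv + j] = 0.0f;
            }
        }
    }
}
```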
Force-pushed ce77041 to 8303a68 (compare)
Force-pushed 8303a68 to 3d28b3b (compare)
```diff
-int32_t llama_context::graph_max_nodes() const {
-    return std::max<int32_t>(65536, 5*model.n_tensors());
+uint32_t llama_context::graph_max_nodes() const {
+    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
```
On a side note, I don't know why the minimum number of nodes is so big. This is a waste of memory for the smaller models. The change seems to be from #11571.
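As a purely hypothetical illustration of the point (not something this PR changes), the fixed 65536 floor could instead be made proportional to the model size, so small models would not pay for a node budget they never use:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical alternative with illustrative numbers: scale the node budget
// with the number of model tensors and use a much smaller fixed floor.
static uint32_t graph_max_nodes_sketch(uint32_t n_tensors) {
    return std::max<uint32_t>(1024u, 8u*n_tensors);
}
```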
I will follow up with a PR to fix this.
```cpp
// reset the previous graph result to make sure that it won't be reused
// TODO: change the mctx->apply() to return information if a graph reserve is needed
//       reset the graph result only if the memory module did reset the scheduler
gf_res_prev->reset();
```
I am not sure what would happen if e.g. in a call to kv_self_update the scheduler is reset. Is this detected to prevent reusing the graph?
I added a reset of the previous graph result every time the memory module makes an update to guarantee that if a scheduler reset occurs, we would not attempt to reuse the previous graph.
This is too conservative, because sometimes a memory module update does not involve a scheduler reset (for example, a KV cache buffer copy from one stream to another). Will follow up with an update to avoid this overhead, per the TODO.
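A minimal sketch of the conservative policy described above, using illustrative names rather than the actual llama.cpp types: after every memory-module update the previous graph result is dropped, and the TODO is to only drop it when the update really reset the scheduler.

```cpp
// Illustrative types, not the real llama.cpp classes.
struct prev_graph_result_sketch {
    bool valid = false;
    void reset() { valid = false; } // forget the previously built graph
};

// called after the memory module (e.g. the KV cache) applies an update
static void on_memory_update_sketch(prev_graph_result_sketch & gf_res_prev, bool sched_was_reset) {
    // conservative: always reset, so a reset scheduler can never be paired
    // with a stale graph. Per the TODO, this could eventually become:
    //     if (sched_was_reset) { gf_res_prev.reset(); }
    (void) sched_was_reset;
    gf_res_prev.reset();
}
```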
In acaf4b7 I made a relatively significant change to how the … @slaren Might be worth taking another look just in case. Even if not perfect, the results should be correct, and graph reuse with split KV cache should now be functional. I will be improving the implementation in a follow-up PR.
Force-pushed 0bdb209 to c7ccf38 (compare)
target #14285

Reuse computation graphs from the previous ubatch when possible. Works with any batch size and any model.

Note: to enable this functionality, there is a temporary requirement for `LLAMA_SET_ROWS=1` to be set in your environment. In the future, this will become the default.

This functionality requires the `ggml_set_rows()` operator to be supported (see #14285). In order to be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. `inp_embd`, `inp_pos`, `inp_attn`, etc.).

This PR adds logic to update a previous `llm_graph_result` by verifying that the new `llm_graph_params` would result in the same tensor shapes. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new `ubatch` is compatible. See the new `llm_graph_result::update()` method:

llama.cpp/src/llama-graph.h, lines 506 to 525 in fc4fdf6
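As a rough sketch of the idea (illustrative types and names, not the code in `llama-graph.h`): the previous result can only be reused if every input would be given exactly the same tensor shape by the new parameters; any mismatch forces a normal graph rebuild.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-ins for the graph inputs and the graph result.
struct graph_input_sketch {
    std::array<int64_t, 4> ne; // tensor shape when the graph was built

    // would the new params produce the same shape for this input?
    bool update(const std::array<int64_t, 4> & new_ne) const {
        return ne == new_ne;
    }
};

struct graph_result_sketch {
    std::vector<graph_input_sketch> inputs;

    // true  -> the previous graph (and its allocated buffers) can be reused
    // false -> the caller must build and allocate a new graph
    bool update(const std::vector<std::array<int64_t, 4>> & new_shapes) const {
        if (new_shapes.size() != inputs.size()) {
            return false;
        }
        for (size_t i = 0; i < inputs.size(); ++i) {
            if (!inputs[i].update(new_shapes[i])) {
                return false;
            }
        }
        return true;
    }
};
```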
The other change that is needed is to introduce a way to swap the `llama_memory_context` of all graph inputs, so that the new call to `llm_graph_result_i::set_inputs()` uses the correct context from the current `ubatch`. This is performed by calling the `llm_graph_input_i::update()` method of all input tensors.

To enable this feature, define the `LLAMA_SET_ROWS` environment variable.

To debug, define `LLAMA_GRAPH_RESULT_DEBUG=2` and add `-lv 1` to the CLI args.

Tests
```bash
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128
```
Benchmark on M2 Ultra:
TODO
- `is_same` methods?

Next PRs
- `llama_graph_result_i` interface - does not seem to have any purpose (graph : refactor context to not pass gf explicitly #14629)
- Avoid passing `ggml_cgraph * gf` everywhere; simply move it to `llm_graph_context` (graph : refactor context to not pass gf explicitly #14629)