server : disable speculative decoding for SWA models #13970

Merged: 2 commits merged into master from gg/server-spec-swa-clear on Jun 2, 2025

Conversation

@ggerganov ggerganov (Member) commented Jun 2, 2025

ref #13963

To properly support this, we first need to fix this TODO in llama_kv_cache_unified_iswa::init_batch:

```cpp
llama_memory_state_ptr llama_kv_cache_unified_iswa::init_batch(const llama_batch & batch, uint32_t n_ubatch, bool embd_pooled, bool logits_all) {
    GGML_UNUSED(embd_pooled);

    // TODO: if we fail with split_simple, we should attempt different splitting strategies
    //       but to do that properly, we first have to refactor the batches to be more flexible

    auto sbatch = llama_sbatch(batch, hparams.n_embd, true, logits_all);

    std::vector<llama_ubatch> ubatches;
```
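
For context, here is a minimal sketch of what such a splitting-strategy fallback could look like, built on the split_simple/split_equal strategies that llama_sbatch already exposes. The fits_memory() helper is hypothetical and does not exist in the codebase; this is an illustration of the idea behind the TODO, not the actual fix:

```cpp
// Hypothetical sketch (not the actual fix): try the simple split first;
// if the resulting ubatches cannot be placed in the cache, rebuild the
// sbatch and retry with an equal split.
auto sbatch = llama_sbatch(batch, hparams.n_embd, /*simple_split=*/true, logits_all);

std::vector<llama_ubatch> ubatches;
while (sbatch.n_tokens > 0) {
    ubatches.push_back(sbatch.split_simple(n_ubatch));
}

// fits_memory() is a hypothetical check that all ubatches can be
// allocated in the (SWA) KV cache.
if (!fits_memory(ubatches)) {
    sbatch = llama_sbatch(batch, hparams.n_embd, /*simple_split=*/false, logits_all);
    ubatches.clear();
    while (sbatch.n_tokens > 0) {
        ubatches.push_back(sbatch.split_equal(n_ubatch));
    }
}
```

As the TODO notes, doing this properly would first require refactoring the batch structures to be more flexible, which is why this PR disables speculative decoding for SWA models instead.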

@ggerganov ggerganov changed the title server : use swa-full for draft context server : disable speculative decoding for SWA models Jun 2, 2025
@ggerganov ggerganov marked this pull request as ready for review June 2, 2025 18:07
@ggerganov ggerganov requested a review from ngxson as a code owner June 2, 2025 18:07
@ggerganov ggerganov merged commit 3637576 into master Jun 2, 2025
46 checks passed
@ggerganov ggerganov deleted the gg/server-spec-swa-clear branch June 2, 2025 18:34