server : disable speculative decoding for SWA models #13970

Merged: 2 commits merged into master from gg/server-spec-swa-clear on Jun 2, 2025

Conversation

@ggerganov ggerganov (Member) commented Jun 2, 2025

ref #13963

To properly support this, we first need to fix this TODO in llama_kv_cache_unified_iswa::init_batch:

```cpp
llama_memory_state_ptr llama_kv_cache_unified_iswa::init_batch(const llama_batch & batch, uint32_t n_ubatch, bool embd_pooled, bool logits_all) {
    GGML_UNUSED(embd_pooled);

    // TODO: if we fail with split_simple, we should attempt different splitting strategies
    //       but to do that properly, we first have to refactor the batches to be more flexible

    auto sbatch = llama_sbatch(batch, hparams.n_embd, true, logits_all);

    std::vector<llama_ubatch> ubatches;
```
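
For context, here is a minimal sketch of what such a splitting-strategy fallback could look like, built on the split_simple/split_equal strategies that llama_sbatch already exposes. The fits_memory() helper is hypothetical and does not exist in the codebase; this is an illustration of the idea behind the TODO, not the actual fix:

```cpp
// Hypothetical sketch (not the actual fix): try the simple split first;
// if the resulting ubatches cannot be placed in the cache, rebuild the
// sbatch and retry with an equal split.
auto sbatch = llama_sbatch(batch, hparams.n_embd, /*simple_split=*/true, logits_all);

std::vector<llama_ubatch> ubatches;
while (sbatch.n_tokens > 0) {
    ubatches.push_back(sbatch.split_simple(n_ubatch));
}

// fits_memory() is a hypothetical check that all ubatches can be
// allocated in the (SWA) KV cache.
if (!fits_memory(ubatches)) {
    sbatch = llama_sbatch(batch, hparams.n_embd, /*simple_split=*/false, logits_all);
    ubatches.clear();
    while (sbatch.n_tokens > 0) {
        ubatches.push_back(sbatch.split_equal(n_ubatch));
    }
}
```

As the TODO notes, doing this properly would first require refactoring the batch structures to be more flexible, which is why this PR disables speculative decoding for SWA models instead.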

@ggerganov ggerganov changed the title server : use swa-full for draft context server : disable speculative decoding for SWA models Jun 2, 2025
@ggerganov ggerganov marked this pull request as ready for review June 2, 2025 18:07
@ggerganov ggerganov requested a review from ngxson as a code owner June 2, 2025 18:07
@ggerganov ggerganov merged commit 3637576 into master Jun 2, 2025
46 checks passed
@ggerganov ggerganov deleted the gg/server-spec-swa-clear branch June 2, 2025 18:34