
convert: add eagle2 draft arch #13908


Open: pockers21 wants to merge 3 commits into master from feature-convert-eagle2-draft

Conversation

@pockers21 pockers21 (Contributor) commented May 30, 2025

EAGLE-2 Speculative Decoding Support for llama.cpp - Phase 1 Submission

Overview

This PR introduces EAGLE-2 (Extrapolation Algorithm for Greater Language-model Efficiency) speculative decoding support for llama.cpp. EAGLE-2 is a speculative decoding technique that uses a small draft model, combined with a dynamic draft tree, to accelerate inference of a larger target model.

Paper: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Official Repository: https://github.com/SafeAILab/EAGLE
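For readers unfamiliar with the general idea, here is a deliberately simplified toy sketch of token-level draft-and-verify speculative decoding. It is not EAGLE-2's algorithm (EAGLE-2 drafts at the feature level, reusing the target model's hidden states, and expands a dynamic draft tree rather than a single chain), and the draft_next/target_next callables are hypothetical stand-ins for greedy single-token prediction by the draft and target models:

# Toy illustration of generic draft-and-verify speculative decoding.
# NOT EAGLE-2 itself: shown only to convey why a cheap draft model can speed up a large target.
def speculative_generate(tokens: list[int], draft_next, target_next,
                         n_draft: int = 4, max_new: int = 64) -> list[int]:
    tokens = list(tokens)
    produced = 0
    while produced < max_new:
        # 1. The small draft model cheaply proposes a short continuation.
        proposed, ctx = [], list(tokens)
        for _ in range(n_draft):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The large target model verifies the proposal. A real implementation
        #    scores all proposed positions in one batched forward pass, so an
        #    accepted prefix yields several tokens for roughly one target step.
        for t in proposed:
            expected = target_next(tokens)
            produced += 1
            tokens.append(t if expected == t else expected)
            if expected != t or produced >= max_new:
                break
    return tokens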

Implementation Status

EAGLE-2 is fully implemented and functional - complete with model conversion, loading, inference, and the speculative decoding algorithm.

For maintainability and review efficiency, we’re submitting the implementation in multiple focused phases rather than as one large PR.

Phase 1: Model Conversion Infrastructure

This submission focuses on the model conversion component - enabling EAGLE-2 draft models to be converted from safetensors to GGUF format.

Why Phased Submission?

We've chosen a phased submission approach for the following reasons:

  1. Review Efficiency: Smaller, focused PRs are easier for maintainers to review thoroughly
  2. Testing & Validation: Allows community testing of each component before building on it
  3. Incremental Integration: Reduces risk of conflicts and integration issues
  4. Code Quality: Enables focused feedback on specific functionality

Submission Roadmap

  • Phase 1 (Current): Model conversion and GGUF format support
  • Phase 2 (Next): Draft model loading and computation graph implementation
  • Phase 3 (Final): Core speculative decoding algorithm and inference logic

All phases are already implemented and tested - this roadmap represents submission phases only.

Current Submission Features

Enhanced Model Conversion

  • ✅ Support for EAGLE-2 draft model architecture in convert_hf_to_gguf.py
  • ✅ Proper handling of draft model specific layers and weights
  • ✅ Seamless integration with existing GGUF conversion pipeline

Model Size Optimization

EAGLE-2 draft models offer significant size advantages:

Model Type               Original Size   GGUF F16   GGUF Q4_K_M   Compression Ratio
Qwen2-7B-Instruct        ~15 GB          13.5 GB    4.6 GB        3.3x
EAGLE-Qwen2-1.4B Draft   N/A             2.7 GB     0.93 GB       16x vs original

Draft models achieve this extreme compression because each one is essentially a single decoder layer extracted and optimized from the original model.

Demonstrated Performance

The complete EAGLE-2 implementation has been tested on an NVIDIA RTX 4080 using a small set of prompts from the ShareGPT dataset:

Qwen2-7B-Instruct + EAGLE-Qwen2-1.4B Draft Model

Sampling Strategy   Target Model   Draft Model   Baseline    With EAGLE-2   Speedup
Temperature 0.0     BF16           F16           37 tok/s    50 tok/s       1.35x
Temperature 0.1     BF16           F16           25 tok/s    53 tok/s       2.1x
Mixed Sampling      BF16           F16           40 tok/s    44.6 tok/s     1.12x
Temperature 0.0     BF16           Q4_K_M        37 tok/s    57 tok/s       1.54x
Temperature 0.1     BF16           Q4_K_M        25 tok/s    54 tok/s       2.1x
Mixed Sampling      BF16           Q4_K_M        40 tok/s    50 tok/s       1.25x

Performance Notes:

  • Mixed sampling includes temperature + top_k + top_p
  • Results demonstrate up to 2.1x acceleration
  • Performance matches the expectations set by the original Python implementation

Technical Implementation

Architecture Support

  • Extended convert_hf_to_gguf.py with EAGLE-2 model detection (sketched below)
  • Proper weight mapping for draft model specific layers
  • Full compatibility with existing llama.cpp infrastructure
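A minimal sketch of the registration this PR adds to convert_hf_to_gguf.py, as it stands after the review feedback further down (the exact MODEL_ARCH constant is illustrative and may be named differently in the actual change):

# Sketch only: registers the EAGLE-2 draft architecture and reuses the existing
# Qwen2 conversion logic, overriding nothing but the architecture id.
@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(Qwen2Model):
    model_arch = gguf.MODEL_ARCH.EAGLE2_DRAFT  # hypothetical enum name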

Memory Efficiency

  • Draft models use single decoder layers, dramatically reducing memory footprint
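As a rough sanity check on the numbers above (assuming the ~1.4B parameter count implied by the model name): 1.4B parameters × 2 bytes per parameter in F16 ≈ 2.8 GB, which lines up with the 2.7 GB GGUF F16 size reported in the size table.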

Testing Resources

Pre-converted Models

For immediate testing and validation:

EAGLE-Qwen2 Draft Model (F16):

Testing the Conversion

# Convert EAGLE-2 draft model to GGUF
python convert_hf_to_gguf.py /path/to/eagle-draft-model --outfile eagle-draft.gguf
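To reproduce the Q4_K_M figure from the size table above, the F16 GGUF can then be quantized with llama.cpp's quantize tool (shipped as llama-quantize in recent builds; the binary name and path may differ depending on how you built the project):

# Quantize the converted draft model to Q4_K_M
./llama-quantize eagle-draft.gguf eagle-draft-Q4_K_M.gguf Q4_K_M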

Current Implementation Characteristics

Operational Parameters

  1. Platform Performance: Optimized for GPU inference; CPU inference optimization pending.
  2. Sequence Processing: Single-sequence inference (consistent with original EAGLE-2 research design)
  3. Model Support: Currently validated with Qwen2; architecture supports LLaMA 2/3, Vicuna, etc.

Architecture Design

  • EAGLE-2 algorithm is designed for GPU acceleration scenarios
  • Implementation follows original research specifications
  • Full compatibility with llama.cpp ecosystem

Future Submissions

The remaining implementation components will be submitted in subsequent PRs:

  • Phase 2: Draft model loading, computation graph, and memory management
  • Phase 3: Complete speculative decoding algorithm and inference engine

Acknowledgments

  • Original EAGLE-2 research team at SafeAI Lab
  • llama.cpp community for the robust infrastructure
  • Early adopters and testers providing valuable feedback

This Phase 1 submission provides the model conversion foundation for EAGLE-2 speculative decoding in llama.cpp. The complete implementation demonstrates significant inference acceleration while maintaining compatibility with the llama.cpp ecosystem.

@github-actions github-actions bot added the python python script changes label May 30, 2025
@@ -2712,6 +2712,25 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
yield from super().modify_tensors(data_torch, name, bid)


@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(TextModel):
Collaborator

Since this is a Qwen2 model and it has no distinguishing conversion features, make it inherit from Qwen2Model and leave out everything but model_arch.

@pockers21 pockers21 (Contributor, Author) commented Jun 1, 2025

> Since this is a Qwen2 model and it has no distinguishing conversion features, make it inherit from Qwen2Model and leave out everything but model_arch.

Fixed in the latest commit; it now inherits from Qwen2Model with only the model_arch override.

@@ -58,6 +58,10 @@ class TensorNameMap:
"wpe", # gpt2
),

# eagle2 draft model
MODEL_TENSOR.FC: (
"model.fc",
Collaborator

I'm not sure about this name (also it appears to be just fc in your GGUF), what is the purpose of this tensor?

@pockers21 pockers21 (Contributor, Author)

> I'm not sure about this name (also it appears to be just fc in your GGUF), what is the purpose of this tensor?

The EAGLE-2 paper doesn't explain this mechanism particularly clearly, but Figure 6 of the EAGLE-1 paper (https://arxiv.org/html/2401.15077v3) illustrates it.
The fc layer takes the concatenation of the current hidden_states with the hidden_states passed from the previous inference step, a tensor of dimension 2 * config.hidden_size, and maps it back down to config.hidden_size, as demonstrated in the code at https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets1.py#L524.
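For illustration, a minimal PyTorch sketch of the mechanism described above (the hidden size and shapes are arbitrary example values; the authoritative code is the cnets1.py link):

import torch
import torch.nn as nn

hidden_size = 4096  # stands in for config.hidden_size

# fc maps the concatenated pair of hidden states back down to hidden_size.
fc = nn.Linear(2 * hidden_size, hidden_size)

# Current hidden states and the hidden states carried over from the previous
# inference step (batch=1, seq_len=8 chosen arbitrarily for the example).
current_hidden = torch.randn(1, 8, hidden_size)
previous_hidden = torch.randn(1, 8, hidden_size)

# Concatenate along the feature dimension -> (1, 8, 2 * hidden_size), then
# project back to (1, 8, hidden_size) before feeding the draft decoder layer.
fused = fc(torch.cat((current_hidden, previous_hidden), dim=-1))
print(fused.shape)  # torch.Size([1, 8, 4096])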

@pockers21 pockers21 force-pushed the feature-convert-eagle2-draft branch from e0714aa to ec22e4b Compare June 1, 2025 10:45