
convert: add eagle2 draft arch #13908


Open: pockers21 wants to merge 3 commits into master from feature-convert-eagle2-draft

Conversation

@pockers21 pockers21 (Contributor) commented May 30, 2025

EAGLE-2 Speculative Decoding Support for llama.cpp - Phase 1 Submission

Overview

This PR introduces EAGLE-2 (Extrapolation Algorithm for Greater Language-model Efficiency) speculative decoding support for llama.cpp. EAGLE-2 is a speculative decoding technique that uses a small draft model, combined with a dynamic draft tree, to accelerate inference of a larger target model.

Paper: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Official Repository: https://github.com/SafeAILab/EAGLE
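For readers unfamiliar with the general idea, here is a deliberately simplified toy sketch of token-level draft-and-verify speculative decoding. It is not EAGLE-2's algorithm (EAGLE-2 drafts at the feature level, reusing the target model's hidden states, and expands a dynamic draft tree rather than a single chain), and the draft_next/target_next callables are hypothetical stand-ins for greedy single-token prediction by the draft and target models:

# Toy illustration of generic draft-and-verify speculative decoding.
# NOT EAGLE-2 itself: shown only to convey why a cheap draft model can speed up a large target.
def speculative_generate(tokens: list[int], draft_next, target_next,
                         n_draft: int = 4, max_new: int = 64) -> list[int]:
    tokens = list(tokens)
    produced = 0
    while produced < max_new:
        # 1. The small draft model cheaply proposes a short continuation.
        proposed, ctx = [], list(tokens)
        for _ in range(n_draft):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The large target model verifies the proposal. A real implementation
        #    scores all proposed positions in one batched forward pass, so an
        #    accepted prefix yields several tokens for roughly one target step.
        for t in proposed:
            expected = target_next(tokens)
            produced += 1
            tokens.append(t if expected == t else expected)
            if expected != t or produced >= max_new:
                break
    return tokens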

Implementation Status

EAGLE-2 is fully implemented and functional - complete with model conversion, loading, inference, and the speculative decoding algorithm.

For maintainability and review efficiency, we’re submitting the implementation in multiple focused phases rather than as one large PR.

Phase 1: Model Conversion Infrastructure

This submission focuses on the model conversion component - enabling EAGLE-2 draft models to be converted from safetensors to GGUF format.

Why Phased Submission?

We've chosen a phased submission approach for the following reasons:

  1. Review Efficiency: Smaller, focused PRs are easier for maintainers to review thoroughly
  2. Testing & Validation: Allows community testing of each component before building on it
  3. Incremental Integration: Reduces risk of conflicts and integration issues
  4. Code Quality: Enables focused feedback on specific functionality

Submission Roadmap

  • Phase 1 (Current): Model conversion and GGUF format support
  • Phase 2 (Next): Draft model loading and computation graph implementation
  • Phase 3 (Final): Core speculative decoding algorithm and inference logic

All phases are already implemented and tested - this roadmap represents submission phases only.

Current Submission Features

Enhanced Model Conversion

  • ✅ Support for EAGLE-2 draft model architecture in convert_hf_to_gguf.py
  • ✅ Proper handling of draft model specific layers and weights
  • ✅ Seamless integration with existing GGUF conversion pipeline

Model Size Optimization

EAGLE-2 draft models offer significant size advantages:

Model Type               Original Size   GGUF F16   GGUF Q4_K_M   Compression Ratio
Qwen2-7B-Instruct        ~15 GB          13.5 GB    4.6 GB        3.3x
EAGLE-Qwen2-1.4B Draft   N/A             2.7 GB     0.93 GB       16x vs original

Draft models achieve this extreme compression because each one is essentially a single decoder layer extracted and optimized from the original model.

Demonstrated Performance

The complete EAGLE-2 implementation has been tested on an NVIDIA RTX 4080 using a small set of prompts from the ShareGPT dataset:

Qwen2-7B-Instruct + EAGLE-Qwen2-1.4B Draft Model

Sampling Strategy   Target Model   Draft Model   Baseline    With EAGLE-2   Speedup
Temperature 0.0     BF16           F16           37 tok/s    50 tok/s       1.35x
Temperature 0.1     BF16           F16           25 tok/s    53 tok/s       2.1x
Mixed Sampling      BF16           F16           40 tok/s    44.6 tok/s     1.12x
Temperature 0.0     BF16           Q4_K_M        37 tok/s    57 tok/s       1.54x
Temperature 0.1     BF16           Q4_K_M        25 tok/s    54 tok/s       2.1x
Mixed Sampling      BF16           Q4_K_M        40 tok/s    50 tok/s       1.25x

Performance Notes:

  • Mixed sampling includes temperature + top_k + top_p
  • Results demonstrate up to 2.1x acceleration
  • Performance matches the expectations set by the original Python implementation

Technical Implementation

Architecture Support

  • Extended convert_hf_to_gguf.py with EAGLE-2 model detection (sketched below)
  • Proper weight mapping for draft model specific layers
  • Full compatibility with existing llama.cpp infrastructure
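A minimal sketch of the registration this PR adds to convert_hf_to_gguf.py, as it stands after the review feedback further down (the exact MODEL_ARCH constant is illustrative and may be named differently in the actual change):

# Sketch only: registers the EAGLE-2 draft architecture and reuses the existing
# Qwen2 conversion logic, overriding nothing but the architecture id.
@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(Qwen2Model):
    model_arch = gguf.MODEL_ARCH.EAGLE2_DRAFT  # hypothetical enum name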

Memory Efficiency

  • Draft models use single decoder layers, dramatically reducing memory footprint
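As a rough sanity check on the numbers above (assuming the ~1.4B parameter count implied by the model name): 1.4B parameters × 2 bytes per parameter in F16 ≈ 2.8 GB, which lines up with the 2.7 GB GGUF F16 size reported in the size table.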

Testing Resources

Pre-converted Models

For immediate testing and validation:

EAGLE-Qwen2 Draft Model (F16):

Testing the Conversion

# Convert EAGLE-2 draft model to GGUF
python convert_hf_to_gguf.py /path/to/eagle-draft-model --outfile eagle-draft.gguf
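To reproduce the Q4_K_M figure from the size table above, the F16 GGUF can then be quantized with llama.cpp's quantize tool (shipped as llama-quantize in recent builds; the binary name and path may differ depending on how you built the project):

# Quantize the converted draft model to Q4_K_M
./llama-quantize eagle-draft.gguf eagle-draft-Q4_K_M.gguf Q4_K_M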

Current Implementation Characteristics

Operational Parameters

  1. Platform Performance: Optimized for GPU inference; CPU inference optimization pending.
  2. Sequence Processing: Single-sequence inference (consistent with original EAGLE-2 research design)
  3. Model Support: Currently validated with Qwen2; architecture supports LLaMA 2/3, Vicuna, etc.

Architecture Design

  • EAGLE-2 algorithm is designed for GPU acceleration scenarios
  • Implementation follows original research specifications
  • Full compatibility with llama.cpp ecosystem

Future Submissions

The remaining implementation components will be submitted in subsequent PRs:

  • Phase 2: Draft model loading, computation graph, and memory management
  • Phase 3: Complete speculative decoding algorithm and inference engine

Acknowledgments

  • Original EAGLE-2 research team at SafeAI Lab
  • llama.cpp community for the robust infrastructure
  • Early adopters and testers providing valuable feedback

This Phase 1 submission provides the model conversion foundation for EAGLE-2 speculative decoding in llama.cpp. The complete implementation demonstrates significant inference acceleration while maintaining compatibility with the llama.cpp ecosystem.

@github-actions github-actions bot added the python python script changes label May 30, 2025
@@ -2712,6 +2712,25 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
yield from super().modify_tensors(data_torch, name, bid)


@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(TextModel):
Collaborator

Since this is a Qwen2 model and it has no distinguishing conversion features, make it inherit from Qwen2Model and leave out everything but model_arch.

@pockers21 pockers21 (Contributor, Author) commented Jun 1, 2025

> Since this is a Qwen2 model and it has no distinguishing conversion features, make it inherit from Qwen2Model and leave out everything but model_arch.

Fixed in the latest commit; it now inherits from Qwen2Model with only the model_arch override.

@@ -58,6 +58,10 @@ class TensorNameMap:
"wpe", # gpt2
),

# eagle2 draft model
MODEL_TENSOR.FC: (
"model.fc",
Collaborator

I'm not sure about this name (also it appears to be just fc in your GGUF), what is the purpose of this tensor?

@pockers21 pockers21 (Contributor, Author)

> I'm not sure about this name (also it appears to be just fc in your GGUF), what is the purpose of this tensor?

The EAGLE-2 paper doesn't explain this mechanism particularly clearly, but Figure 6 of the EAGLE-1 paper (https://arxiv.org/html/2401.15077v3) illustrates it.
The fc layer takes the concatenation of the current hidden_states with the hidden_states passed from the previous inference step, a tensor of dimension 2 * config.hidden_size, and maps it back down to config.hidden_size, as demonstrated in the code at https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets1.py#L524.
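For illustration, a minimal PyTorch sketch of the mechanism described above (the hidden size and shapes are arbitrary example values; the authoritative code is the cnets1.py link):

import torch
import torch.nn as nn

hidden_size = 4096  # stands in for config.hidden_size

# fc maps the concatenated pair of hidden states back down to hidden_size.
fc = nn.Linear(2 * hidden_size, hidden_size)

# Current hidden states and the hidden states carried over from the previous
# inference step (batch=1, seq_len=8 chosen arbitrarily for the example).
current_hidden = torch.randn(1, 8, hidden_size)
previous_hidden = torch.randn(1, 8, hidden_size)

# Concatenate along the feature dimension -> (1, 8, 2 * hidden_size), then
# project back to (1, 8, hidden_size) before feeding the draft decoder layer.
fused = fc(torch.cat((current_hidden, previous_hidden), dim=-1))
print(fused.shape)  # torch.Size([1, 8, 4096])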

@pockers21 pockers21 force-pushed the feature-convert-eagle2-draft branch from e0714aa to ec22e4b Compare June 1, 2025 10:45