convert: add eagle2 draft arch #13908
base: master
Conversation
convert_hf_to_gguf.py (outdated)
```diff
@@ -2712,6 +2712,25 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
         yield from super().modify_tensors(data_torch, name, bid)


+@ModelBase.register("Eagle2DraftForCausalLM")
+class Eagle2DraftModel(TextModel):
```
Since this is a Qwen2 model and it has no distinguishing conversion features, make it inherit from `Qwen2Model` and leave out everything but `model_arch`.
Fixed in the latest commit; it now inherits from `Qwen2Model` with only the `model_arch` override.
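For reference, a sketch of the minimal shape the class takes after that change; the `MODEL_ARCH` constant name below is my assumption, not taken from the diff:

```python
# Minimal sketch following the reviewer's suggestion; the MODEL_ARCH
# constant name is assumed for illustration, not confirmed by this PR.
@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(Qwen2Model):
    model_arch = gguf.MODEL_ARCH.EAGLE2_DRAFT
```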
```diff
@@ -58,6 +58,10 @@ class TensorNameMap:
             "wpe", # gpt2
         ),

+        # eagle2 draft model
+        MODEL_TENSOR.FC: (
+            "model.fc",
```
I'm not sure about this name (also it appears to be just `fc` in your GGUF); what is the purpose of this tensor?
The EAGLE-2 paper doesn't express this mechanism particularly clearly, but you can refer to Figure 6 in the EAGLE-1 paper at https://arxiv.org/html/2401.15077v3.
The FC layer concatenates the current hidden_states with the hidden_states passed from previous inference steps, forming a tensor of dimension 2 * config.hidden_size, and then maps it back down to config.hidden_size, as demonstrated in the code at https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets1.py#L524.
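A minimal PyTorch sketch of the step described above; the shapes and the concatenation order are illustrative assumptions, not the exact EAGLE code:

```python
import torch
import torch.nn as nn

hidden_size = 3584  # assumed Qwen2-7B-like width, for illustration
fc = nn.Linear(2 * hidden_size, hidden_size, bias=False)

# Hidden states carried over from the previous inference step,
# and the representations for the current step (batch=1, seq=8).
prev_hidden = torch.randn(1, 8, hidden_size)
curr_hidden = torch.randn(1, 8, hidden_size)

# Concatenate to 2 * hidden_size, then project back to hidden_size.
fused = fc(torch.cat((curr_hidden, prev_hidden), dim=-1))
assert fused.shape == (1, 8, hidden_size)
```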
Force-pushed from e0714aa to ec22e4b.
EAGLE-2 Speculative Decoding Support for llama.cpp - Phase 1 Submission
Overview
This PR introduces EAGLE-2 (Extrapolation Algorithm for Greater Language-model Efficiency) speculative decoding support for llama.cpp. EAGLE-2 is an advanced speculative decoding technique that uses a smaller draft model to accelerate inference of larger target models.
Paper: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Official Repository: https://github.com/SafeAILab/EAGLE
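For orientation, a conceptual sketch of the draft-and-verify loop that speculative decoding performs; this is a generic greedy illustration, not code from this PR, and `greedy_next` is a hypothetical helper:

```python
def speculative_step(target, draft, tokens: list[int], k: int = 4) -> list[int]:
    """One speculative decoding step: draft proposes, target verifies."""
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    proposed: list[int] = []
    for _ in range(k):
        proposed.append(draft.greedy_next(tokens + proposed))

    # 2) The large target model checks each proposal; in practice this is
    #    one batched forward pass rather than k separate calls.
    accepted: list[int] = []
    for i, tok in enumerate(proposed):
        expected = target.greedy_next(tokens + proposed[:i])
        if expected != tok:
            accepted.append(expected)  # keep the target's token on mismatch
            break
        accepted.append(tok)
    return accepted
```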
Implementation Status
✅ EAGLE-2 is fully implemented and functional - complete with model conversion, loading, inference, and the speculative decoding algorithm.
For maintainability and review efficiency, we’re submitting the implementation in multiple focused phases rather than as one large PR.
Phase 1: Model Conversion Infrastructure
This submission focuses on the model conversion component - enabling EAGLE-2 draft models to be converted from safetensors to GGUF format.
Why Phased Submission?
We've chosen a phased submission approach to keep each PR focused and easier to review.
Submission Roadmap
All phases are already implemented and tested - this roadmap represents submission phases only.
Current Submission Features
Enhanced Model Conversion
Registers the Eagle2DraftForCausalLM architecture in convert_hf_to_gguf.py
Model Size Optimization
EAGLE-2 draft models offer significant size advantages:
Draft models are dramatically smaller because each extracts and optimizes a single decoder layer from the original model, as the rough estimate below illustrates.
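As a back-of-envelope illustration only: every dimension below is an assumed Qwen2-7B-like value, not a measurement from this PR.

```python
# Parameter estimate for a one-decoder-layer draft model with an extra
# fc projection; all shapes here are assumptions for illustration.
hidden, intermediate, vocab = 3584, 18944, 152064

attn  = 4 * hidden * hidden        # q/k/v/o projections (ignoring GQA/biases)
mlp   = 3 * hidden * intermediate  # gate/up/down projections
fc    = 2 * hidden * hidden        # EAGLE fc: 2*hidden -> hidden
embed = 2 * vocab * hidden         # token embeddings + lm_head

total = attn + mlp + fc + embed
print(f"~{total / 1e9:.1f}B parameters")  # ~1.4B, vs ~7B for the target model
```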
Demonstrated Performance
The complete EAGLE-2 implementation has been tested on an NVIDIA RTX 4080 with a small sample of prompts from the ShareGPT dataset:
Qwen2-7B-Instruct + EAGLE-Qwen2-1.4B Draft Model
Performance Notes:
Technical Implementation
Architecture Support
convert_hf_to_gguf.py with EAGLE-2 model detection

Memory Efficiency
Testing Resources
Pre-converted Models
For immediate testing and validation:
EAGLE-Qwen2 Draft Model (F16):
Testing the Conversion
```sh
# Convert an EAGLE-2 draft model to GGUF
python convert_hf_to_gguf.py /path/to/eagle-draft-model --outfile eagle-draft.gguf
```
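As a quick sanity check of the output, the gguf Python package that ships with llama.cpp can read the file back; the architecture string and the fc tensor name below are assumptions based on this PR's diff:

```python
from gguf import GGUFReader

reader = GGUFReader("eagle-draft.gguf")
# Inspect the declared architecture and look for the draft model's fc tensor.
print(reader.fields["general.architecture"].parts)
for tensor in reader.tensors:
    if "fc" in tensor.name:
        print(tensor.name, tensor.shape)
```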
Current Implementation Characteristics
Operational Parameters
Architecture Design
Future Submissions
The remaining implementation components will be submitted in subsequent PRs:
Acknowledgments
This Phase 1 submission provides the model conversion foundation for EAGLE-2 speculative decoding in llama.cpp. The complete implementation demonstrates significant inference acceleration while maintaining compatibility with the llama.cpp ecosystem.