Online EAGLE Draft Learning

Manoj
Author
ML Engineer @ 7-Eleven

Core Idea

Speculative decoding produces a feedback signal on every verification step:

  • accepted draft tokens agreed with the target model,
  • rejected draft tokens did not.

Instead of discarding that signal, use it to continuously improve a draft head for the actual production query distribution.
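
As a rough illustration of how that signal could become training data, the sketch below labels the draft tokens from one verification step. It assumes greedy verification (a draft token is accepted only if it equals the target model's own choice) and a single chain of draft tokens rather than a draft tree. `DraftExample` and `label_draft_tokens` are hypothetical names, not part of EAGLE or any library.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class DraftExample:
    context: Tuple[int, ...]  # token ids the draft head conditioned on
    proposed: int             # token the draft head proposed
    target: int               # token the target model chose at this position
    accepted: bool            # True if the proposal survived verification


def label_draft_tokens(
    prefix: Sequence[int], draft: Sequence[int], target: Sequence[int]
) -> List[DraftExample]:
    """Turn one verification step into labeled examples (greedy assumption)."""
    examples = []
    context = list(prefix)
    for proposed, wanted in zip(draft, target):
        accepted = proposed == wanted
        examples.append(DraftExample(tuple(context), proposed, wanted, accepted))
        if not accepted:
            # Later target outputs were conditioned on a rejected token,
            # so they are not valid labels and are dropped.
            break
        context.append(proposed)
    return examples


examples = label_draft_tokens(prefix=[1, 2], draft=[7, 8, 9], target=[7, 5, 9])
for ex in examples:
    print(ex)
```

Note that the rejected position still carries a useful label: the token the target model chose instead.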

Background

EAGLE accelerates autoregressive decoding by having a lightweight draft head predict future features or tokens, which the target model then verifies. The draft head is typically trained offline on a fixed dataset, so the accept/reject stream generated in production goes unused, even though it is a natural online dataset.
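
To make the draft-then-verify step concrete, here is a minimal sketch of greedy verification over a single chain of draft tokens. This is a simplification: EAGLE itself verifies a tree of candidates, and sampling-based verification uses a probabilistic accept test. `verify_greedy` is a hypothetical helper, and `target` stands for the target model's greedy choice at each drafted position plus one extra position.

```python
from typing import List, Sequence, Tuple


def verify_greedy(draft: Sequence[int], target: Sequence[int]) -> Tuple[int, List[int]]:
    """Accept the longest draft prefix that matches the target's choices.

    Returns (number of accepted draft tokens, tokens emitted this pass).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more entry than draft")
    n_accepted = 0
    for proposed, wanted in zip(draft, target):
        if proposed != wanted:
            break
        n_accepted += 1
    # The target always contributes one token of its own: a correction after
    # a rejection, or a bonus token when every draft token was accepted.
    emitted = list(draft[:n_accepted]) + [target[n_accepted]]
    return n_accepted, emitted


print(verify_greedy([7, 8, 9], [7, 5, 9, 4]))
print(verify_greedy([7, 8], [7, 8, 3]))
```

Under this greedy assumption the emitted tokens are exactly what the target model would have produced on its own, which is why output quality is preserved.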

Training Loop

flowchart TD
  Traffic[Production traffic] --> Draft[Draft head proposes tokens]
  Draft --> Verify[Target model verifies]
  Verify --> Accepted[Accepted tokens]
  Verify --> Rejected[Rejected tokens]
  Accepted --> Buffer[Online training buffer]
  Rejected --> Buffer
  Buffer --> Update[Small draft-head update]
  Update --> Draft
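
The loop in the flowchart can be sketched end to end with a toy stand-in for the draft head. Everything here is an assumption made for illustration: `ToyDraftHead` is a table of logits indexed by the previous token (a real draft head is a small neural network), the "target model" is simulated by a fixed rule, and the buffer size, batch size, and learning rate are arbitrary.

```python
import math
import random
from collections import deque


class ToyDraftHead:
    """Stand-in for a draft head: one row of logits per previous token."""

    def __init__(self, vocab_size):
        self.logits = [[0.0] * vocab_size for _ in range(vocab_size)]

    def probs(self, last_token):
        row = self.logits[last_token]
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        return [e / total for e in exps]

    def propose(self, last_token):
        row = self.logits[last_token]
        return max(range(len(row)), key=row.__getitem__)

    def sgd_step(self, batch, lr):
        """One cross-entropy step toward the target model's tokens."""
        for last_token, target_token in batch:
            p = self.probs(last_token)
            row = self.logits[last_token]
            for tok in range(len(row)):
                grad = p[tok] - (1.0 if tok == target_token else 0.0)
                row[tok] -= lr * grad


def online_update(head, buffer, batch_size=8, lr=0.5, rng=random):
    """Apply one small update once the buffer holds enough examples."""
    if len(buffer) < batch_size:
        return False
    batch = rng.sample(list(buffer), batch_size)
    head.sgd_step(batch, lr)
    return True


# Simulated traffic: the "target model" always follows token t with (t + 1) % 4.
rng = random.Random(0)
head = ToyDraftHead(vocab_size=4)
buffer = deque(maxlen=64)  # bounded, so old traffic ages out
for _ in range(200):
    last = rng.randrange(4)
    target_token = (last + 1) % 4
    # Accepted or rejected, the training label is the target model's token.
    buffer.append((last, target_token))
    online_update(head, buffer, rng=rng)

accept_rate = sum(head.propose(t) == (t + 1) % 4 for t in range(4)) / 4
print(accept_rate)
```

The bounded `deque` is one simple way to keep the buffer focused on recent traffic; whether that helps or hurts under non-stationary traffic is exactly the kind of question the experiments would need to answer.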

Research Questions

  • How much online updating is safe before distribution drift hurts?
  • Can updates be batched nightly instead of truly online?
  • Does domain-specific traffic improve acceptance rate enough to matter?
  • Can the system prevent catastrophic degradation with rollback tests?
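
One possible shape for the rollback test in the last question is a gate that only promotes an updated draft head if its acceptance rate on a fixed held-out set does not drop by more than a tolerance. The sketch below uses a lookup table as a toy draft head; `GuardedDraftHead`, `max_drop`, and the held-out format are hypothetical choices, not an established design.

```python
def acceptance_rate(propose, heldout):
    """Fraction of held-out contexts where the proposal equals the target token."""
    hits = sum(propose(ctx) == tgt for ctx, tgt in heldout)
    return hits / len(heldout)


class GuardedDraftHead:
    """Serves a draft head and refuses updates that regress on held-out data."""

    def __init__(self, table, heldout, max_drop=0.02):
        self.table = dict(table)   # toy head: context -> proposed token
        self.heldout = heldout     # list of (context, target token) pairs
        self.max_drop = max_drop   # tolerated drop in acceptance rate

    def propose(self, ctx):
        return self.table.get(ctx)

    def try_update(self, new_table):
        baseline = acceptance_rate(self.propose, self.heldout)
        candidate = acceptance_rate(new_table.get, self.heldout)
        if candidate < baseline - self.max_drop:
            return False           # keep serving the current head
        self.table = dict(new_table)
        return True


heldout = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
head = GuardedDraftHead({"a": 1, "b": 2, "c": 0, "d": 0}, heldout)

good = {"a": 1, "b": 2, "c": 3, "d": 0}  # 3 of 4 correct
bad = {"a": 0, "b": 0, "c": 3, "d": 0}   # 1 of 4 correct
promoted_good = head.try_update(good)
promoted_bad = head.try_update(bad)
print(promoted_good, promoted_bad)
```

Because the target model still verifies every token, a bad draft head costs speed rather than output quality, so a gate like this protects latency rather than correctness.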

Metrics

  • acceptance rate,
  • tokens per target forward pass,
  • latency improvement,
  • quality equivalence to vanilla decoding,
  • stability over time.
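
The first two metrics can be computed directly from verification logs. The sketch below assumes a hypothetical log record, `VerifyStep`, with the number of draft tokens proposed and accepted per target forward pass, and assumes each pass also emits one token chosen by the target model itself.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class VerifyStep:
    proposed: int  # draft tokens sent to the target in this forward pass
    accepted: int  # how many of them the target accepted


def summarize(steps: Iterable[VerifyStep]) -> Dict[str, float]:
    steps = list(steps)
    proposed = sum(s.proposed for s in steps)
    if not steps or proposed == 0:
        raise ValueError("need at least one step with proposed tokens")
    accepted = sum(s.accepted for s in steps)
    # Assumption: every pass emits the accepted tokens plus one target token.
    emitted = accepted + len(steps)
    return {
        "acceptance_rate": accepted / proposed,
        "tokens_per_target_pass": emitted / len(steps),
    }


log = [VerifyStep(4, 4), VerifyStep(4, 1), VerifyStep(4, 3)]
stats = summarize(log)
print(stats)
```

Tracking these values per time window, rather than as a single aggregate, is one way to approach the stability metric.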

Novelty Opinion

Novelty is high and the setup is very clean: the signal is already present, the model being updated is small, and the outcome is easy to measure.

Timeline And Complexity

  • Prototype: 4-6 weeks.
  • Paper-grade: 2-4 months.
  • Complexity: Medium.
  • Main risk: production traffic may be too sparse or non-stationary for reliable online updates.