SpecDraft Cloud

Author: Manoj, ML Engineer @ 7-Eleven

Pitch

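SpecDraft Cloud wraps an LLM deployment with managed speculative decoding. Its draft heads improve over time by training on the customer’s own accept/reject stream: the per-token record of which drafted tokens the target model accepted or rejected during verification.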

Flywheel

More traffic produces more accept/reject labels. More labels improve the draft head. A better draft head raises the acceptance rate. A higher acceptance rate lowers decode latency and cost, which in turn makes it cheaper to serve more traffic and closes the loop.

flowchart LR
  Traffic --> Labels[Accept/reject labels]
  Labels --> Train[Draft-head tuning]
  Train --> Acceptance[Higher acceptance rate]
  Acceptance --> Savings[Lower latency and cost]
  Savings --> Traffic
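
The link between acceptance rate and decode cost can be made concrete with a small calculation. The sketch below is not SpecDraft Cloud's cost model; it uses the standard speculative decoding estimate and assumes each drafted token is accepted independently with probability `alpha`, that the draft head proposes `k` tokens per step, and that the draft head's own compute is negligible. The function names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that every pass emits one extra token from the
    target model (the correction on rejection, or a bonus token when all
    k drafts are accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted: k drafted tokens plus the bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def target_passes_per_token(alpha: float, k: int) -> float:
    """Average number of target-model passes needed per emitted token."""
    return 1.0 / expected_tokens_per_step(alpha, k)
```

Under these assumptions, with `k = 4`, raising `alpha` from 0.6 to 0.8 lifts the expected tokens per target pass from about 2.31 to about 3.36, which is the "better draft head, lower decode cost" step of the flywheel.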

Customer

API companies and LLM SaaS products with domain-specific traffic patterns.

Risks

  • Personalization needs enough traffic: a low-volume customer may not generate enough accept/reject labels to tune a draft head.
  • Customers may not want a managed service in their inference path.
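  • Maintaining output equivalence is non-negotiable: speculation must not change the outputs (or, under sampling, the output distribution) that the target model alone would produce.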
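
To illustrate the output-equivalence requirement, here is a minimal sketch of greedy verification, the simplest setting in which equivalence is exact. It is an illustration under stated assumptions, not SpecDraft Cloud's implementation: `verify_greedy` and `target_next` are hypothetical names, tokens are plain integers, and the target model is queried one position at a time for clarity, whereas a real system would typically score all drafted positions in a single forward pass. The same loop also shows where the accept/reject labels come from.

```python
from typing import Callable, List, Sequence, Tuple


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> Tuple[List[int], List[bool]]:
    """Verify drafted tokens against the target model's greedy choice.

    target_next(context) is assumed to return the token the target model
    would pick greedily after `context`. Returns the tokens to emit and
    one accept/reject label per drafted token that was actually checked.
    """
    context = list(prefix)
    emitted: List[int] = []
    labels: List[bool] = []
    for token in draft:
        expected = target_next(context)
        if token == expected:
            # Draft matches the target's own choice: accept it.
            labels.append(True)
            emitted.append(token)
            context.append(token)
        else:
            # First mismatch: reject, emit the target's token, and stop.
            # Drafted tokens after this point are never checked or labeled.
            labels.append(False)
            emitted.append(expected)
            return emitted, labels
    # All drafts accepted: the target still contributes one more token.
    emitted.append(target_next(context))
    return emitted, labels
```

Because every emitted token is either confirmed by or taken directly from the target model, the output matches plain greedy decoding with the target alone. The draft head only affects how many tokens each step yields, never which tokens they are.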