DraftOS

Author: Manoj, ML Engineer @ 7-Eleven

Pitch

GPU instances ship with many CPU cores that often sit underutilized during LLM serving. DraftOS runs a small draft model on those CPU cores while the GPU verifies the tokens drafted in the previous step.

Why It Is Clever

Speculative decoding usually places the draft model on the GPU, where it takes memory away from the target model. DraftOS tries to make the draft path “free” by moving it onto CPU resources the customer already rents.

Pipeline

sequenceDiagram
  participant CPU as CPU draft model
  participant GPU as GPU target model
  participant Out as Output stream
  CPU->>GPU: draft tokens for step t
  GPU->>GPU: verify tokens from step t-1
  GPU->>Out: accept/reject and emit
  CPU->>CPU: draft next tokens while GPU works
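
The overlap in the diagram can be sketched with two worker threads. This is a toy under stated assumptions, not the DraftOS implementation: `draft`, `verify`, and `target_next` are hypothetical stand-ins for the CPU draft model and the GPU target model, tokens are plain integers, and the next draft is produced optimistically on the guess that every pending token will be accepted.

```python
from concurrent.futures import ThreadPoolExecutor


def target_next(token):
    # Hypothetical GPU target model: the "true" next token is token + 1.
    return token + 1


def draft(prefix, k):
    # Hypothetical CPU draft model: usually right, but deliberately wrong
    # whenever the previous token is 3 mod 4, so some drafts get rejected.
    out, last = [], prefix[-1]
    for _ in range(k):
        nxt = last + 1 if last % 4 != 3 else last + 2
        out.append(nxt)
        last = nxt
    return out


def verify(prefix, proposed):
    # Accept the longest matching run of draft tokens. On the first mismatch,
    # emit the target model's own token instead and report the rejection.
    accepted, last = [], prefix[-1]
    for tok in proposed:
        want = target_next(last)
        if tok != want:
            accepted.append(want)
            return accepted, False
        accepted.append(tok)
        last = tok
    return accepted, True


def generate(prompt, n_tokens, k=3):
    out = list(prompt)
    with ThreadPoolExecutor(max_workers=2) as pool:
        proposed = draft(out, k)
        while len(out) - len(prompt) < n_tokens:
            # "GPU" verifies the draft from the previous step...
            verify_job = pool.submit(verify, out, proposed)
            # ...while the "CPU" drafts ahead, assuming full acceptance.
            next_job = pool.submit(draft, out + proposed, k)
            accepted, all_ok = verify_job.result()
            out.extend(accepted)
            optimistic = next_job.result()
            # A rejection invalidates the optimistic draft, so re-draft.
            proposed = optimistic if all_ok else draft(out, k)
    return out[: len(prompt) + n_tokens]


print(generate([0], 10))
```

The re-draft after a rejection runs serially, which is where the pipeline loses its overlap; how often that happens depends on the acceptance rate.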

MVP

  • llama.cpp or GGML draft model.
  • vLLM plugin or proxy.
  • Adaptive draft length based on GPU verification time.
  • Acceptance-rate dashboard.
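
One possible shape for the last two items, as a minimal sketch: pick the draft length so that CPU drafting fits inside the most recent GPU verification window, and keep running counts for an acceptance-rate readout. `DraftController` and its fields are hypothetical names, and the per-token draft cost is assumed to be measured elsewhere.

```python
from dataclasses import dataclass


@dataclass
class DraftController:
    draft_ms_per_token: float  # assumed: measured CPU cost per draft token
    min_len: int = 1
    max_len: int = 8
    proposed: int = 0
    accepted: int = 0

    def next_length(self, last_verify_ms: float) -> int:
        # Draft as many tokens as fit in one GPU verification window,
        # clamped to [min_len, max_len].
        fit = int(last_verify_ms // self.draft_ms_per_token)
        return max(self.min_len, min(self.max_len, fit))

    def record(self, n_proposed: int, n_accepted: int) -> None:
        # Accumulate counts that a dashboard could read.
        self.proposed += n_proposed
        self.accepted += n_accepted

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0


ctl = DraftController(draft_ms_per_token=5.0)
k = ctl.next_length(last_verify_ms=22.0)  # 4 draft tokens fit in 22 ms
ctl.record(n_proposed=k, n_accepted=3)
print(k, ctl.acceptance_rate)
```

A real controller would probably also shrink the draft length when the acceptance rate drops, since long drafts that get rejected early waste CPU time.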

Risks

  • CPU draft may not keep up under realistic traffic.
  • Cross-device coordination overhead may eat gains.
  • Best draft models may still need GPU acceleration.
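
The first two risks can be put into a back-of-envelope estimate. The sketch below uses the standard expected-tokens formula for speculative decoding, (1 - α^(k+1)) / (1 - α), which assumes each draft token is accepted independently with probability α. Everything else is an assumption for illustration: the function names are hypothetical, one verification pass is assumed to cost about the same as one plain decode step, and CPU drafting is assumed to overlap perfectly with GPU verification, which is optimistic because a rejection forces a serial re-draft.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per verification pass with k draft tokens,
    # counting the one token the target model always contributes.
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, draft_ms_per_token, verify_ms, overhead_ms=0.0):
    # One pipelined step takes as long as the slower of the two devices,
    # plus any cross-device coordination overhead.
    step_ms = max(k * draft_ms_per_token, verify_ms) + overhead_ms
    # Baseline: plain decoding emits one token per verify_ms (assumption).
    return expected_tokens_per_step(alpha, k) * verify_ms / step_ms


# CPU keeps up: 3 tokens * 5 ms fits inside a 20 ms verification pass.
print(estimated_speedup(alpha=0.5, k=3, draft_ms_per_token=5.0, verify_ms=20.0))
# CPU falls behind: 3 tokens * 10 ms exceeds it, so the GPU waits.
print(estimated_speedup(alpha=0.5, k=3, draft_ms_per_token=10.0, verify_ms=20.0))
```

Under these assumptions the estimate falls below 1.0, meaning a net slowdown, once drafting time plus overhead outgrows the tokens gained per step.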