DraftOS

Pitch

GPU instances ship with many CPU cores that often sit underused during LLM serving. DraftOS runs a small draft model on those CPU cores while the GPU verifies previous draft tokens.

Why It Is Clever

Speculative decoding usually spends extra GPU memory on the draft model, both for its weights and for its KV cache. DraftOS tries to make the draft path “free” by moving it onto CPU resources the customer already rents.

Pipeline

sequenceDiagram
  participant CPU as CPU draft model
  participant GPU as GPU target model
  participant Out as Output stream
  CPU->>GPU: draft tokens for step t
  GPU->>GPU: verify tokens from step t-1
  GPU->>Out: accept/reject and emit
  CPU->>CPU: draft next tokens while GPU works
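
The overlap in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DraftOS implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the GPU target model and the CPU draft model, verification is greedy (a drafted token is accepted only if it matches the target's own choice), and no bonus token is emitted after a fully accepted draft so that the optimistic next draft stays valid.

```python
from concurrent.futures import ThreadPoolExecutor

def target_next(context):
    # Hypothetical stand-in for the GPU target model (greedy next token).
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    # Hypothetical stand-in for the CPU draft model: it agrees with the
    # target unless the target's token is divisible by 7.
    tok = target_next(context)
    return tok if tok % 7 else (tok + 1) % 50

def draft(context, k):
    # Autoregressively propose k tokens with the draft model.
    out = []
    for _ in range(k):
        out.append(draft_next(context + out))
    return out

def verify(context, proposed):
    # Accept the longest prefix matching the target; on the first mismatch,
    # emit the target's token instead and report a rejection.
    emitted = []
    for tok in proposed:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)
            return emitted, False
        emitted.append(tok)
    return emitted, True

def generate(prompt, max_new_tokens, k=4):
    context = list(prompt)
    proposed = draft(context, k)
    # One worker thread plays the role of the CPU draft path. Pure-Python
    # threads share the GIL, so this shows the control flow only; a real
    # draft model would run in native code or a separate process.
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        while len(context) - len(prompt) < max_new_tokens:
            # Draft the next step now, betting that this step is fully accepted.
            future = cpu_pool.submit(draft, context + proposed, k)
            emitted, all_accepted = verify(context, proposed)
            context += emitted
            if all_accepted:
                proposed = future.result()
            else:
                future.result()  # the bet failed: discard the stale draft
                proposed = draft(context, k)
    return context[len(prompt):][:max_new_tokens]
```

Because every emitted token is either confirmed or supplied by `target_next`, the output matches plain greedy decoding with the target model alone. The draft path only changes how many tokens each verification step yields.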

MVP

llama.cpp or GGML draft model.
vLLM plugin or proxy.
Adaptive draft length based on GPU verification time.
Acceptance-rate dashboard.
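
One possible shape for the adaptive draft length and the acceptance-rate metric is sketched below. The class name, its parameters, and the sizing rule are all assumptions made for illustration: the rule picks the largest draft length whose CPU drafting time still fits inside the smoothed GPU verification time, clamped to a fixed range.

```python
class DraftLengthController:
    """Hypothetical controller: size the draft so CPU drafting hides behind GPU verification."""

    def __init__(self, k_min=1, k_max=8, smoothing=0.2):
        self.k_min = k_min
        self.k_max = k_max
        self.smoothing = smoothing      # weight of the newest sample in the moving average
        self.verify_ms = None           # smoothed GPU verification time per step
        self.draft_ms_per_token = None  # smoothed CPU drafting time per token
        self.proposed = 0               # total drafted tokens
        self.accepted = 0               # total drafted tokens the target accepted

    def _ema(self, old, new):
        # Exponential moving average; the first sample initializes it.
        if old is None:
            return new
        return (1 - self.smoothing) * old + self.smoothing * new

    def record_step(self, verify_ms, draft_ms, k, accepted):
        # Feed in the timings and acceptance count of one draft/verify step.
        self.verify_ms = self._ema(self.verify_ms, verify_ms)
        self.draft_ms_per_token = self._ema(self.draft_ms_per_token, draft_ms / k)
        self.proposed += k
        self.accepted += accepted

    def next_k(self):
        # Before any measurement, start with the shortest draft.
        if self.verify_ms is None:
            return self.k_min
        if self.draft_ms_per_token <= 0:
            return self.k_max
        k = int(self.verify_ms // self.draft_ms_per_token)
        return max(self.k_min, min(self.k_max, k))

    def acceptance_rate(self):
        # The number a dashboard would plot over time.
        return self.accepted / self.proposed if self.proposed else 0.0
```

A serving loop would call `record_step` after each verification and `next_k` before each draft. A production rule would likely also shrink the draft when the acceptance rate falls, since long drafts that get rejected early waste CPU time.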

Risks

CPU draft may not keep up under realistic traffic.
Cross-device coordination overhead may eat gains.
Best draft models may still need GPU acceleration.
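
The first two risks can be put into rough numbers with a back-of-the-envelope model. The sketch below is an optimistic estimate under assumptions that are not from this pitch: each drafted token is accepted independently with probability `alpha`, each verification step also yields one token from the target (the standard speculative decoding result), drafting overlaps fully with verification, and rejected optimistic drafts cost nothing extra. All function and parameter names are hypothetical.

```python
def expected_tokens_per_step(alpha, k):
    # Expected tokens emitted per verification step for draft length k,
    # assuming independent per-token acceptance with probability alpha.
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, t_verify_ms, t_draft_ms, t_sync_ms, t_baseline_ms):
    # With full overlap, a step takes as long as the slower of GPU
    # verification and CPU drafting, plus cross-device coordination time.
    step_ms = max(t_verify_ms, k * t_draft_ms) + t_sync_ms
    # Baseline: the target model alone emits one token per t_baseline_ms.
    return expected_tokens_per_step(alpha, k) * t_baseline_ms / step_ms
```

Under this model the estimate drops below 1 when the CPU is slow and acceptance is low, for example `estimated_speedup(0.3, 4, 30, 20, 2, 30)`, which is the regime the risks above describe.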
