
SLO-Aware KV Cache Tiering

Author: Manoj, ML Engineer @ 7-Eleven
Core Idea

Paged KV cache systems already think in blocks. The next question is where each block should live.

Instead of treating all requests equally, placement follows each request's SLO class:

  • premium interactive requests keep KV in GPU HBM,
  • standard requests can spill to CPU DRAM,
  • batch jobs can spill further to SSD,
  • the scheduler prefetches blocks before each decode step.
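A minimal sketch of the placement rule in the list above, assuming three hypothetical SLO class names (`premium`, `standard`, `batch`) and a per-tier count of free blocks. The names `Tier`, `SPILL_FLOOR`, and `place_block` are illustrative and not taken from any existing serving system. Each class may use the fastest tier with free capacity, but never a tier deeper than its allowed floor.

```python
from enum import Enum


class Tier(Enum):
    HBM = 0   # GPU high-bandwidth memory (fastest)
    DRAM = 1  # host CPU memory
    SSD = 2   # local flash (slowest)


# Hypothetical SLO classes mapped to the deepest tier their KV blocks may spill to.
SPILL_FLOOR = {
    "premium": Tier.HBM,
    "standard": Tier.DRAM,
    "batch": Tier.SSD,
}


def place_block(slo_class: str, free_blocks: dict) -> Tier:
    """Pick the fastest tier with free capacity, never deeper than the class's floor.

    `free_blocks` maps Tier -> number of free blocks and is updated in place.
    """
    floor = SPILL_FLOOR[slo_class]
    for tier in Tier:  # Enum iteration follows definition order: HBM, DRAM, SSD
        if tier.value > floor.value:
            break
        if free_blocks.get(tier, 0) > 0:
            free_blocks[tier] -= 1
            return tier
    # A fuller policy would evict or preempt a lower-priority block here.
    raise MemoryError(f"no capacity for {slo_class!r} within its allowed tiers")
```

The sketch deliberately stops at placement. When a premium request finds HBM full it raises instead of demoting a standard or batch block, which is exactly the contention case the research questions below are about.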

Architecture

flowchart TD
  Scheduler[SLO-aware scheduler] --> Classify[Classify request tier]
  Classify --> HBM[HBM: premium hot blocks]
  Classify --> DRAM[CPU DRAM: warm standard blocks]
  Classify --> SSD[SSD: cold batch blocks]
  Scheduler --> Prefetch[Predict next sequences]
  Prefetch --> HBM
  HBM --> Decode[Decode step]
  DRAM --> Prefetch
  SSD --> Prefetch
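The prefetch path in the diagram can be sketched as a function that runs before each decode step. This is a simplified simulation under stated assumptions: block locations are plain strings, HBM capacity is counted in blocks, evicted blocks always drop to DRAM, and the names `prefetch_for_step`, `block_table`, and `location` are hypothetical.

```python
def prefetch_for_step(next_seqs, block_table, location, hbm_capacity):
    """Promote the KV blocks of the sequences scheduled next into HBM.

    next_seqs:    sequence ids the scheduler expects to decode next
    block_table:  seq id -> list of block ids
    location:     block id -> "HBM" | "DRAM" | "SSD" (updated in place)
    hbm_capacity: number of blocks that fit in HBM
    Returns the list of moves as (block_id, source, destination) tuples.
    """
    # Deduplicate while keeping order, in case sequences share blocks.
    needed = list(dict.fromkeys(b for s in next_seqs for b in block_table[s]))
    needed_set = set(needed)
    if len(needed) > hbm_capacity:
        raise ValueError("scheduled sequences do not fit in HBM")

    resident = [b for b, loc in location.items() if loc == "HBM"]
    # Resident blocks that the next step does not need are eviction candidates.
    victims = [b for b in resident if b not in needed_set]
    hbm_used = len(resident)

    moves = []
    for block in needed:
        if location[block] == "HBM":
            continue
        if hbm_used >= hbm_capacity:
            victim = victims.pop()
            location[victim] = "DRAM"  # assumption: demote one tier only
            moves.append((victim, "HBM", "DRAM"))
            hbm_used -= 1
        moves.append((block, location[block], "HBM"))
        location[block] = "HBM"
        hbm_used += 1
    return moves
```

In a real engine the returned moves would be issued as asynchronous copies that overlap with the current decode step. Here they are only recorded, so the policy can be tested without a GPU.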

Background

FlexGen and DeepSpeed ZeRO-Inference show that offloading model state to CPU memory and NVMe lets large models run on limited GPU memory. TensorRT-LLM KV cache reuse includes priority-based eviction and KV cache events. The research contribution here is connecting memory placement to explicit SLO contracts.

Research Questions

  • Which requests deserve HBM under contention?
  • Can prefetch hide DRAM/SSD latency?
  • Is proactive preemption better than waiting for OOM?
  • How does the policy affect P99 time to first token (TTFT) for premium users versus throughput for batch users?
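The prefetch question can be framed as back-of-the-envelope arithmetic: a transfer is hidden only if it finishes within the compute time it overlaps with. The sketch below assumes a standard attention layout (one K and one V vector per layer, per KV head, per token). The bandwidth and step-time values in the usage note are illustrative assumptions, not measurements.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: K and V, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def transfer_seconds(num_tokens, bytes_per_token, bandwidth_bytes_per_s):
    """Time to move the KV cache of `num_tokens` tokens over a link."""
    return num_tokens * bytes_per_token / bandwidth_bytes_per_s


def prefetch_is_hidden(num_tokens, bytes_per_token, bandwidth_bytes_per_s, step_seconds):
    """True if the transfer fits inside one decode step of `step_seconds`."""
    return transfer_seconds(num_tokens, bytes_per_token, bandwidth_bytes_per_s) <= step_seconds
```

As an assumed example, take a model with 32 layers, 32 KV heads, head dimension 128, and fp16 values: `kv_bytes_per_token(32, 32, 128)` gives 524288 bytes (0.5 MiB) per token. Moving a 2048-token context at an assumed effective 25 GB/s host-to-device bandwidth takes about 43 ms, which would not fit in an assumed 30 ms decode step, while a single 16-token block takes well under 1 ms. Under these assumptions, whole-sequence promotion has to start several steps ahead, whereas block-granular prefetch can plausibly be hidden.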

Novelty Opinion

Medium-high. The individual pieces exist, but an SLO-native policy that combines cache placement, prefetching, and admission control would be valuable.

Tenure And Complexity

  • Prototype: 4-8 weeks in a simulator.
  • vLLM-grade implementation: 3-5 months.
  • Complexity: Medium-high.
  • Main risk: migration overhead can erase scheduling gains.