SLO-Aware KV Cache Tiering

Core Idea

Paged KV cache systems already think in blocks. The next question is where each block should live.

Instead of treating all requests equally, placement follows each request's service tier:

premium interactive requests keep KV in GPU HBM,
standard requests can spill to CPU DRAM,
batch jobs can spill further to SSD,
the scheduler prefetches the blocks a decode step will need back into HBM before that step runs.
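
The tier rules above can be sketched as a small placement function. This is only an illustration under stated assumptions: the class names `premium`, `standard`, and `batch`, the `SPILL_FLOOR` table, and the `place_block` helper are hypothetical and not taken from any existing serving system. Each class tries the fastest tier first and spills downward only as far as its contract allows.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Memory tiers ordered from fastest to slowest."""
    HBM = 0
    DRAM = 1
    SSD = 2


# Hypothetical SLO classes mapped to the slowest tier their KV blocks may use.
SPILL_FLOOR = {
    "premium": Tier.HBM,    # never leaves GPU memory
    "standard": Tier.DRAM,  # may spill to CPU DRAM
    "batch": Tier.SSD,      # may spill all the way to SSD
}


def place_block(slo_class: str, free_blocks: dict[Tier, int]) -> Tier:
    """Pick the fastest tier with free space that the SLO class may use."""
    floor = SPILL_FLOOR[slo_class]
    for tier in Tier:  # iterates HBM, DRAM, SSD in definition order
        if tier > floor:
            break
        if free_blocks.get(tier, 0) > 0:
            return tier
    # Assumption: the caller reacts by evicting or preempting lower-priority work.
    raise MemoryError(f"no capacity for {slo_class!r} down to {floor.name}")
```

Raising on exhaustion rather than silently demoting a premium block keeps the SLO contract explicit: the scheduler has to decide what to evict or preempt.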

Architecture

flowchart TD
  Scheduler[SLO-aware scheduler] --> Classify[Classify request tier]
  Classify --> HBM[HBM: premium hot blocks]
  Classify --> DRAM[CPU DRAM: warm standard blocks]
  Classify --> SSD[SSD: cold batch blocks]
  Scheduler --> Prefetch[Predict next sequences]
  Prefetch --> HBM
  HBM --> Decode[Decode step]
  DRAM --> Prefetch
  SSD --> Prefetch
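
The prefetch path in the flowchart could look roughly like the sketch below. It assumes the scheduler already has a list of sequence IDs it expects to run next, in priority order. The `Block` and `Prefetcher` types are hypothetical, and changing the `tier` field stands in for what would be an asynchronous DRAM-to-HBM or SSD-to-HBM copy in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    seq_id: int
    tier: str  # "HBM", "DRAM", or "SSD"


@dataclass
class Prefetcher:
    """Promotes blocks of soon-to-run sequences into HBM (hypothetical design)."""
    hbm_capacity: int  # capacity measured in blocks
    blocks: dict[int, Block] = field(default_factory=dict)

    def hbm_used(self) -> int:
        return sum(1 for b in self.blocks.values() if b.tier == "HBM")

    def prefetch(self, next_seq_ids: list[int]) -> list[int]:
        """Move blocks of the predicted sequences to HBM, in priority order.

        Returns the IDs of promoted blocks. Stops once HBM is full; a fuller
        policy would evict colder blocks instead of giving up.
        """
        promoted = []
        free = self.hbm_capacity - self.hbm_used()
        for seq_id in next_seq_ids:
            for block in self.blocks.values():
                if block.seq_id != seq_id or block.tier == "HBM":
                    continue
                if free == 0:
                    return promoted
                block.tier = "HBM"  # placeholder for an async host-to-device copy
                promoted.append(block.block_id)
                free -= 1
        return promoted
```

Ordering `next_seq_ids` by SLO class is one way to make sure premium sequences claim free HBM slots before standard or batch ones.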

Background

FlexGen and DeepSpeed ZeRO-Inference show the value of offloading model state from GPU memory to CPU DRAM and NVMe storage. TensorRT-LLM's KV cache reuse feature includes priority-based eviction and KV cache events. The research contribution here is connecting memory placement to explicit SLO contracts.

Research Questions
#

Which requests deserve HBM under contention?
Can prefetch hide DRAM/SSD latency?
Is proactive preemption better than waiting for OOM?
How does the policy affect P99 TTFT for premium users vs throughput for batch users?
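
A back-of-envelope check for the prefetch question: a transfer is hidden if it completes within one decode step. The sketch below computes the size of a KV block and the time to move it over a link. The helper names and every number in the usage note are illustrative assumptions, not measurements.

```python
def kv_block_bytes(layers: int, kv_heads: int, head_dim: int,
                   block_tokens: int, dtype_bytes: int = 2) -> int:
    """Bytes for one block across all layers; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * block_tokens * dtype_bytes


def transfer_ms(num_bytes: int, bandwidth_gb_s: float,
                latency_us: float = 0.0) -> float:
    """Milliseconds to move num_bytes: fixed latency plus size over bandwidth."""
    return latency_us / 1e3 + num_bytes / (bandwidth_gb_s * 1e9) * 1e3


def prefetch_is_hidden(num_blocks: int, block_bytes: int, bandwidth_gb_s: float,
                       decode_step_ms: float, latency_us: float = 0.0) -> bool:
    """True if all blocks arrive within a single decode step."""
    total = transfer_ms(num_blocks * block_bytes, bandwidth_gb_s, latency_us)
    return total <= decode_step_ms
```

For a hypothetical model with 32 layers, 8 KV heads, head dimension 128, 16-token blocks, and 2-byte elements, `kv_block_bytes(32, 8, 128, 16)` gives 2,097,152 bytes (2 MiB) per block. Plugging in an assumed link bandwidth and decode step time then shows how many such blocks can be moved per step before the transfer becomes visible.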

Novelty Opinion
#

Medium-high. The individual pieces exist, but an SLO-native policy that combines cache placement, prefetching, and admission control would be valuable.

Tenure And Complexity

Prototype: 4-8 weeks in a simulator.
vLLM-grade implementation: 3-5 months.
Complexity: Medium-high.
Main risk: migration overhead can erase scheduling gains.
