Core Idea
Paged KV cache systems already think in blocks. The next question is where each block should live.
Instead of treating all requests equally, placement follows the request's service tier:
- premium interactive requests keep their KV blocks in GPU HBM,
- standard requests can spill to CPU DRAM,
- batch jobs can spill further to SSD,
- the scheduler prefetches spilled blocks back into HBM before the decode step that needs them.
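The tier rules above can be sketched as a small placement function. This is a minimal illustration, not an existing API: the names `Tier`, `Device`, `SPILL_FLOOR`, and `place_block` are hypothetical, and the sketch assumes any tier may use a faster device when it has free blocks.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PREMIUM = "premium"
    STANDARD = "standard"
    BATCH = "batch"


class Device(Enum):
    # Declared fastest to slowest; iteration follows this order.
    HBM = 0
    DRAM = 1
    SSD = 2


# Slowest device each tier is allowed to spill to (assumption taken from the
# tier list above: premium stays in HBM, standard may reach DRAM, batch SSD).
SPILL_FLOOR = {
    Tier.PREMIUM: Device.HBM,
    Tier.STANDARD: Device.DRAM,
    Tier.BATCH: Device.SSD,
}


def place_block(tier: Tier, free_blocks: dict[Device, int]) -> Device:
    """Return the fastest permitted device with a free block and reserve it."""
    floor = SPILL_FLOOR[tier]
    for device in Device:
        if device.value > floor.value:
            break  # this tier may not spill any further
        if free_blocks.get(device, 0) > 0:
            free_blocks[device] -= 1
            return device
    # No permitted device has room: the caller must evict or preempt.
    raise MemoryError(f"no capacity for {tier.value} within its allowed devices")
```

Here a `MemoryError` for a premium request is the signal that the scheduler has to demote someone else's blocks out of HBM. Which blocks to demote is the policy question this idea is about, and the sketch deliberately leaves it open.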
Architecture
flowchart TD
    Scheduler[SLO-aware scheduler] --> Classify[Classify request tier]
    Classify --> HBM[HBM: premium hot blocks]
    Classify --> DRAM[CPU DRAM: warm standard blocks]
    Classify --> SSD[SSD: cold batch blocks]
    Scheduler --> Prefetch[Predict next sequences]
    Prefetch --> HBM
    HBM --> Decode[Decode step]
    DRAM --> Prefetch
    SSD --> Prefetch
Background
FlexGen and DeepSpeed ZeRO-Inference show the value of offloading model state from GPU memory to CPU DRAM and to disk or NVMe storage. TensorRT-LLM's KV cache reuse feature includes priority-based eviction and KV cache events. The research contribution here is connecting memory placement to explicit SLO contracts.
Research Questions
- Which requests deserve HBM under contention?
- Can prefetch hide DRAM/SSD latency?
- Is proactive preemption better than waiting for OOM?
- How does the policy affect P99 TTFT for premium users vs throughput for batch users?
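For the prefetch question, a first-order check is whether the transfer finishes within the decode time the scheduler can look ahead. The sketch below is a back-of-the-envelope model only: the function names are hypothetical, and it assumes a constant link bandwidth, a fixed per-transfer latency, and a transfer that fully overlaps with compute.

```python
import math


def transfer_time_s(num_blocks: int, block_bytes: int,
                    bandwidth_bytes_per_s: float,
                    fixed_latency_s: float = 0.0) -> float:
    """Time to move num_blocks KV blocks over a link of constant bandwidth."""
    return fixed_latency_s + num_blocks * block_bytes / bandwidth_bytes_per_s


def min_lookahead_steps(num_blocks: int, block_bytes: int,
                        bandwidth_bytes_per_s: float, step_time_s: float,
                        fixed_latency_s: float = 0.0) -> int:
    """Decode steps of lead time needed so the prefetch completes in time."""
    t = transfer_time_s(num_blocks, block_bytes,
                        bandwidth_bytes_per_s, fixed_latency_s)
    return max(1, math.ceil(t / step_time_s))


def prefetch_is_hidden(num_blocks: int, block_bytes: int,
                       bandwidth_bytes_per_s: float, step_time_s: float,
                       lookahead_steps: int,
                       fixed_latency_s: float = 0.0) -> bool:
    """True if the transfer fits inside the available lookahead window."""
    t = transfer_time_s(num_blocks, block_bytes,
                        bandwidth_bytes_per_s, fixed_latency_s)
    return t <= lookahead_steps * step_time_s
```

With illustrative numbers (not measurements) of 64 blocks of 2 MiB each and a 30 ms decode step, a 20 GB/s link needs one step of lead time, while a 3 GB/s link needs two. In this model, SSD-resident blocks are hidden only if the scheduler can predict which sequences run next at least that far ahead.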
Novelty Opinion
Medium-high. The individual pieces exist, but an SLO-native policy that combines cache placement, prefetching, and admission control would be valuable.
Tenure And Complexity
- Prototype: 4-8 weeks in a simulator.
- vLLM-grade implementation: 3-5 months.
- Complexity: Medium-high.
- Main risk: migration overhead can erase scheduling gains.
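The main risk can be made concrete with a break-even estimate: demoting blocks pays off only if the stall it avoids exceeds the time spent moving data. The sketch below is a pessimistic model under stated assumptions: `migration_net_gain_s` is a hypothetical helper, transfers are not overlapped with compute, and a demoted block is later reloaded (a round trip) unless `round_trip=False`.

```python
def migration_net_gain_s(stall_avoided_s: float, blocks_moved: int,
                         block_bytes: int, bandwidth_bytes_per_s: float,
                         round_trip: bool = True) -> float:
    """Seconds gained (positive) or lost (negative) by migrating blocks.

    stall_avoided_s is the latency the higher-priority request would
    otherwise have paid; the cost is the un-overlapped transfer time.
    """
    one_way_s = blocks_moved * block_bytes / bandwidth_bytes_per_s
    cost_s = 2 * one_way_s if round_trip else one_way_s
    return stall_avoided_s - cost_s


def migration_pays_off(stall_avoided_s: float, blocks_moved: int,
                       block_bytes: int, bandwidth_bytes_per_s: float,
                       round_trip: bool = True) -> bool:
    """True when the avoided stall outweighs the migration cost."""
    return migration_net_gain_s(stall_avoided_s, blocks_moved, block_bytes,
                                bandwidth_bytes_per_s, round_trip) > 0
```

A simulator prototype could log this quantity per migration to see how often the policy moves data for no net benefit.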

