Temporal TurboQuant KV Tiering

Table of Contents

Core Idea
#

Not every token in the KV cache deserves the same number of bits.

Recent tokens often matter more for local coherence. Very old tokens may still matter, but many of them contribute weakly or episodically. The workshop idea was to make KV precision a function of token age:

last N tokens: FP16 or FP8,
middle tokens: INT4,
old tokens: INT2,
optional sink tokens: protected at higher precision.

TurboQuant adds a second ingredient: rotate the vectors before quantization so low-bit representations preserve inner products better.

Architecture
#

flowchart TD
  KV[KV cache blocks] --> Age[Age / recency classifier]
  Age --> Recent[Recent tier: FP16 or FP8]
  Age --> Warm[Warm tier: TurboQuant INT4]
  Age --> Cold[Cold tier: TurboQuant INT2]
  Recent --> Attention[Attention kernel]
  Warm --> Dequant[Mixed precision dequant]
  Cold --> Dequant
  Dequant --> Attention

Background
#

TurboQuant reports strong KV cache quantization quality using rotation and quantized Johnson-Lindenstrauss correction. StreamingLLM shows that attention sinks and recent-window retention are powerful for streaming settings. The gap is the combination: tiered temporal precision plus rotation-aware low-bit storage.

Research Questions
#

Are fixed recency boundaries enough, or should boundaries be content-aware?
Does TurboQuant preserve attention score quality better for old tokens than uniform INT4?
Should keys and values have different tier policies?
Can the kernel avoid losing all gains to mixed-precision dequant overhead?

Experiment Plan
#

Start with a toy PyTorch attention implementation before touching vLLM:

Implement full-precision baseline.
Add temporal tiers without TurboQuant.
Add TurboQuant per tier.
Add attention-sink protection.
Evaluate on LongBench, RULER, needle-in-a-haystack, and summarization.

The clean ablation table:

Method	Memory	TTFT	ITL	Quality
FP16 KV	baseline	baseline	baseline	baseline
Uniform INT4	lower	faster	faster	possible loss
Temporal tiers	lower	faster	faster	unknown
TurboQuant only	lower	faster	faster	expected strong
Temporal TurboQuant	lowest	target	target	main claim

Novelty Opinion
#

Medium-high to high. Temporal retention exists. KV quantization exists. The combined system, especially with a practical mixed-tier attention kernel, is a real systems contribution.

Tenure And Complexity
#

Prototype: 2-4 weeks.
Kernel-level version: 2-4 months.
Complexity: Medium-high.
Main risk: the attention kernel becomes slower because it must handle too many precision formats.

Core Idea#

Architecture#

Background#

Research Questions#

Experiment Plan#

Novelty Opinion#

Tenure And Complexity#