
Research Topics
The workshop discussion kept circling one bottleneck: inference is becoming a memory, scheduling, and reuse problem as much as a modeling problem.
Below is the organized research map. The individual pages go deeper on the highest-leverage ideas. The broader list is kept here so the whole brainstorm remains searchable.
Priority Research Tracks
| Track | Core Question | Novelty | Complexity | Time Horizon |
|---|---|---|---|---|
| Position-Invariant Document KV Cache | Can we cache document KV independent of prompt position? | High | High | 3-6 months for paper prototype |
| Temporal TurboQuant KV Tiering | Can old tokens be stored at lower precision than recent tokens? | High | Medium-high | 2-4 months |
| Roofline-Adaptive Inference Scheduler | Can the scheduler chase the GPU ridge point in real time? | High | Medium | 2-3 months |
| Speculative Prefill | Can a draft model precompute approximate KV for long prompts? | High | High | 3-5 months |
| Quantization Divergence Hallucination Signal | Can FP8/INT4 vs FP16 logit drift signal uncertainty? | High | Medium | 2-3 months |
| Online EAGLE Draft Learning | Can accepted/rejected draft tokens train the draft head online? | High | Medium | 2-4 months |
| SLO-Aware KV Cache Tiering | Can premium users get HBM while batch jobs spill to DRAM/SSD? | Medium-high | Medium-high | 3-5 months |
| Attention Head Similarity Pruning | Can redundant heads be pruned per input during inference? | Medium | Medium | 1-2 months |
| Unlearning Layer in Attention | Can an attention mask adapter weaken specific associations? | Medium-high | Medium-high | 3-6 months |
| Hardware-Aware Inference CPU Ideas | What software layer is needed if AI CPUs become real? | Medium-high | High | 6-12 months |
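
The "ridge point" in the Roofline-Adaptive Inference Scheduler row is the arithmetic intensity at which a kernel moves from memory-bound to compute-bound: peak FLOP/s divided by memory bandwidth. The following is a minimal sketch, assuming a deliberately simplified decode model in which weight reads dominate memory traffic (KV cache and activation traffic are ignored) and each parameter contributes one multiply-add per sequence. The hardware numbers are illustrative placeholders, not measurements of any specific GPU, and all function names are hypothetical.

```python
import math


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the roofline bends."""
    return peak_flops / mem_bandwidth


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOP per byte for one decode step under the weights-only assumption.

    Each weight is read once per step and used for one multiply-add
    (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def min_compute_bound_batch(peak_flops: float, mem_bandwidth: float,
                            bytes_per_param: float) -> int:
    """Smallest batch size whose decode intensity reaches the ridge point."""
    ridge = ridge_point(peak_flops, mem_bandwidth)
    return math.ceil(ridge * bytes_per_param / 2.0)


# Placeholder hardware: 1e15 FLOP/s peak, 2e12 byte/s memory bandwidth.
PEAK, BW = 1e15, 2e12
print(ridge_point(PEAK, BW))                   # ridge in FLOP/byte
print(min_compute_bound_batch(PEAK, BW, 2.0))  # 16-bit weights
print(min_compute_bound_batch(PEAK, BW, 1.0))  # 8-bit weights
```

Under these assumptions, halving the bytes per weight halves the batch size needed to reach the ridge. A real scheduler would also have to account for KV cache traffic, which grows with context length and pulls the effective intensity back down.
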
Full Idea Inventory
Workshop-Derived Ideas
- Temporal / distance-aware dynamic quantization of KV cache - keep recent tokens in high precision, compress old tokens more aggressively.
- TurboQuant for pre-softmax attention scores - test whether rotation-based quantization reduces score-matrix outliers.
- TurboQuant with LoRA fine-tuning - study whether task adapters can compensate for inference-time KV quantization error.
- TurboQuant plus temporal compression - combine recency tiers with rotation-aware quantization.
- Input-adaptive attention head pruning - prune heads that become redundant on a specific prompt.
- Unlearning layer inside MHA - use an inference-time or lightly trained mask to weaken token associations.
- KV sharing via prefix hashing - already partly deployed as prefix caching, but position-awareness still limits reuse.
- KV compression for video models - natural fit because frame time maps to token distance.
- Parallel transformer blocks vs sequential depth - useful but crowded architecture search territory.
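
The temporal / distance-aware quantization idea at the top of this list can be sketched as a mapping from token age to bit width, followed by a quantize-dequantize pass per token. This is a rough illustration only: plain symmetric uniform quantization stands in for TurboQuant's rotation-based scheme, the tier boundaries and bit widths are untuned placeholders, and every name below is hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical recency tiers: (exclusive max age in tokens, bits).
# Ages beyond the last bound fall into the final tier.
TIERS: Tuple[Tuple[int, int], ...] = ((128, 16), (1024, 8), (10**9, 4))


def bits_for_age(age: int, tiers: Sequence[Tuple[int, int]] = TIERS) -> int:
    """Pick a bit width from how many tokens ago this entry was written."""
    for max_age, bits in tiers:
        if age < max_age:
            return bits
    return tiers[-1][1]


def fake_quantize(values: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantize-dequantize with one scale per vector."""
    if bits >= 16:
        return list(values)  # the 16-bit tier is treated as lossless here
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    levels = 2 ** (bits - 1) - 1      # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def tier_cache(cache: List[List[float]]) -> List[List[float]]:
    """cache[i] is the KV vector of token i; the last entry is the newest."""
    newest = len(cache) - 1
    return [fake_quantize(vec, bits_for_age(newest - i))
            for i, vec in enumerate(cache)]


cache = [[0.3, -1.0, 0.26] for _ in range(2000)]
tiered = tier_cache(cache)
print(tiered[-1], tiered[0])  # newest entry vs oldest entry
```

The quantize-dequantize form makes it easy to measure accuracy impact before building real packed storage. An actual implementation would store the integer codes and scales rather than dequantized floats.
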
Book-Derived Ideas
- Roofline-aware adaptive batching.
- Cross-layer KV cache aliasing by cosine similarity.
- Speculative prefill with a draft model.
- SLO-differentiated KV cache tiering.
- Attention-sink sliding windows with RoPE re-anchoring.
- Different quantization for prefill and decode GPU pools.
- CUDA graph capture for padding-free packed variable-length batches.
- Subliminal preference transfer auditing for distilled models.
- CPU draft plus GPU verify speculative decoding.
- Position-invariant prefix caching via RoPE-agnostic keys.
- Continuous batching with SLO priority preemption.
- FlashAttention-style tiling for Mamba / SSM selective scan.
- Ridge-chasing speculative decoding window size.
- Multi-turn KV persistence with forgetting curves.
- Quantization error as an uncertainty / hallucination signal.
- MoE-style attention head routing.
- Thermal-budget-aware edge inference.
- Shared document KV cache for RAG.
- Online learning for EAGLE draft heads.
- Disaggregated world models for embodied AI.
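
The position-invariant prefix caching and shared document KV cache items both hinge on how cache keys are built. As a small sketch of the lookup side only, the code below contrasts chained prefix keys (each block's key depends on everything before it) with content-only keys (each block is hashed on its own tokens). The block size, the key format, and all function names are assumptions made for illustration and are not taken from any particular serving system. The example also assumes the document chunk starts on a block boundary.

```python
import hashlib
from typing import List, Sequence

BLOCK = 4  # tokens per cache block; placeholder value


def _digest(parent: str, tokens: Sequence[int]) -> str:
    payload = parent + "|" + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


def prefix_keys(tokens: Sequence[int]) -> List[str]:
    """Chained keys: each full block's key depends on all earlier blocks."""
    keys: List[str] = []
    parent = ""
    for start in range(0, len(tokens) - BLOCK + 1, BLOCK):
        parent = _digest(parent, tokens[start:start + BLOCK])
        keys.append(parent)
    return keys


def content_keys(tokens: Sequence[int]) -> List[str]:
    """Position-agnostic keys: each full block is hashed on its own tokens."""
    return [_digest("", tokens[start:start + BLOCK])
            for start in range(0, len(tokens) - BLOCK + 1, BLOCK)]


doc = [11, 12, 13, 14, 15, 16, 17, 18]  # a retrieved document chunk
prompt_a = [1, 2, 3, 4] + doc           # same document, different preamble
prompt_b = [5, 6, 7, 8] + doc

shared_prefix = set(prefix_keys(prompt_a)) & set(prefix_keys(prompt_b))
shared_content = set(content_keys(prompt_a)) & set(content_keys(prompt_b))
print(len(shared_prefix), len(shared_content))
```

With chained keys, a different preamble invalidates every downstream block even though the document tokens are identical. Content-only keys find the match, but finding the entry is not the same as being able to reuse it: the cached keys still carry positional encoding and were computed without attention to the new preamble, which is the open problem these tracks target.
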
Mentor / Hardware Ideas
The mentor material and the Bjarke Roune document push the same theme down to silicon:
- AI CPUs with systolic arrays and large SRAM.
- Compiler backend studios for new accelerators.
- HBM minimization planners.
- SSD-backed long memory.
- DMA compression engines that combine lossy quantization with lossless entropy coding.
- 1:2 and 2:4 sparsity toolkits.
- Memory hierarchy explorers.
- MoE token-router and network-topology co-design.
- Tiled software pipeline libraries.
- Tokens-per-dollar observability.
- AI chip co-design search.
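
For the 1:2 and 2:4 sparsity item, an N:M pattern keeps at most N nonzero values in every group of M consecutive weights. The sketch below uses magnitude-based selection, which is one common heuristic and is assumed here rather than taken from the source material; the function name is hypothetical.

```python
from typing import List, Sequence


def prune_n_of_m(weights: Sequence[float], n: int, m: int) -> List[float]:
    """Zero all but the n largest-magnitude values in each group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(weights), m):
        group = list(weights[start:start + m])
        # Indices of the n largest magnitudes; the stable sort resolves
        # ties in favor of the earlier index.
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


row = [0.9, 0.8, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
print(prune_n_of_m(row, 2, 4))  # 2:4 pattern
print(prune_n_of_m(row, 1, 2))  # 1:2 pattern
```

Both patterns remove half of the weights, but they make different choices: 2:4 may keep two adjacent large values that 1:2 would be forced to split. A toolkit along these lines would need to compare the accuracy cost of each pattern per layer.
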
Background Reading
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- Efficient Streaming Language Models with Attention Sinks
- FlashAttention
- vLLM automatic prefix caching
- PagedAttention / vLLM paper
- Cache-Craft: Managing Chunk-Caches for Efficient RAG
- RAGCache
- TurboRAG
- Orca: iteration-level scheduling
- EAGLE speculative decoding
- LoRA
- FlexGen
- Mamba