Research Topics

The workshop discussion kept circling one bottleneck: inference is becoming a memory, scheduling, and reuse problem as much as a modeling problem.

Below is the organized research map. The individual pages go deeper on the highest-leverage ideas. The broader list is kept here so the whole brainstorm remains searchable.

Priority Research Tracks

Track	Core Question	Novelty	Complexity	Time Horizon
Position-Invariant Document KV Cache	Can we cache document KV independent of prompt position?	High	High	3-6 months for paper prototype
Temporal TurboQuant KV Tiering	Can old tokens be stored at lower precision than recent tokens?	High	Medium-high	2-4 months
Roofline-Adaptive Inference Scheduler	Can the scheduler chase the GPU ridge point in real time?	High	Medium	2-3 months
Speculative Prefill	Can a draft model precompute approximate KV for long prompts?	High	High	3-5 months
Quantization Divergence Hallucination Signal	Can FP8/INT4 vs FP16 logit drift signal uncertainty?	High	Medium	2-3 months
Online EAGLE Draft Learning	Can accepted/rejected draft tokens train the draft head online?	High	Medium	2-4 months
SLO-Aware KV Cache Tiering	Can premium users get HBM while batch jobs spill to DRAM/SSD?	Medium-high	Medium-high	3-5 months
Attention Head Similarity Pruning	Can redundant heads be pruned per input during inference?	Medium	Medium	1-2 months
Unlearning Layer in Attention	Can an attention mask adapter weaken specific associations?	Medium-high	Medium-high	3-6 months
Hardware-Aware Inference CPU Ideas	What software layer is needed if AI CPUs become real?	Medium-high	High	6-12 months

Full Idea Inventory

Workshop-Derived Ideas

Temporal / distance-aware dynamic quantization of KV cache - keep recent tokens in high precision, compress old tokens more aggressively.
TurboQuant for pre-softmax attention scores - test whether rotation-based quantization reduces score-matrix outliers.
TurboQuant with LoRA fine-tuning - study whether task adapters can compensate for inference-time KV quantization error.
TurboQuant plus temporal compression - combine recency tiers with rotation-aware quantization.
Input-adaptive attention head pruning - prune heads that become redundant on a specific prompt.
Unlearning layer inside MHA - use an inference-time or lightly trained mask to weaken token associations.
KV sharing via prefix hashing - already partly deployed as prefix caching, but position-awareness still limits reuse.
KV compression for video models - natural fit because frame time maps to token distance.
Parallel transformer blocks vs sequential depth - useful but crowded architecture search territory.
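
The first idea in this list, distance-aware KV quantization, can be sketched in a few lines. This is an illustration only, not the workshop's method: the tier boundaries (`recent_window`, `mid_window`), the 8-bit and 4-bit levels, and the plain symmetric round-to-grid quantizer are all assumptions chosen for readability. A TurboQuant-style variant would apply a rotation before quantizing, which is not shown here.

```python
from typing import List


def quantize_symmetric(values: List[float], bits: int) -> List[float]:
    """Fake-quantize: snap values to a symmetric integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def tier_bits(distance: int, recent_window: int = 4, mid_window: int = 16) -> int:
    """Hypothetical policy: map token distance to a bit width (0 = keep as-is)."""
    if distance < recent_window:
        return 0
    if distance < mid_window:
        return 8
    return 4


def compress_kv(cache: List[List[float]],
                recent_window: int = 4,
                mid_window: int = 16) -> List[List[float]]:
    """Return a copy of the cache with older entries stored at lower precision.

    cache[i] is the key (or value) vector of token i; the last entry is the newest.
    """
    newest = len(cache) - 1
    out = []
    for pos, vec in enumerate(cache):
        bits = tier_bits(newest - pos, recent_window, mid_window)
        out.append(list(vec) if bits == 0 else quantize_symmetric(vec, bits))
    return out
```

In this sketch the newest `recent_window` tokens are returned untouched, and the rounding error of older tokens is bounded by half a quantization step per element. A real experiment would measure the effect on attention output rather than on the raw vectors.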

Book-Derived Ideas

Roofline-aware adaptive batching.
Cross-layer KV cache aliasing by cosine similarity.
Speculative prefill with a draft model.
SLO-differentiated KV cache tiering.
Attention-sink sliding windows with RoPE re-anchoring.
Different quantization for prefill and decode GPU pools.
CUDA graph capture for padding-free packed variable-length batches.
Subliminal preference transfer auditing for distilled models.
CPU draft plus GPU verify speculative decoding.
Position-invariant prefix caching via RoPE-agnostic keys.
Continuous batching with SLO priority preemption.
FlashAttention-style tiling for Mamba / SSM selective scan.
Ridge-chasing speculative decoding window size.
Multi-turn KV persistence with forgetting curves.
Quantization error as an uncertainty / hallucination signal.
MoE-style attention head routing.
Thermal-budget-aware edge inference.
Shared document KV cache for RAG.
Online learning for EAGLE draft heads.
Disaggregated world models for embodied AI.
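
Several items above (roofline-aware adaptive batching, the ridge-chasing speculative window) lean on the roofline model, where the ridge point is peak compute divided by memory bandwidth and a kernel below that arithmetic intensity is memory-bound. The sketch below is a rough back-of-the-envelope helper, not a scheduler. It assumes decode traffic is dominated by reading the weights once per step and that each parameter costs 2 FLOPs per token; it ignores KV cache and activation traffic. The function names and the hardware numbers in the usage note are hypothetical.

```python
import math


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory limits meet."""
    return peak_flops / mem_bandwidth


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """Approximate FLOP/byte of one decode step under the weights-only assumption.

    Each parameter contributes 2 FLOPs per sequence in the batch, while its
    bytes are read once per step and shared by the whole batch.
    """
    return 2.0 * batch_size / bytes_per_param


def min_compute_bound_batch(peak_flops: float,
                            mem_bandwidth: float,
                            bytes_per_param: float) -> int:
    """Smallest batch size whose decode intensity reaches the ridge point."""
    ridge = ridge_point(peak_flops, mem_bandwidth)
    return math.ceil(ridge * bytes_per_param / 2.0)
```

For an illustrative accelerator with 1e15 FLOP/s and 2e12 byte/s, `ridge_point` gives 500 FLOP/byte, so under these assumptions FP16 weights (2 bytes per parameter) need a batch of about 500 to stop being memory-bound, and 1-byte weights need about 250. A scheduler that chases the ridge would have to fold KV cache reads into the intensity estimate, since those grow with context length.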

Mentor / Hardware Ideas

The mentor material and the Bjarke Roune document push the same theme down to silicon:

AI CPUs with systolic arrays and large SRAM.
Compiler backend studios for new accelerators.
HBM minimization planners.
SSD-backed long memory.
DMA compression engines that combine lossy quantization with lossless entropy coding.
1:2 and 2:4 sparsity toolkits.
Memory hierarchy explorers.
MoE token-router and network-topology co-design.
Tiled software pipeline libraries.
Tokens-per-dollar observability.
AI chip co-design search.
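
For the 1:2 and 2:4 sparsity item, a minimal sketch of what such a toolkit would do at its core: in every group of m consecutive weights, keep n and zero the rest. Choosing the survivors by largest magnitude is an assumption made here for illustration; the source does not specify a selection rule, and the function name `prune_n_of_m` is hypothetical.

```python
from typing import List


def prune_n_of_m(weights: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude weights in each group of m, zero the others."""
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of the group size m")
    out: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes; ties keep the earlier element.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out
```

Calling `prune_n_of_m(row)` gives the 2:4 pattern and `prune_n_of_m(row, n=1, m=2)` gives 1:2. A practical toolkit would also pack the surviving values and their position metadata into the layout the target hardware expects, which this sketch leaves out.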
