<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference-Engineering on jonam'Log</title><link>https://www.jonam.io/categories/inference-engineering/</link><description>Recent content in Inference-Engineering on jonam'Log</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>&amp;copy; 2026 Manoj. All Rights Reserved.</copyright><lastBuildDate>Mon, 18 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jonam.io/categories/inference-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>DocVault</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/docvault/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/docvault/</guid><description>Compute any document&amp;rsquo;s context once, serve it to every user forever.</description></item><item><title>Position-Invariant Document KV Cache</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/position-invariant-document-kv-cache/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/position-invariant-document-kv-cache/</guid><description>Can document KV states be cached independent of prompt position and reused across RAG queries?</description></item><item><title>PrefillX</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/prefillx/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/prefillx/</guid><description>Cut TTFT for long-context document applications by precomputing and repairing reusable KV states.</description></item><item><title>Temporal TurboQuant KV Tiering</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/temporal-turboquant-kv-tiering/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/temporal-turboquant-kv-tiering/</guid><description>Recent tokens stay high precision, older tokens degrade to INT4 or INT2, and TurboQuant makes the low-bit tiers less painful.</description></item><item><title>InferGrid</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/infergrid/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/infergrid/</guid><description>Measure why your GPU bill is high, then tune batching, speculation, and quantization automatically.</description></item><item><title>Roofline-Adaptive Inference Scheduler</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/roofline-adaptive-inference-scheduler/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/roofline-adaptive-inference-scheduler/</guid><description>Move from static max_num_seqs to a feedback loop that chases the hardware ridge point.</description></item><item><title>DraftOS</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/draftos/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/draftos/</guid><description>Use idle CPU cores on GPU instances to draft tokens while the GPU verifies.</description></item><item><title>Speculative Prefill</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/speculative-prefill/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/speculative-prefill/</guid><description>Speculative decoding is common; this asks whether speculation can reduce long-prompt prefill latency.</description></item><item><title>Quantization Divergence As Hallucination Signal</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/quantization-divergence-hallucination-signal/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/quantization-divergence-hallucination-signal/</guid><description>If FP8/INT4 and FP16 disagree sharply, the model may be in a fragile region.</description></item><item><title>SLOGuard</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/sloguard/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/sloguard/</guid><description>Protect enterprise P99 latency without buying more GPUs.</description></item><item><title>HaloscoreAI</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/haloscoreai/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/haloscoreai/</guid><description>A low-latency uncertainty signal for regulated AI applications.</description></item><item><title>Online EAGLE Draft Learning</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/online-eagle-draft-learning/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/online-eagle-draft-learning/</guid><description>Speculative decoding throws away a useful supervision signal: which draft tokens were accepted.</description></item><item><title>DistillAudit</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/distillaudit/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/distillaudit/</guid><description>Detect hidden preference transfer from teacher models to students.</description></item><item><title>SLO-Aware KV Cache Tiering</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/slo-aware-kv-cache-tiering/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/slo-aware-kv-cache-tiering/</guid><description>Premium users get hot KV blocks; batch users spill to cheaper memory tiers.</description></item><item><title>Attention Head Similarity Pruning</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/attention-head-similarity-pruning/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/attention-head-similarity-pruning/</guid><description>Measure cross-head similarity on a prompt and skip heads that are redundant for that input.</description></item><item><title>ConvoCache</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/convocache/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/convocache/</guid><description>Store and rehydrate the conversation state that actually mattered.</description></item><item><title>SpecDraft Cloud</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/specdraft-cloud/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/specdraft-cloud/</guid><description>A draft model service that learns from accepted and rejected tokens.</description></item><item><title>Unlearning Layer In Attention</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/unlearning-layer-in-attention/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/unlearning-layer-in-attention/</guid><description>Can we attenuate undesirable token associations inside attention without full retraining?</description></item><item><title>Hardware-Aware AI CPU Ideas</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/hardware-aware-inference-cpu-ideas/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/hardware-aware-inference-cpu-ideas/</guid><description>The software layer that becomes valuable if inference hardware shifts from GPU-first to AI CPU and custom accelerator designs.</description></item><item><title>NeuralEdge</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/neuraledge/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/neuraledge/</guid><description>Schedule inference around thermal limits and split reflexes on-device from planning in the cloud.</description></item><item><title>Research Topics</title><link>https://www.jonam.io/journal/inference-engineering/research-topics/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/research-topics/</guid><description>Novel and practical research directions around KV cache compression, scheduling, speculation, quantization, and hardware-aware serving.</description></item><item><title>Inference Engineering</title><link>https://www.jonam.io/journal/inference-engineering/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/</guid><description>A living notebook for inference engineering research topics and product ideas.</description></item><item><title>Product Ideas</title><link>https://www.jonam.io/journal/inference-engineering/product-ideas/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://www.jonam.io/journal/inference-engineering/product-ideas/</guid><description>Ten product directions built from KV cache reuse, roofline scheduling, speculative decoding, and inference observability.</description></item></channel></rss>