Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
Updated Jun 17, 2026
Self-hosted AI agent OS. Your memory, chat, agents, and files stay on hardware you own, offline by default, cloud by choice. Offline AI memory (taOSmd), self-hosted multi-framework group chat, a full web desktop + app store, and auto-clustering across the consumer hardware you already have (Orange/Raspberry Pi, Mac mini, gaming PC).
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Empirical study of KV-cache quantization in self-forcing video generation
W4A4 and INT8 KV-cache quantization for Infinity VAR models. Optimized for high-fidelity generative AI deployment on edge GPUs (e.g. NVIDIA Jetson).
Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights
Evaluation harness and norm-direction method for KV cache compression. Cross-model worst-case quality metrics.
Reproduction of TurboQuant.
MLX-native port of KVarN — variance-normalized KV-cache quantization for Apple Silicon. 3.3× compression at 71% FP16 speed, matches FP16 on GSM8K within ~2%.
Deploy Nemotron 3 Nano 30B with 1M context window on NVIDIA DGX Spark using llama.cpp (Blackwell sm_121, Q4_0 KV cache quantization)
Neutral tensor-level KV-cache quantization benchmark: 11 methods as first-class peers, per-metric leaderboards, matched-budget Pareto, multi-seed CI, and per-method fidelity tags.