Tag

#KV cache

3 articles

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

As the KV cache's memory footprint outgrows that of the model weights in large language models, three compression techniques (TurboQuant, OSCAR, and EpiCache) are emerging as key contenders. Each takes a distinct approach to optimization, and they are seen as complementary rather than competitive.

Jun 18

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI open-sources OSCAR, an attention-aware 2-bit KV cache quantization system that significantly reduces memory usage and improves decoding speed for long-context LLMs.

May 25

Researchers from MIT, NVIDIA, and Zhejiang University Propose TriAttention: A KV Cache Compression Method That Matches Full Attention at 2.5× Higher Throughput

Learn how TriAttention compresses the KV cache in large language models to reach 2.5× higher throughput while matching the accuracy of full attention.

Apr 11