Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI open-sources OSCAR, an attention-aware 2-bit KV cache quantization system that significantly reduces memory usage and improves decoding speed for long-context LLMs.

Together AI has open-sourced OSCAR, a 2-bit key-value (KV) cache quantization system built for long-context large language model (LLM) serving. It targets one of the most pressing problems in LLM deployment: the memory and compute cost of holding the KV cache as context lengths grow. Unlike methods that rely on data-oblivious transformations such as Hadamard matrices, OSCAR takes an attention-aware approach, computing distinct rotations for keys and values from covariance statistics estimated offline.
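To make the contrast between a data-oblivious rotation and a data-dependent one concrete, here is a minimal NumPy sketch. It is not OSCAR's implementation: the article does not say how OSCAR turns its covariance estimate into a rotation, so the eigenvector basis used in `covariance_rotation` is only a stand-in, and all function names and the per-row 4-level quantizer are assumptions made for illustration.

```python
import numpy as np

def hadamard_rotation(n: int) -> np.ndarray:
    """Data-oblivious rotation: normalized Sylvester Hadamard matrix (n a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # scaling makes the matrix orthogonal

def covariance_rotation(calib: np.ndarray) -> np.ndarray:
    """Data-dependent rotation (assumed stand-in): eigenvectors of the
    covariance of offline calibration vectors, shape (samples, dim)."""
    cov = np.cov(calib, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # columns are orthonormal
    return eigvecs

def quantize_2bit(x: np.ndarray):
    """Uniform 4-level quantization with a per-row scale and offset."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

def roundtrip(x: np.ndarray, rot: np.ndarray) -> np.ndarray:
    """Rotate, quantize to 2 bits, dequantize, rotate back."""
    codes, scale, lo = quantize_2bit(x @ rot)
    return dequantize(codes, scale, lo) @ rot.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spread = np.linspace(0.1, 3.0, 16)            # synthetic per-channel scales
    calib = rng.normal(size=(512, 16)) * spread   # stands in for calibration keys
    keys = rng.normal(size=(64, 16)) * spread
    for name, rot in [("hadamard", hadamard_rotation(16)),
                      ("covariance", covariance_rotation(calib))]:
        err = np.mean((roundtrip(keys, rot) - keys) ** 2)
        print(f"{name}: mean squared reconstruction error = {err:.4f}")
```

Both rotations are orthogonal, so they can be undone exactly after dequantization; the only difference is whether the matrix is fixed in advance or derived from calibration data. The sketch says nothing about which choice gives lower error on real key and value tensors.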

Technical Breakthrough and Performance Gains

OSCAR operates at just 2.28 bits per KV element while keeping accuracy close to full precision. According to benchmarks, the system narrows the accuracy gap with full BF16 precision to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B, which indicates that model quality largely survives even under extreme quantization. OSCAR also achieves approximately an 8x reduction in KV memory usage and up to a 3x speedup in decoding, particularly at 100K context lengths, a crucial advantage for real-world long-context LLM serving.
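The memory figures can be put in perspective with a back-of-the-envelope calculation. The sketch below assumes a hypothetical model shape (36 layers, 8 KV heads, head dimension 128); these numbers are illustrative and are not taken from the article or from any OSCAR benchmark configuration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Approximate KV cache size for one sequence, in bytes."""
    # Factor of 2: one key vector and one value vector per token, layer and head.
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * bits_per_element / 8

# Hypothetical model shape, chosen only for illustration.
cfg = dict(layers=36, kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(100_000, bits_per_element=16, **cfg)
low_bit = kv_cache_bytes(100_000, bits_per_element=2.28, **cfg)

print(f"BF16 cache:     {bf16 / 1e9:.2f} GB")
print(f"2.28-bit cache: {low_bit / 1e9:.2f} GB")
print(f"ratio:          {bf16 / low_bit:.2f}x")
```

Under these assumptions, a single 100K-token sequence needs about 14.7 GB of KV cache in BF16 and about 2.1 GB at 2.28 bits per element. By bit count alone, 16 / 2.28 is roughly 7x, while a nominal 2-bit payload gives exactly 8x, which is in line with the approximate 8x figure reported above.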

Implications for the LLM Ecosystem

The introduction of OSCAR is a notable development in the ongoing effort to make LLMs more scalable and accessible. As models grow larger and context windows expand, memory and compute demands climb steeply: the KV cache grows linearly with context length and can come to dominate accelerator memory at long contexts. By enabling efficient KV cache quantization without sacrificing much accuracy, OSCAR could significantly lower the barrier to deploying long-context models in production environments. The release also underscores the value of attention-aware optimization techniques, which may inspire further work on quantization and model compression.

With this release, Together AI continues to push the boundaries of efficient AI inference, offering developers and researchers a powerful new tool to tackle the challenges of long-context LLM serving.
