Tag
2 articles
Together AI open-sources OSCAR, an attention-aware 2-bit KV cache quantization system that significantly reduces memory usage and improves decoding speed for long-context LLMs.
Learn how TriAttention, a new method, compresses the memory that large language models use, making them 2.5x faster without losing accuracy.