Google introduces TurboQuant, a new quantization algorithm that reduces LLM key-value cache memory by 6x and delivers up to 8x speedup without accuracy loss. (A generic KV-cache quantization sketch appears after this list.)
Israeli AI startup NeuReality has appointed former Google AI director Shalini Agarwal to guide its NR-NEXUS inference operating system into the market.
PagedAttention emerges as a key solution to the GPU memory bottleneck in large language models, enabling more efficient memory use and higher concurrency in AI inference systems. (A toy paging sketch appears after this list.)
Learn what AI inference chips are, how they work, and why they're crucial for making AI systems faster and more efficient. This explainer covers the basics of inference chips using simple analogies.
Gimlet Labs raises $80 million Series A to solve AI inference bottlenecks across multiple chip architectures including NVIDIA, AMD, and Intel.
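To ground the TurboQuant item above: KV-cache quantization stores the attention keys and values as low-bit integers plus per-channel scales instead of full-precision floats, which is where the memory savings come from. The sketch below is a minimal, generic per-channel 4-bit scheme in NumPy, illustrating the technique class only, not TurboQuant's actual algorithm; the function names, shapes, and bit width are assumptions.

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 4):
    """Generic per-channel symmetric quantization of a KV-cache tensor.

    x: float32 array of shape (tokens, channels).
    Returns integer codes plus one scale per channel for dequantization.
    (Illustrative scheme only; not TurboQuant's method.)
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(x).max(axis=0) / qmax          # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)      # avoid divide-by-zero
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize_kv(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

# Round-trip a toy key cache and measure the reconstruction error that
# "accuracy loss" claims are about.
kv = np.random.randn(128, 64).astype(np.float32)
codes, scale = quantize_kv(kv, bits=4)
err = np.abs(dequantize_kv(codes, scale) - kv).mean()
print(f"mean abs error: {err:.4f}")
```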
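And to ground the PagedAttention item: instead of reserving one contiguous KV buffer per request, the cache is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks allocated on demand and returned when the request finishes. Below is a toy allocator sketching that bookkeeping; the class, method names, and block size are hypothetical, not vLLM's API.

```python
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per physical block (illustrative choice)

class PagedKVCache:
    """Toy block-table allocator illustrating the paging idea:
    logical token positions map to fixed-size physical blocks taken
    from a shared pool, so memory is not reserved up front."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = defaultdict(list)  # seq_id -> [physical block ids]
        self.seq_lens = defaultdict(int)

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block, offset)."""
        pos = self.seq_lens[seq_id]
        if pos % BLOCK_SIZE == 0:               # current block is full
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lens[seq_id] += 1
        return self.block_tables[seq_id][-1], pos % BLOCK_SIZE

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=256)
for _ in range(40):                             # 40 tokens -> 3 blocks
    block, offset = cache.append_token(seq_id=0)
print(cache.block_tables[0])                    # three physical block ids
cache.free(0)                                   # blocks go back to the pool
```

Because blocks are only allocated as sequences actually grow, many requests of unpredictable length can share one GPU memory pool, which is the concurrency gain the article describes.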