Google introduces TurboQuant, a new quantization algorithm that reduces LLM key-value cache memory by 6x and delivers up to 8x speedup without accuracy loss. (A generic KV-cache quantization sketch appears after this list.)
Israeli AI startup NeuReality has appointed former Google AI director Shalini Agarwal to guide its NR-NEXUS inference operating system into the market.
PagedAttention emerges as a key solution to the GPU memory bottleneck in large language models, enabling more efficient memory use and higher concurrency in AI inference systems. (A toy paging sketch appears after this list.)
Learn what AI inference chips are, how they work, and why they're crucial for making AI systems faster and more efficient. This explainer covers the basics of inference chips using simple analogies.
Gimlet Labs raises $80 million Series A to solve AI inference bottlenecks across multiple chip architectures including NVIDIA, AMD, and Intel.
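To ground the TurboQuant item above: KV-cache quantization stores the attention keys and values as low-bit integers plus per-channel scales instead of full-precision floats, which is where the memory savings come from. The sketch below is a minimal, generic per-channel 4-bit scheme in NumPy, illustrating the technique class only, not TurboQuant's actual algorithm; the function names, shapes, and bit width are assumptions.

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 4):
    """Generic per-channel symmetric quantization of a KV-cache tensor.

    x: float32 array of shape (tokens, channels).
    Returns integer codes plus one scale per channel for dequantization.
    (Illustrative scheme only; not TurboQuant's method.)
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(x).max(axis=0) / qmax          # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)      # avoid divide-by-zero
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize_kv(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

# Round-trip a toy key cache and measure the reconstruction error that
# "accuracy loss" claims are about.
kv = np.random.randn(128, 64).astype(np.float32)
codes, scale = quantize_kv(kv, bits=4)
err = np.abs(dequantize_kv(codes, scale) - kv).mean()
print(f"mean abs error: {err:.4f}")
```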
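And to ground the PagedAttention item: instead of reserving one contiguous KV buffer per request, the cache is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks allocated on demand and returned when the request finishes. Below is a toy allocator sketching that bookkeeping; the class, method names, and block size are hypothetical, not vLLM's API.

```python
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per physical block (illustrative choice)

class PagedKVCache:
    """Toy block-table allocator illustrating the paging idea:
    logical token positions map to fixed-size physical blocks taken
    from a shared pool, so memory is not reserved up front."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = defaultdict(list)  # seq_id -> [physical block ids]
        self.seq_lens = defaultdict(int)

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block, offset)."""
        pos = self.seq_lens[seq_id]
        if pos % BLOCK_SIZE == 0:               # current block is full
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lens[seq_id] += 1
        return self.block_tables[seq_id][-1], pos % BLOCK_SIZE

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=256)
for _ in range(40):                             # 40 tokens -> 3 blocks
    block, offset = cache.append_token(seq_id=0)
print(cache.block_tables[0])                    # three physical block ids
cache.free(0)                                   # blocks go back to the pool
```

Because blocks are only allocated as sequences actually grow, many requests of unpredictable length can share one GPU memory pool, which is the concurrency gain the article describes.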