Tag

#quantization

7 articles

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory

This article explains how Google DeepMind's Gemma 4 QAT checkpoints, particularly the Q4_0 and mobile formats, optimize large language models for edge deployment by reducing memory usage and computational requirements through advanced quantization techniques.

Jun 5 · 26

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI open-sources OSCAR, an attention-aware 2-bit KV cache quantization system that significantly reduces memory usage and improves decoding speed for long-context LLMs.

May 25 · 47

A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor

Learn to compress instruction-tuned language models using FP8, GPTQ, and SmoothQuant quantization techniques with llmcompressor, and benchmark their performance.

May 17 · 42

A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning

This article explains how Microsoft's Phi-4-Mini AI model uses quantization, RAG, and LoRA techniques to create efficient, powerful language models that can answer questions and use tools.

Apr 20 · 53

An End-to-End Coding Guide to Running OpenAI GPT-OSS Open-Weight Models with Advanced Inference Workflows

This article explains open-weight language models, how they work, and why they matter for making AI more accessible to everyone.

Apr 17 · 66

What Google's TurboQuant can and can't do for AI's spiraling cost

This explainer explores Google's TurboQuant technology, a real-time quantization approach that reduces AI computational costs and enables local deployment of large models.

Mar 30 · 105

Multiverse Computing pushes its compressed AI models into the mainstream

Learn about model compression techniques that reduce the size and computational requirements of large AI models while maintaining performance, enabling broader AI deployment.

Mar 18 · 95