Google speeds up Gemma 4 threefold with multi-token prediction

Learn how to implement multi-token prediction for text generation using Google's Gemma 4 model, demonstrating how generating multiple tokens simultaneously can speed up text generation by up to three times.

Introduction

In this tutorial, you'll learn how to implement multi-token prediction for text generation using Google's Gemma 4 open model. Multi-token prediction is a technique that allows models to generate multiple words or tokens at once, rather than one at a time, significantly speeding up text generation. This approach uses a small auxiliary model to suggest several tokens simultaneously while the main model validates them in a single pass.
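The draft-and-verify loop described above can be sketched with toy stand-ins. The functions `target_next` and `draft_next` below are hypothetical placeholders for the main model and the drafter (they are not Gemma APIs), and verification is simulated by counting one pass per drafted chunk, on the assumption that a real transformer scores all drafted positions in a single forward pass.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Toy 'main model': the next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def draft_next(context: List[Token]) -> Token:
    """Toy 'drafter': agrees with the main model except after token 4."""
    last = context[-1]
    return 0 if last == 4 else (last + 1) % 10


def greedy_decode(prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one main-model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


def draft_and_verify(prompt: List[Token], n: int, k: int) -> Tuple[List[Token], int]:
    """Generate n tokens; return (tokens, simulated main-model passes)."""
    out = list(prompt)
    goal = len(prompt) + n
    passes = 0
    while len(out) < goal:
        # 1. The drafter proposes up to k tokens, one after another.
        draft: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            draft.append(draft_next(out + draft))

        # 2. One simulated main-model pass checks every drafted position.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # The first mismatch is replaced by the main model's token.
                accepted.append(expected)
                break
        else:
            # Every draft was accepted; the same pass yields one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out, passes


if __name__ == "__main__":
    tokens, passes = draft_and_verify([0], n=12, k=4)
    print(tokens, "main-model passes:", passes)
```

Because every accepted token is checked against the main model's own greedy choice, the output matches plain greedy decoding; only the number of main-model passes changes.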

By the end of this tutorial, you'll have built a simple text generation system that illustrates the concepts behind multi-token prediction, the technique Google reports speeds up Gemma 4 text generation by up to three times.

Prerequisites

Before starting this tutorial, you'll need:

A computer with Python 3.8 or higher installed
Basic understanding of Python programming
Internet connection to download model files
Several gigabytes of free disk space for model files (a 2B-parameter checkpoint stored in 16-bit precision takes roughly 4GB)

Step-by-Step Instructions

1. Set up your Python environment

First, create a new directory for this project and set up a virtual environment to keep dependencies isolated:

mkdir gemma_multitoken_tutorial
cd gemma_multitoken_tutorial
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

This creates a clean environment where we can install the required packages without affecting your system Python.

2. Install required packages

Install the necessary libraries for working with machine learning models:

pip install torch transformers accelerate

We're installing PyTorch for deep learning operations, the Transformers library for easy model loading, and Accelerate for efficient model management.
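Before moving on, it can help to confirm that the three packages are importable from the active virtual environment. The following is a minimal sketch using only the standard library; `check_packages` is a hypothetical helper name, not part of any of the installed libraries.

```python
import importlib.util
from importlib import metadata


def check_packages(names=("torch", "transformers", "accelerate")):
    """Map each package name to its installed version, or None if missing."""
    status = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            status[name] = None  # not importable in this environment
        else:
            try:
                status[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                # Importable, but no distribution metadata under this name.
                status[name] = "unknown"
    return status


if __name__ == "__main__":
    for name, version in check_packages().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, check that the virtual environment is activated and rerun the pip command.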

3. Create a basic text generation class

Create a file called multitoken_generator.py and add the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class MultiTokenGenerator:
    def __init__(self, model_name="google/gemma-4-2b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.model.eval()
        
    def generate_text(self, prompt, max_new_tokens=50, num_beams=1):
        inputs = self.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=num_beams,
                do_sample=False
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the basic implementation
if __name__ == "__main__":
    generator = MultiTokenGenerator()
    result = generator.generate_text("The future of AI is", max_new_tokens=30)
    print(result)

This sets up a basic text generator that loads the Gemma model and generates text from a given prompt. The generate_text method handles tokenization and model inference.

4. Implement multi-token prediction logic

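Now, replace the contents of multitoken_generator.py with the following class, which generates text in two chunks to illustrate the idea behind multi-token prediction: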

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class MultiTokenPredictor:
    def __init__(self, model_name="google/gemma-4-2b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.model.eval()
        
    def multi_token_generate(self, prompt, max_new_tokens=50, num_tokens=5):
        inputs = self.tokenizer(prompt, return_tensors="pt")
        # Clamp the first chunk so both chunks together never exceed max_new_tokens
        first_chunk = min(num_tokens, max_new_tokens)

        with torch.no_grad():
            # First chunk: generate up to first_chunk tokens
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=first_chunk,
                do_sample=False,
                num_beams=1
            )

            # Second chunk: continue from the token IDs directly. Decoding to
            # text and re-tokenizing can change the token sequence.
            remaining = max_new_tokens - first_chunk
            if remaining > 0:
                outputs = self.model.generate(
                    input_ids=outputs,
                    attention_mask=torch.ones_like(outputs),
                    max_new_tokens=remaining,
                    do_sample=False,
                    num_beams=1
                )

        # outputs[0] holds the prompt followed by every generated token
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the multi-token implementation
if __name__ == "__main__":
    predictor = MultiTokenPredictor()
    result = predictor.multi_token_generate("The impact of artificial intelligence on society", max_new_tokens=60, num_tokens=10)
    print(result)

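This implementation splits generation into two calls to generate: a first chunk of num_tokens tokens, then the remainder. Note that each call still runs one forward pass per generated token, so chunking alone does not reduce the work the model does. In real multi-token prediction, the speedup comes from a small drafter proposing several tokens that the main model then verifies in a single pass.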
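For actual draft-and-verify decoding, Hugging Face Transformers offers assisted generation through the `assistant_model` argument of `generate`. The sketch below shows that call under stated assumptions: `generate_with_drafter` and `load_and_run` are hypothetical helper names, `main_id` and `drafter_id` are placeholders for checkpoint ids you supply, and whether Gemma 4's drafter is packaged for this code path is an assumption not confirmed here.

```python
def generate_with_drafter(model, assistant, tokenizer, prompt, max_new_tokens=30):
    """Greedy generation where `assistant` drafts tokens and `model` verifies them."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        assistant_model=assistant,  # enables assisted (draft-and-verify) generation
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def load_and_run(main_id, drafter_id, prompt):
    """Load both checkpoints and run assisted generation.

    main_id and drafter_id are placeholders. This sketch assumes the drafter
    shares the main model's tokenizer.
    """
    # Imported lazily so the helper above can be reused without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(main_id)
    model = AutoModelForCausalLM.from_pretrained(main_id).eval()
    assistant = AutoModelForCausalLM.from_pretrained(drafter_id).eval()
    with torch.no_grad():
        return generate_with_drafter(model, assistant, tokenizer, prompt)
```

With greedy decoding, assisted generation is designed to return the same text as the main model alone; the drafter only affects how many main-model passes are needed.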

5. Create a demonstration script

Create a file called demo.py to time generation with two different chunk sizes:

from multitoken_generator import MultiTokenPredictor
import time

# Initialize our predictor
predictor = MultiTokenPredictor()

# Test prompt
prompt = "Machine learning is a subset of artificial intelligence that"

print("Testing single token vs multi-token generation:")
print(f"Prompt: {prompt}")
print("\n" + "="*50)

# Warm-up run so one-time setup cost doesn't skew the first measurement
predictor.multi_token_generate(prompt, max_new_tokens=5, num_tokens=1)

# Single token generation
start_time = time.perf_counter()
result1 = predictor.multi_token_generate(prompt, max_new_tokens=30, num_tokens=1)
single_token_time = time.perf_counter() - start_time

print(f"Single token result: {result1}")
print(f"Time taken: {single_token_time:.2f} seconds")

# Multi-token generation
start_time = time.perf_counter()
result2 = predictor.multi_token_generate(prompt, max_new_tokens=30, num_tokens=5)
multi_token_time = time.perf_counter() - start_time

print(f"\nMulti-token result: {result2}")
print(f"Time taken: {multi_token_time:.2f} seconds")
print(f"\nTime ratio (single / multi): {single_token_time/multi_token_time:.1f}x")

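This script times the same method with num_tokens=1 and num_tokens=5. Both runs produce 30 tokens with one forward pass per token, so treat any measured difference as noise or caching and warm-up effects rather than a true multi-token speedup.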
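Single wall-clock measurements are noisy, so a fairer comparison runs each setting several times and reports the median. The following is a minimal sketch of such a harness; `benchmark` is a hypothetical helper name, and the default repeat counts are arbitrary choices.

```python
import statistics
import time


def benchmark(fn, repeats=5, warmup=1):
    """Return the median wall-clock time in seconds of calling fn()."""
    # Warm-up calls absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload; swap in a call to your generator to time it.
    print(f"{benchmark(lambda: sum(range(100_000))):.6f} seconds")
```

For example, `benchmark(lambda: predictor.multi_token_generate(prompt, max_new_tokens=30, num_tokens=5))` would time the multi-token setting from demo.py.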

6. Run the demonstration

Execute the demonstration script to see the multi-token prediction in action:

python demo.py

You should see the generated text and timing for both settings, followed by the ratio between the two timings. Expect the ratio to be close to 1, since this simplified demo does not use a drafter.

Summary

In this tutorial, you've explored multi-token prediction for text generation using Google's Gemma 4 model. You've built a system that splits generation into chunks to illustrate the idea; the speedup of up to three times that Google reports comes from pairing the main model with a drafter, not from chunking alone.

The key concepts covered include:

Loading and using pre-trained language models with Hugging Face Transformers
Implementing chunked token generation
Measuring performance improvements
Understanding how auxiliary models can suggest multiple tokens

Google's approach in Gemma 4 pairs a small auxiliary model, which suggests several tokens at once, with the main model, which validates them in a single pass. The demonstration here is a simplified stand-in for that technique rather than a reproduction of it.