Google quietly launched an AI dictation app that works offline

April 6, 2026 · 5 views · 5 min read

Learn to build an offline speech-to-text application using Google's Gemma AI models with real-time audio capture and local inference capabilities.

Introduction

In this tutorial, you'll learn how to create an offline speech-to-text application using Google's Gemma AI models. This practical guide will walk you through building a dictation app that works without internet connectivity, similar to Google's new offline dictation app. We'll explore the core concepts of local AI inference, model loading, and real-time audio processing using Python and the Hugging Face Transformers library.

Prerequisites

  • Python 3.8 or higher installed on your system
  • Basic understanding of Python programming and machine learning concepts
  • Installed packages: transformers, torch, sounddevice, scipy, numpy
  • At least 8GB RAM and a modern CPU for efficient local inference
  • Basic knowledge of audio processing concepts
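
Before diving in, a quick sanity check can save debugging time later. The snippet below is a small convenience sketch (not part of the app itself) that verifies the Python version and reports any required package that is not importable:

```python
import importlib.util
import sys

# Packages the tutorial depends on
REQUIRED = ["transformers", "torch", "sounddevice", "scipy", "numpy"]

def missing_packages(names):
    # find_spec returns None when a package cannot be imported
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    assert sys.version_info >= (3, 8), "Python 3.8+ is required"
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it once after installation; an empty "Missing packages" list means you are ready to proceed.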

Step 1: Setting Up Your Development Environment

Install Required Dependencies

First, we need to install all necessary packages for our offline dictation app. The key libraries include transformers for model handling, torch for GPU/CPU operations, and audio processing tools.

pip install transformers torch sounddevice scipy numpy

Why this step? These packages provide the essential building blocks for local AI inference and audio processing. Transformers handles model loading and inference, while sounddevice enables real-time audio capture.

Step 2: Loading the Gemma AI Model

Initialize the Model and Tokenizer

We'll load a speech-to-text model, which needs to be downloaded once and then cached locally for offline use. One caveat: the `google/gemma-2b-it` checkpoint is a text model and cannot transcribe audio directly, so the recognition step below uses a dedicated speech-to-text checkpoint; a Gemma model can be layered on top afterwards for text cleanup.

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import pipeline
import torch

# Note: "google/gemma-2b-it" is a text-only checkpoint and cannot be loaded
# with AutoModelForSpeechSeq2Seq. A dedicated speech-to-text model handles
# transcription here; a Gemma model can post-process the text afterwards.
model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained(model_id)

# Set up the pipeline for speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0 if torch.cuda.is_available() else "cpu"
)

Why this step? This initializes our AI model with the appropriate configuration for offline speech recognition. We're using the bfloat16 data type for efficient memory usage and setting up the pipeline for automatic speech recognition tasks.
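
Since the goal is offline operation, it helps to make the cache-only behavior explicit. The helper below is a sketch: `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are environment variables that stop Transformers from making network requests, so loading fails fast if the model was never cached (you can also pass `local_files_only=True` to `from_pretrained`):

```python
import os

def enable_offline_mode():
    # After the first (online) run has cached the checkpoint, these variables
    # force transformers/huggingface_hub to read only from the local cache
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    return {k: os.environ[k] for k in ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")}

# Call this before any from_pretrained(...) to guarantee no network access
# enable_offline_mode()
```

Run the app once with connectivity to populate the cache, then enable offline mode for all later runs.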

Step 3: Creating Audio Capture Functionality

Implement Real-time Audio Recording

Next, we'll create a function to capture audio from your microphone in real-time. This is crucial for building a dictation app that works as you speak.

import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write

# Audio recording parameters
SAMPLE_RATE = 16000
DURATION = 10  # seconds

# Function to record audio
def record_audio(duration=DURATION, sample_rate=SAMPLE_RATE):
    print("Recording... Speak now")
    # dtype="int16" records 16-bit PCM, the most widely supported WAV format
    audio_data = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype="int16")
    sd.wait()  # Wait until recording is finished
    print("Recording complete")
    return audio_data

# Function to save audio to file
def save_audio(data, filename="recording.wav"):
    write(filename, SAMPLE_RATE, data)
    print(f"Audio saved to {filename}")

Why this step? Real-time audio capture is fundamental to dictation apps. We're using sounddevice to capture audio at 16kHz sample rate, which is optimal for speech recognition tasks.
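
As an aside, the transcription pipeline can also consume raw NumPy audio directly (as a dict with `raw` and `sampling_rate` keys), which skips the intermediate WAV file. If you record 16-bit PCM samples, they first need scaling to float32 in [-1.0, 1.0]; a minimal sketch:

```python
import numpy as np

def to_float32(audio_int16):
    # Scale 16-bit PCM samples into float32 in [-1.0, 1.0],
    # the range speech pipelines expect for raw arrays
    return audio_int16.astype(np.float32) / 32768.0

# Example: feed raw audio to the ASR pipeline without a temporary file
# result = pipe({"sampling_rate": SAMPLE_RATE, "raw": to_float32(audio_data[:, 0])})
```

This is handy for lower-latency dictation, since no disk write is needed between recording and transcription.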

Step 4: Implementing Speech Recognition Pipeline

Process Audio and Generate Text

Now we'll create the core functionality that converts audio to text using our loaded Gemma model.

# Uses the `pipe` object created in Step 2

# Function to transcribe audio
def transcribe_audio(audio_file):
    try:
        # Load audio file
        result = pipe(audio_file)
        return result["text"]
    except Exception as e:
        print(f"Error during transcription: {e}")
        return None

# Complete transcription function
def dictation_app():
    # Record audio
    audio_data = record_audio()
    
    # Save audio
    save_audio(audio_data)
    
    # Transcribe audio
    text = transcribe_audio("recording.wav")
    
    if text:
        print("\nTranscribed text:")
        print(text)
        return text
    else:
        print("\nTranscription failed")
        return None

Why this step? This pipeline connects our audio capture with the AI model's transcription capabilities. The function handles the complete workflow from recording to text output.
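
Real dictation apps usually post-process the raw transcript before showing it, for example by mapping spoken commands to punctuation. The command table below is a hypothetical example (the model itself just emits words); a minimal sketch of that cleanup step:

```python
import re

# Hypothetical spoken-command table; extend as needed
SPOKEN_COMMANDS = {
    "comma": ",",
    "period": ".",
    "question mark": "?",
}

def apply_spoken_commands(text):
    for phrase, symbol in SPOKEN_COMMANDS.items():
        # Replace the spoken phrase (as whole words) with its symbol,
        # absorbing the space before it
        text = re.sub(r"\s*\b" + re.escape(phrase) + r"\b", symbol, text, flags=re.IGNORECASE)
    # Tidy any space left before punctuation
    return re.sub(r"\s+([,.?])", r"\1", text).strip()
```

You could call this on the return value of `transcribe_audio` before printing, e.g. turning "hello comma world period" into "hello, world."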

Step 5: Optimizing for Offline Performance

Configure Model for Local Inference

To ensure optimal offline performance, we need to optimize our model loading and inference parameters.

# Optimized model loading for offline use (Gemma text checkpoints are not
# speech models, so the dedicated speech-to-text backbone is kept here)
model_id = "openai/whisper-small"

# Configure model for offline inference; local_files_only=True guarantees
# the cached copy is used and no network request is made
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    local_files_only=True
)

# Put the model in inference mode (disables dropout, etc.)
model.eval()

# Configure processor
processor = AutoProcessor.from_pretrained(model_id, local_files_only=True)

# Set up pipeline with optimized parameters
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0 if torch.cuda.is_available() else "cpu",
    return_timestamps=False
)

Why this step? These optimizations ensure our app runs efficiently offline by using bfloat16 precision, loading from safetensors for faster loading, and setting appropriate inference parameters.
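
The device and dtype choice can also be made explicit instead of inlined. The helper below is an illustrative sketch of that decision logic (`inference_config` is a name invented for this tutorial, not a Transformers API; the values mirror the `from_pretrained`/`pipeline` arguments used above):

```python
def inference_config(cuda_available, supports_bf16):
    # GPU with bfloat16 support: fast and memory-efficient
    if cuda_available and supports_bf16:
        return {"device": 0, "dtype": "bfloat16"}
    # GPU without bfloat16: fall back to float16
    if cuda_available:
        return {"device": 0, "dtype": "float16"}
    # CPU-only machines: float32 is the safest default
    return {"device": "cpu", "dtype": "float32"}

# In practice, pass the real capability flags:
# cfg = inference_config(torch.cuda.is_available(),
#                        torch.cuda.is_available() and torch.cuda.is_bf16_supported())
```

Centralizing the choice makes it easy to test the fallback paths without needing each kind of hardware.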

Step 6: Testing Your Dictation App

Run and Validate Your Implementation

Finally, let's test our complete dictation app with a simple execution script.

# Main execution
if __name__ == "__main__":
    print("Starting Offline Dictation App")
    print("================================")
    
    # Run dictation
    transcribed_text = dictation_app()
    
    if transcribed_text:
        print("\nDictation completed successfully!")
        print(f"\nFinal text: {transcribed_text}")
    else:
        print("\nDictation failed. Please try again.")

Why this step? Testing validates our implementation and ensures all components work together properly. This is where we verify that our offline dictation app functions as expected.
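
If you want to smoke-test the pipeline on a machine without a microphone (a CI runner, for instance), you can generate a synthetic WAV file with only the standard library and feed it to `transcribe_audio`. A pure tone won't produce meaningful text, but it exercises the full file-to-text path. A sketch:

```python
import math
import struct
import wave

def make_test_wav(path, seconds=1.0, sample_rate=16000, freq=440.0):
    # Write a mono, 16-bit PCM sine wave so the pipeline can be exercised
    # without any audio hardware
    n_frames = int(seconds * sample_rate)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(0.2 * 32767 * math.sin(2 * math.pi * freq * i / sample_rate)))
            for i in range(n_frames)
        )
        wf.writeframes(frames)
    return path

# Usage: text = transcribe_audio(make_test_wav("tone.wav"))
```

This also makes it easy to build regression tests later by swapping the tone for a recorded utterance with a known transcript.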

Summary

In this tutorial, you've learned how to build an offline speech-to-text application using Google's Gemma AI models. You've covered key concepts including model loading, audio capture, real-time processing, and offline optimization. The application you've built can process speech locally without internet connectivity, similar to Google's new offline dictation app.

This implementation demonstrates the power of local AI inference and shows how modern AI models can be deployed for privacy-focused, offline applications. The skills you've learned can be extended to create more sophisticated voice assistants, transcription services, or accessibility tools.

Key takeaways include understanding how to work with Hugging Face Transformers for offline AI inference, implementing real-time audio capture, and optimizing models for local execution. These techniques form the foundation for building robust, privacy-preserving AI applications.
