Introduction
In this tutorial, you'll learn how to interact with StepAudio 2.5 Realtime, a cutting-edge voice model developed by StepFun. This model is designed for real-time voice interactions and supports both Chinese and English languages. You'll build a simple voice chat application that demonstrates how to connect to the StepAudio API using WebSocket technology.
StepAudio 2.5 Realtime stands out because it supports customizable personas and understands paralinguistic features (like tone and emotion) in speech. By the end of this tutorial, you'll have a working application that can have conversations with the AI using your voice.
Prerequisites
To follow this tutorial, you'll need:
- A basic understanding of Python programming
- Python 3.7 or higher installed on your computer
- Basic knowledge of how WebSocket connections work
- An internet connection
- A microphone and speakers or headphones for audio input/output
Step-by-Step Instructions
1. Set Up Your Python Environment
First, create a new Python project directory and set up a virtual environment to keep your dependencies isolated.
mkdir stepaudio_project
cd stepaudio_project
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
This creates a clean environment for the project, so its dependencies don't interfere with other Python projects on your system.
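If you want to confirm that the environment is actually active before installing anything, a quick check like the following can help. This is a minimal sketch; the function name in_virtualenv is just an illustrative choice, not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```

If this prints False, activate the environment again before continuing.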
2. Install Required Libraries
We'll need several Python libraries to handle WebSocket connections, audio capture, and audio playback:
pip install websockets pyaudio pygame
These libraries provide:
- websockets - For connecting to the StepAudio API via WebSocket
- pyaudio - For capturing audio from your microphone
- pygame - For playing audio output
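Installation can fail quietly on some systems (pyaudio in particular depends on the PortAudio system library), so it may be worth checking that all three packages are importable. The sketch below uses only the standard library; the helper name missing_modules is a hypothetical choice for this example.

```python
import importlib.util

# The three packages installed in this step.
REQUIRED = ("websockets", "pyaudio", "pygame")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```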
3. Create the Main Application Structure
Create a new Python file called stepaudio_client.py and start by importing the necessary modules:
import asyncio
import websockets
import json
import pyaudio
import pygame
import time
This imports the core libraries we'll use for WebSocket communication, audio handling, and timing.
4. Initialize Audio Settings
Before we can capture and play audio, we need to set up the audio parameters:
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
These settings define how audio will be captured:
- CHUNK - The number of audio frames per buffer
- FORMAT - The audio format (16-bit integers)
- CHANNELS - Single channel (mono) audio
- RATE - Sample rate of 16,000 Hz, which is standard for voice models
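To get a feel for what these numbers mean in practice, the following sketch works out how much audio each chunk holds and how large it is on the wire. SAMPLE_WIDTH is a name introduced here for illustration; it is the 2 bytes per sample implied by the 16-bit format.

```python
CHUNK = 1024      # frames per buffer
CHANNELS = 1      # mono
RATE = 16000      # samples per second
SAMPLE_WIDTH = 2  # bytes per sample for 16-bit audio (illustrative name)


def chunk_duration_ms(chunk=CHUNK, rate=RATE):
    """Length of one chunk in milliseconds."""
    return 1000 * chunk / rate


def chunk_size_bytes(chunk=CHUNK, channels=CHANNELS, sample_width=SAMPLE_WIDTH):
    """Size of one chunk in bytes."""
    return chunk * channels * sample_width


print(chunk_duration_ms())  # 64.0
print(chunk_size_bytes())   # 2048
```

With these settings, each message sent to the server carries 64 ms of audio in 2,048 bytes. A smaller CHUNK lowers latency at the cost of more messages per second.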
5. Set Up Audio Capture and Playback
We need to initialize the PyAudio library and create functions to capture and play audio:
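def init_audio():
    # Create the PyAudio instance used to open the microphone stream.
    return pyaudio.PyAudio()
async def record_audio(p, websocket):
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    loop = asyncio.get_running_loop()
    print("Recording... Speak now.")
    try:
        while True:
            # stream.read() blocks, so run it in a worker thread to keep
            # the event loop free to receive responses in the meantime.
            data = await loop.run_in_executor(None, stream.read, CHUNK)
            await websocket.send(data)
    finally:
        # Release the microphone on Ctrl+C, cancellation, or a closed connection.
        print("Stopping recording.")
        stream.stop_stream()
        stream.close()
async def play_audio(audio_data):
    # The caller has already written audio_data to temp_audio.wav,
    # so playback loads the file rather than the raw bytes.
    pygame.mixer.init()
    pygame.mixer.music.load('temp_audio.wav')
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        # Yield to the event loop while the clip plays.
        await asyncio.sleep(0.1)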
This code sets up audio recording and playback functions. The recording function sends audio chunks to the WebSocket connection, while the playback function handles audio output.
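One detail matters when recording and listening run side by side: a blocking call such as a microphone read stalls every other coroutine unless it is moved off the event loop. The sketch below illustrates the idea with a stand-in function, blocking_read, which only sleeps and returns silence. It is a hypothetical placeholder, not a PyAudio call.

```python
import asyncio
import time


def blocking_read(seconds=0.2):
    """Stand-in for a blocking call such as a microphone read."""
    time.sleep(seconds)
    return b"\x00" * 2048


async def count_ticks(stop_event):
    """Count how often this coroutine gets to run until told to stop."""
    ticks = 0
    while not stop_event.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def demo():
    loop = asyncio.get_running_loop()
    stop_event = asyncio.Event()
    ticker = asyncio.ensure_future(count_ticks(stop_event))
    # Run the blocking call in a worker thread so the ticker keeps running.
    data = await loop.run_in_executor(None, blocking_read)
    stop_event.set()
    ticks = await ticker
    return len(data), ticks


if __name__ == "__main__":
    size, ticks = asyncio.run(demo())
    print(f"Read {size} bytes while the ticker ran {ticks} times")
```

If blocking_read were called directly inside demo, the ticker would not run at all until the read finished.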
6. Connect to StepAudio 2.5 Realtime
Now, let's create the main function that connects to the StepAudio API:
async def main():
    uri = "wss://api.stepfun.com/stepaudio/v2.5/realtime"
    async with websockets.connect(uri) as websocket:
        print("Connected to StepAudio 2.5 Realtime")
        # Send initialization message
        init_message = {
            "type": "init",
            "language": "zh",
            "persona": "assistant"
        }
        await websocket.send(json.dumps(init_message))
        # Initialize audio
        p = init_audio()
        # Start recording and listening
        try:
            await asyncio.gather(
                record_audio(p, websocket),
                listen_for_response(websocket)
            )
        except KeyboardInterrupt:
            print("Exiting...")
        finally:
            p.terminate()
This connects to the StepAudio API using the WebSocket protocol. The initialization message tells the API what language to use and what persona to adopt (in this case, an assistant).
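If you plan to switch languages or personas, it can help to build the initialization message in one place. The sketch below reuses the field names from the message above; the helper name build_init_message and the "en" language code are assumptions made for illustration, so check them against the API documentation before relying on them.

```python
import json

# "zh" appears in the tutorial's init message; "en" is an assumed code
# for English, which the model is described as supporting.
SUPPORTED_LANGUAGES = {"zh", "en"}


def build_init_message(language="zh", persona="assistant"):
    """Return the JSON-encoded initialization message."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    return json.dumps({
        "type": "init",
        "language": language,
        "persona": persona,
    })


if __name__ == "__main__":
    print(build_init_message(language="en", persona="assistant"))
```

The result is a string, so it can be passed directly to websocket.send().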
7. Handle Response from the Model
We need to listen for responses from the StepAudio model:
async def listen_for_response(websocket):
    try:
        while True:
            response = await websocket.recv()
            if isinstance(response, bytes):
                # Save audio data to file
                with open('temp_audio.wav', 'wb') as f:
                    f.write(response)
                # Play the audio
                await play_audio(response)
            else:
                print(f"Received message: {response}")
    except websockets.exceptions.ConnectionClosed:
        print("Connection closed")
This function listens for messages from the model. Binary messages are treated as audio, saved to temp_audio.wav, and played back; anything else is printed as a text message.
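Writing the received bytes straight to a .wav file only works if they already form a complete WAV file. If the server were to send raw 16-bit PCM samples instead (an assumption here, not something this tutorial confirms), the data would need a WAV header before pygame could load it. A minimal sketch using the standard wave module, with pcm_to_wav_bytes as a hypothetical helper name:

```python
import io
import wave


def pcm_to_wav_bytes(pcm, channels=1, sample_width=2, rate=16000):
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    silence = b"\x00\x00" * 16000  # one second of silence at 16 kHz
    wav_bytes = pcm_to_wav_bytes(silence)
    print(f"{len(silence)} PCM bytes became {len(wav_bytes)} WAV bytes")
```

The defaults match the capture settings used earlier (mono, 16-bit, 16,000 Hz); the format of the model's output may differ.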
8. Run the Application
Add the final execution block to your script:
if __name__ == "__main__":
    asyncio.run(main())
This ensures that the main function runs when you execute the script.
Summary
In this tutorial, you've built a basic voice chat application that connects to StepAudio 2.5 Realtime. You learned how to:
- Set up a Python virtual environment
- Install necessary libraries for audio and WebSocket communication
- Initialize audio capture and playback using PyAudio and Pygame
- Connect to the StepAudio WebSocket API
- Send voice input and receive voice responses
This application demonstrates the real-time capabilities of StepAudio 2.5 Realtime, which supports both Chinese and English languages and can be customized with different personas. The model's ability to understand paralinguistic features means it can respond appropriately to tone and emotion in your speech.
While this is a simplified example, it shows the core concepts needed to work with voice models like StepAudio. In practice, you might want to add features like:
- Speech-to-text conversion for text-based interaction
- More sophisticated audio processing
- Integration with other AI services
- Improved error handling and connection management
This foundation gives you the tools to build more complex voice-based applications using StepAudio 2.5 Realtime.
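As a starting point for the error handling and connection management item above, the sketch below retries a session with exponential backoff. The function name run_with_retries and its parameters are illustrative, and the session argument stands for any coroutine function such as main. Note that the default retry_on tuple covers only built-in connection errors; other exception types, such as those raised by the websockets library, would need to be passed in explicitly.

```python
import asyncio


async def run_with_retries(session, max_attempts=5, base_delay=1.0,
                           max_delay=30.0,
                           retry_on=(ConnectionError, OSError)):
    """Run session(), retrying with exponential backoff on failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return await session()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # Give up after the final attempt.
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)
            # Double the wait each time, up to max_delay.
            delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    attempts = {"count": 0}

    async def flaky_session():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("connection dropped")
        return "session finished"

    print(asyncio.run(run_with_retries(flaky_session, base_delay=0.1)))
```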



