Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Learn how to interact with Alibaba's Qwen3.5-LiveTranslate-Flash real-time multimodal translation model using the WebSocket API. Build a Python application that sends audio input and receives translated speech output.

Introduction

In this tutorial, you'll learn how to interact with Alibaba's Qwen3.5-LiveTranslate-Flash model using the WebSocket API provided through Alibaba Cloud Model Studio. This real-time multimodal translation model supports 60 input languages and 29 output languages, with a stated latency of 2.8 seconds. You'll build a simple Python application that connects to the WebSocket endpoint, sends audio input, and receives translated speech output.

Prerequisites

Basic understanding of Python programming
Python 3.7 or higher installed (recent websockets releases require a newer Python version, so check the library's documentation)
Alibaba Cloud account with access to Model Studio
WebSocket client library (websockets)
Audio file in WAV format for testing

Step-by-Step Instructions

1. Setting Up Your Environment

1.1 Install Required Libraries

First, install the necessary Python packages:

pip install websockets

This library provides the WebSocket client functionality needed to connect to Alibaba's API.

1.2 Obtain API Credentials

Before proceeding, you'll need to access Alibaba Cloud Model Studio and obtain your API credentials. Navigate to the Qwen3.5-LiveTranslate-Flash model page and generate an API key. Store this key securely, as you'll need it in your code.
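One way to keep the key out of your source code is to read it from an environment variable and stop early when it is missing. The sketch below assumes the variable name ALIBABA_API_KEY used later in this tutorial; the helper name load_api_key is illustrative and not part of any Alibaba SDK.

```python
import os


def load_api_key(var_name="ALIBABA_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a readable message if the variable is
    unset or empty, so the failure shows up before any network call.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it before running the client."
        )
    return key
```

Failing at startup with a clear message is usually easier to debug than an authentication error returned by the server halfway through a session.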

2. Creating the WebSocket Client

2.1 Initialize the WebSocket Connection

Create a Python script named live_translate_client.py and start by setting up the basic connection:

import asyncio
import websockets
import json
import base64
import os

# Read the API key from the ALIBABA_API_KEY environment variable
API_KEY = os.getenv('ALIBABA_API_KEY')

async def connect_to_model():
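    # Endpoint as given in this tutorial; confirm the current URL in the
    # Model Studio documentation before relying on it
    uri = "wss://qwen3-5-livetranslate-flash.cn-hangzhou.aliyuncs.com/ws"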
    
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    
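    # Note: websockets 14 and later name this parameter additional_headers;
    # older releases use extra_headers as shown here
    async with websockets.connect(uri, extra_headers=headers) as websocket: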
        print("Connected to Qwen3.5-LiveTranslate-Flash")
        # Further implementation will go here
        
asyncio.run(connect_to_model())

This establishes the connection to the model's WebSocket endpoint using your API key for authentication.
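The keyword used to pass handshake headers depends on the installed websockets release. To the best of my understanding, releases from 14.0 onward make the new asyncio client the default, and that client expects additional_headers, while older releases expect extra_headers. The sketch below picks the keyword from a version string; the helper names header_kwarg and connect_kwargs are hypothetical, and the version cutoff is an assumption worth checking against the websockets changelog.

```python
def header_kwarg(version):
    """Pick the keyword argument name for handshake headers.

    Assumption: websockets >= 14 uses `additional_headers`,
    earlier releases use `extra_headers`.
    """
    major = int(version.split(".")[0])
    return "additional_headers" if major >= 14 else "extra_headers"


def connect_kwargs(version, api_key):
    """Build keyword arguments to pass along with the URI when connecting."""
    return {header_kwarg(version): {"Authorization": f"Bearer {api_key}"}}
```

With the library installed, the result could be unpacked into the call, for example websockets.connect(uri, **connect_kwargs(installed_version, API_KEY)), where installed_version comes from importlib.metadata.version("websockets") on Python 3.8 or later.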

2.2 Prepare Audio Input

Next, we'll prepare an audio file for transmission:

def prepare_audio_data(file_path):
    with open(file_path, "rb") as audio_file:
        audio_data = audio_file.read()
        # Encode audio to base64 for transmission
        encoded_audio = base64.b64encode(audio_data).decode('utf-8')
    return encoded_audio

# Usage example
audio_file_path = "test_audio.wav"
encoded_audio = prepare_audio_data(audio_file_path)
print(f"Audio prepared with {len(encoded_audio)} characters")

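The audio is base64-encoded because this tutorial sends each request as a JSON text message, and JSON cannot carry raw bytes. The WebSocket protocol itself does support binary frames, so the encoding is a requirement of the JSON request format used here, not of WebSocket.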
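Base64 makes the payload about one third larger, which matters for long recordings. The following sketch illustrates the size relationship and the lossless round trip; base64_length is an illustrative helper name, and the byte string merely stands in for real audio data.

```python
import base64
import math


def base64_length(num_bytes):
    """Number of base64 characters (with padding) for num_bytes of input."""
    return 4 * math.ceil(num_bytes / 3)


raw = bytes(range(256)) * 4  # 1024 bytes standing in for audio data
encoded = base64.b64encode(raw).decode("ascii")

# The encoded text is about 4/3 the size of the raw bytes...
assert len(encoded) == base64_length(len(raw))
# ...and decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == raw
```

For a rough estimate, a 3 MB WAV file becomes about 4 MB of text once encoded.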

3. Sending Translation Requests

3.1 Define Translation Parameters

Specify the input and output languages, along with other parameters:

def create_translation_request(input_language, output_language, audio_data):
    request = {
        "input": {
            "language": input_language,
            "audio": audio_data,
            "type": "audio"
        },
        "output": {
            "language": output_language,
            "type": "speech"
        },
        "config": {
            "enable_voice_cloning": True,
            "enable_vision_comprehension": True,
            "dynamic_keywords": []
        }
    }
    return request

This configuration requests real-time speaker voice cloning and vision-enhanced comprehension, two key features of the model. Because this example sends audio only, the vision setting has no video input to draw on.
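Before sending a request, it can help to check it locally. The sketch below is one possible sanity check, assuming language tags shaped like the "zh-CN" and "en-US" examples in this tutorial; the service may accept other forms, so treat the pattern and the helper name check_request as assumptions.

```python
import base64
import re

# Pattern for tags shaped like "zh-CN" or "en-US" (assumed from the examples).
LANG_TAG = re.compile(r"^[a-z]{2}-[A-Z]{2}$")


def check_request(request):
    """Return a list of problems found in a translation request dict."""
    problems = []
    for side in ("input", "output"):
        tag = request.get(side, {}).get("language", "")
        if not LANG_TAG.match(tag):
            problems.append(f"{side}.language looks malformed: {tag!r}")
    audio = request.get("input", {}).get("audio", "")
    try:
        # validate=True rejects characters outside the base64 alphabet
        if not base64.b64decode(audio, validate=True):
            problems.append("input.audio is empty")
    except ValueError:  # binascii.Error is a subclass of ValueError
        problems.append("input.audio is not valid base64")
    return problems
```

An empty list means the request passed these checks; it does not guarantee that the server will accept it.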

3.2 Send the Request

Modify your main function to send the translation request:

async def send_translation_request(websocket, request):
    await websocket.send(json.dumps(request))
    print("Translation request sent")
    
    # Receive the response
    response = await websocket.recv()
    print("Received response:")
    # Print only a preview; the full message may contain a long base64 audio string
    print(response[:200])
    
    # Parse the response
    response_data = json.loads(response)
    return response_data

The response will contain the translated audio output that you can save or play back.
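The function above reads exactly one message and waits indefinitely for it. A real-time service may instead stream several messages, although this tutorial does not confirm how the model delivers its output. The sketch below is a hedged alternative that collects messages until the connection goes quiet; receive_messages and FakeSocket are hypothetical names, and the stub exists only so the helper can be tried offline.

```python
import asyncio
import json


async def receive_messages(websocket, idle_timeout=5.0, max_messages=100):
    """Collect JSON messages until the stream goes quiet.

    `websocket` only needs an async recv() method. Reading stops when no
    message arrives within idle_timeout seconds or max_messages is reached.
    """
    messages = []
    while len(messages) < max_messages:
        try:
            raw = await asyncio.wait_for(websocket.recv(), timeout=idle_timeout)
        except asyncio.TimeoutError:
            break
        messages.append(json.loads(raw))
    return messages


class FakeSocket:
    """Stand-in for a real connection, used to try the helper offline."""

    def __init__(self, frames):
        self._frames = list(frames)

    async def recv(self):
        if self._frames:
            return self._frames.pop(0)
        await asyncio.sleep(3600)  # simulate a silent connection


demo = asyncio.run(
    receive_messages(FakeSocket(['{"seq": 1}', '{"seq": 2}']), idle_timeout=0.05)
)
print(demo)
```

With a real connection, the library raises its own exception when the server closes the socket, so production code would also need to handle that case.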

4. Handling the Response

4.1 Extract and Save Translated Audio

Process the received audio data and save it to a file:

def save_translated_audio(response_data, output_file_path):
    if 'output' in response_data and 'audio' in response_data['output']:
        # Decode base64 audio data
        audio_data = base64.b64decode(response_data['output']['audio'])
        
        # Save to file
        with open(output_file_path, 'wb') as audio_file:
            audio_file.write(audio_data)
        print(f"Translated audio saved to {output_file_path}")
    else:
        print("No audio data in response")

This step extracts the translated audio from the model's response and saves it for playback.
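After saving, a quick check that the file is a readable WAV can catch truncated or mislabelled output. The sketch below uses Python's standard wave module and assumes the service returns uncompressed PCM WAV data, which this tutorial does not confirm; describe_wav is an illustrative name.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a WAV file.

    wave.open raises wave.Error if the file is not a WAV it can parse.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return wav.getnchannels(), rate, frames / rate
```

For example, calling describe_wav("translated_output.wav") after a successful run would report the channel count, sample rate, and length of the translated speech.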

4.2 Complete Integration

Now, integrate all components into a complete working script:

async def main():
    # Connect to the model
    uri = "wss://qwen3-5-livetranslate-flash.cn-hangzhou.aliyuncs.com/ws"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    
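    # Note: websockets 14 and later name this parameter additional_headers;
    # older releases use extra_headers as shown here
    async with websockets.connect(uri, extra_headers=headers) as websocket: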
        print("Connected to Qwen3.5-LiveTranslate-Flash")
        
        # Prepare audio input
        encoded_audio = prepare_audio_data(audio_file_path)
        
        # Create translation request
        request = create_translation_request(
            input_language="zh-CN",
            output_language="en-US",
            audio_data=encoded_audio
        )
        
        # Send request and receive response
        response_data = await send_translation_request(websocket, request)
        
        # Save translated audio
        save_translated_audio(response_data, "translated_output.wav")

# Run the main function
asyncio.run(main())

This complete script demonstrates the full workflow of connecting to the model, sending audio input, and receiving translated output.

5. Testing and Optimization

5.1 Test with Different Languages

Try different language combinations to see how the model performs:

test_cases = [
    ("zh-CN", "en-US"),  # Chinese to English
    ("en-US", "es-ES"),  # English to Spanish
    ("ja-JP", "ko-KR"),  # Japanese to Korean
]

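# await is only valid inside a coroutine, so wrap the loop in an async
# function and call it from within the "async with" block in main().
async def run_test_cases(websocket, audio_by_language):
    # audio_by_language maps an input language code to base64 audio
    # recorded in that language, e.g. {"zh-CN": encoded_audio}
    for input_lang, output_lang in test_cases:
        if input_lang not in audio_by_language:
            print(f"Skipping {input_lang}: no recording available")
            continue
        request = create_translation_request(
            input_language=input_lang,
            output_language=output_lang,
            audio_data=audio_by_language[input_lang]
        )
        response_data = await send_translation_request(websocket, request)
        save_translated_audio(response_data, f"translated_{input_lang}_to_{output_lang}.wav")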

Testing with various language pairs will help you understand the model's capabilities and performance.
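Each language pair needs source audio that is actually spoken in the input language. One way to organise this is sketched below, assuming recordings named after the input language code (for example "zh-CN.wav"); that naming convention and the helper name build_test_plan are invented for this example.

```python
import os


def build_test_plan(audio_dir, pairs):
    """Map each (input, output) language pair to its own source recording.

    Assumes recordings are named after the input language, e.g. "zh-CN.wav".
    Pairs whose recording is missing are returned separately.
    """
    runnable, missing = [], []
    for input_lang, output_lang in pairs:
        path = os.path.join(audio_dir, f"{input_lang}.wav")
        target = runnable if os.path.isfile(path) else missing
        target.append((input_lang, output_lang, path))
    return runnable, missing
```

Reporting the missing recordings up front avoids sending a request whose audio does not match the declared input language.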

5.2 Adjust Configuration Parameters

Experiment with different configuration options:

config = {
    "enable_voice_cloning": True,
    "enable_vision_comprehension": True,
    "dynamic_keywords": ["business", "meeting", "presentation"],
    "speed": "normal",
    "pitch": "medium"
}

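These parameters allow you to customize the translation experience for specific domains or requirements. To apply them, pass this dictionary as the "config" entry of the request in place of the defaults hardcoded in create_translation_request.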
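When experimenting with many variations, a small helper that merges overrides into a set of defaults can catch misspelled keys before a request is sent. The sketch below uses only the parameter names that appear in this tutorial; whether the service accepts all of them is not verified here, and build_config is a hypothetical helper.

```python
DEFAULT_CONFIG = {
    "enable_voice_cloning": True,
    "enable_vision_comprehension": True,
    "dynamic_keywords": [],
}


def build_config(**overrides):
    """Return a copy of DEFAULT_CONFIG with the given overrides applied.

    Unknown keys raise ValueError so typos are caught locally.
    """
    allowed = set(DEFAULT_CONFIG) | {"speed", "pitch"}
    unknown = set(overrides) - allowed
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    config = {**DEFAULT_CONFIG, **overrides}
    # Copy the list so callers never share the default list object
    config["dynamic_keywords"] = list(config["dynamic_keywords"])
    return config
```

For example, build_config(dynamic_keywords=["business", "meeting"], speed="normal") yields a dictionary that can be placed under the "config" key of a request.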

Summary

In this tutorial, you've learned how to interact with Alibaba's Qwen3.5-LiveTranslate-Flash model using its WebSocket API. You've built a Python application that connects to the model, sends audio in one of the 60 supported input languages, and receives translated speech in one of the 29 supported output languages. The tutorial covered establishing WebSocket connections, preparing audio data for transmission, sending translation requests with appropriate configurations, and handling model responses. You've also learned how to test with different language combinations and customize translation parameters for domain-specific use cases.

This implementation provides a foundation for building more complex real-time translation applications that leverage Alibaba's multimodal capabilities, including speaker voice cloning and vision-enhanced comprehension features.