Alibaba's Qwen team launches Qwen3.7-Plus, a multimodal AI model on the Bailian platform, featuring vision understanding, deep reasoning, tool invocation, and autonomous iteration.
Chinese AI company MiniMax has unveiled M3, which it describes as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality, positioning it against leading proprietary models.
Learn how MiniMax M3, a new AI model, processes very large inputs and handles multiple types of data, such as text, images, and video.
Learn how to work with vision-language models like Step 3.7 Flash using Hugging Face Transformers, including multimodal input processing and mixture-of-experts (MoE) architecture concepts.
Google showcases the multimodal capabilities of Gemini Omni and Gemini 3.5 in nine demonstrations, covering how the models process multiple data types and handle complex reasoning tasks.
This explainer examines Google's Gemini Omni AI system, which combines advanced video generation, identity preservation, and natural language processing to create photorealistic video content from text prompts. We explore the technical architecture, key innovations, and broader implications of this multimodal AI platform.
Researchers have built an end-to-end multimodal RLVR (reinforcement learning with verifiable rewards) pipeline on the TuringEnterprises/Open-MM-RL dataset, combining vision-language prompting, reward scoring, and export of data for GRPO training.
Microsoft's Fara1.5 is a new family of computer-use agents that operate a web browser, navigating and interacting with web interfaces to complete complex tasks. The release shows multimodal AI systems being applied in real-world, interactive environments.
Learn what a universal AI interface is and how, by understanding multiple types of information at once, it could change the way we interact with technology.
ByteDance's Intelligent Creation Lab has released Lance, an open-source unified multimodal model capable of image and video understanding, generation, and editing in a single framework using just 3 billion parameters.
This explainer covers the AI technologies behind YouTube Shorts Remix, including multimodal modeling, video understanding, and generative synthesis.
Google has launched Gemini Omni Flash, a multimodal video-generation model with an avatar mode and SynthID watermarking enabled by default. Speech-editing features are being held back for further development.