Tag

#multimodal AI

43 articles

Thinking Machines Lab Drops Its First Model

Thinking Machines Lab launches Inkling, a 975-billion-parameter open-source model trained to understand video and audio, positioning itself against competitors like Anthropic and OpenAI.

Jul 15

Thinking Machines Lab Releases Inkling: A 975B-Parameter Open-Weights Multimodal MoE With 41B Active Parameters And Controllable Thinking Effort

Learn how to set up and run inference with the Inkling multimodal AI model from Thinking Machines Lab, including text and image processing with controllable thinking effort.

Jul 15

Building a VideoAgent-Style Multi-Agent System: Intent Parsing, Graph Planning, and Tool Routing for Video Editing Tasks

Researchers have reconstructed the VideoAgent workflow into a functional, API-key-free multi-agent system for AI-powered video editing, enabling natural language interactions and automated video processing.

Jul 13

Meta Superintelligence Labs Releases Muse Spark 1.1: A Multimodal Reasoning Model for Agentic Tasks on Meta Model API

Meta Superintelligence Labs introduces Muse Spark 1.1, a multimodal reasoning model for agentic tasks, featuring a 1,000,000-token context window and multi-agent delegation capabilities.

Jul 9

Why the next leap in AI video is teaching avatars to see and listen

The next leap in AI video is not just about improving visual fidelity, but teaching avatars to see, hear, and interact in real time. This shift is transforming how we think about digital experiences.

Jul 2

What to expect from WWDC 2026: Siri’s highly anticipated revamp and Apple Intelligence updates

An explainer on Apple's 'Apple Intelligence' framework, detailing how transformer-based architectures, multimodal processing, and privacy-preserving techniques are expected to shape AI assistants and human-computer interaction.

Jun 6

The latest AI news we announced in May 2026

Google AI announced major advancements in multimodal models, safety measures, and enterprise applications in May 2026. The company's Gemini 2.0 release represents a significant leap in AI capabilities and accessibility.

Jun 5

Google DeepMind's Gemma 4 12B squeezes multimodal AI onto a laptop with just 16 GB of RAM

Google DeepMind's Gemma 4 12B is an open-source multimodal AI model that runs efficiently on laptops with just 16 GB of RAM, nearly matching the performance of its larger 26B counterpart.

Jun 3

Alibaba’s Qwen Team Launches Qwen3.7-Plus, Adding Vision, Deep Reasoning, Tool Invocation, and Autonomous Iteration on the Bailian Platform

Alibaba's Qwen team launches Qwen3.7-Plus, a multimodal AI model on the Bailian platform, featuring vision understanding, deep reasoning, tool invocation, and autonomous iteration.

Jun 2

MiniMax M3: Open-weight model with a million-token context challenges proprietary leaders

Chinese AI company MiniMax has unveiled M3, the first open-weight model combining top-tier coding performance, a one-million-token context window, and native multimodality, challenging proprietary leaders in the AI space.

Jun 1

MiniMax Releases MiniMax M3 with MSA Architecture Supporting 1M-Token Context, Native Multimodality, and Agentic Coding

Learn how MiniMax M3, a new AI model, can process massive amounts of information and handle multiple types of data like text, images, and video.

Jun 1

StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows

Learn how to work with vision-language models like Step 3.7 Flash using Hugging Face Transformers, including multimodal input processing and MoE architecture concepts.

May 29