Tag

#benchmark

5 articles

How easily can Russian propaganda fool AI models? A new benchmark finds out

A new benchmark from the Institute of the Estonian Language evaluates how susceptible AI models are to Russian propaganda, raising important questions about AI resilience and misinformation.

Jun 16

Microsoft's MAI-Image-2.5 pulls even with Google's Nano Banana 2 on benchmarks

Microsoft's MAI-Image-2.5 ties with Google's Nano Banana 2 on Arena's leaderboard, showing significant improvements over its predecessor.

May 27

Even the best AI models lose about half their performance when charts get complicated, new benchmark finds

This article explains how AI systems struggle to convert complex charts into code: even the best models lose nearly half their performance on complicated visualizations.

Apr 18

Anthropic releases Claude Opus 4.7 with benchmark-leading coding and agentic performance

Anthropic has released Claude Opus 4.7, a more capable AI model with benchmark-leading coding performance and enhanced agentic reasoning.

Apr 16

tech

Google AI Releases Android Bench: An Evaluation Framework and Leaderboard for LLMs in Android Development

Learn how to set up and use Google's Android Bench framework to evaluate LLMs on Android development tasks, including running benchmarks and interpreting results.

Mar 6