Tag

#benchmarking

13 articles

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

Learn how to simulate and evaluate AI agent performance under varying compute budgets to better assess true capabilities, inspired by findings from the UK's AI Security Institute.

Jul 319

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

Learn to build a code agent evaluation system that detects reward hacking in benchmarking, where agents retrieve known fixes instead of deriving solutions.

Jun 26

AI search agents often confirm what they already know instead of actually researching the web

Learn to build a time-based benchmarking tool to evaluate whether AI search agents actually research the web or just confirm pre-trained knowledge.

May 30

Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field

Learn how to set up a benchmarking framework to evaluate AI coding agents like Claude Code and GPT-5.5, similar to industry benchmarks used in 2026.

May 14

Meta AI Releases NeuralBench: A Unified Open-Source Framework to Benchmark NeuroAI Models Across 36 EEG Tasks and 94 Datasets

Meta AI has launched NeuralBench, a unified open-source framework for benchmarking NeuroAI models using the largest EEG benchmark to date, covering 36 tasks and 94 datasets.

May 9

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models

Learn how to set up and run a basic benchmark test for agentic reasoning using Python and Hugging Face Transformers. This tutorial teaches you to evaluate AI agents' ability to handle real-world tasks.

Apr 25

Agent skills look great in benchmarks but fall apart under realistic conditions, researchers find

Learn how to build AI agents with modular skills using LangChain and OpenAI, and understand why these agents often fail in realistic conditions despite strong benchmark performance.

Apr 12

An Implementation Guide to Running NVIDIA Transformer Engine with Mixed Precision, FP8 Checks, Benchmarking, and Fallback Execution

This article explains how to run NVIDIA's Transformer Engine with mixed precision, FP8 checks, benchmarking, and fallback execution to optimize transformer model performance.

Apr 6

AI benchmarks systematically ignore how humans disagree, Google study finds

This article explains how ignoring disagreement among human annotators can make AI benchmark metrics unreliable, and why current practices need to evolve to account for annotation variability.

Apr 4

Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel focus on different battles

Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel pursue different strategic paths in AI hardware competition.

Apr 2

AI models confidently describe images they never saw, and benchmarks fail to catch it

AI models like GPT-5 and Gemini 3 Pro can confidently describe images they've never seen, and current benchmarks fail to detect this issue. A Stanford study highlights the dangers of AI hallucinations and calls for new evaluation methods.

Mar 30

Grok 4.20 trails Gemini and GPT-5.4 by a wide margin but sets a new record for not hallucinating

This article explains the trade-offs in AI language model performance, focusing on how models like Grok 4.20 reduce hallucinations but lag behind top-tier models in benchmarks.

Mar 12