Tag

#AI evaluation

12 articles

The Eval Stack: Proving the agents are right instead of claiming it

Sixtyfour's innovative 'Eval Stack' approach challenges traditional AI development by rigorously evaluating agents against expert benchmarks, ensuring accuracy over assumption.

Jul 10 · 14

Arena, the AI leaderboard everyone uses, is now a $100M business

AI leaderboard platform Arena has reached a $100 million valuation after just a year, transitioning from a free tool to a commercial service in September.

Jun 29 · 46

In the Weights is your new AI-centric vanity search

Learn about the In the Weights score, a novel AI evaluation metric that analyzes neural network parameters to predict model performance and optimize training.

Jun 20 · 40

research

OpenAI Releases LifeSciBench, a 750-Task Benchmark Grading AI Models on Real Life-Science Research With Expert-Written Rubric

LifeSciBench is a 750-task benchmark developed by OpenAI that evaluates AI models on real life-science research tasks, focusing on reasoning and decision-making rather than simple recall. It uses expert-authored rubrics to assess how well AI systems can handle complex scientific workflows.

Jun 17 · 49

A shared playbook for trustworthy third party evaluations

OpenAI shares a comprehensive playbook to guide third-party evaluations of advanced AI systems, focusing on capabilities, safeguards, and validity.

May 29 · 59

Anthropic's new benchmark claims Claude can match human experts in bioinformatics

This explainer explores Anthropic's BioMysteryBench, a new AI evaluation framework designed to test large language models in bioinformatics. It examines how the benchmark works, why it matters for AI development, and what it reveals about AI capabilities in specialized scientific domains.

Apr 30 · 58

Galtea raises $3.2M to help enterprises test AI agents

The $3.2M round is aimed at the gap between how AI agents perform in demos and how they perform in production.

Mar 25 · 119

Grok 4.20 trails Gemini and GPT-5.4 by a wide margin but sets a new record for not hallucinating

This article explains the trade-offs in AI language model performance, focusing on how models like Grok 4.20 reduce hallucinations but lag behind top-tier models in benchmarks.

Mar 12 · 100

Half of AI-written code that passes an industry test would get rejected by real developers, new study finds

A new study by METR reveals that nearly half of AI-generated code that passes industry benchmarks would be rejected by real developers due to quality and maintainability issues.

Mar 11 · 103

A new benchmark pits five AI models against each other as autonomous social media agents on X

AI benchmarking startup Arcada Labs is testing five leading AI models as autonomous agents on X, evaluating their real-world social media capabilities.

Feb 28 · 97

Why we no longer evaluate SWE-bench Verified

OpenAI announces it will no longer evaluate SWE-bench Verified due to contamination and data leakage issues. The organization recommends SWE-bench Pro as a replacement.

Feb 23 · 149

A Coding Guide to Instrumenting, Tracing, and Evaluating LLM Applications Using TruLens and OpenAI Models

A new tutorial from MarkTechPost demonstrates how to use TruLens and OpenAI models to build transparent and measurable evaluation pipelines for LLM applications.

Feb 23 · 120