Tag

#AI reliability

12 articles

OpenAI finds roughly 30 percent of popular AI coding test is broken

OpenAI has found that around 30% of tasks in the popular SWE-Bench Pro benchmark are broken, leading the company to withdraw its endorsement of the test.

Jul 9

Separating signal from noise in coding evaluations

OpenAI's analysis reveals significant methodological flaws in SWE-Bench Pro, a popular coding benchmark, raising concerns about the reliability of AI model evaluations.

Jul 8


More people get news from AI chatbots, but trust remains low

A new Reuters Institute report shows that 10% of people worldwide now use AI chatbots for news weekly, but only 4% regularly verify sources, highlighting a trust gap in AI-generated content.

Jun 19

Probably raises $9M to build a more reliable kind of AI

AI startup Probably raises $9M to build more reliable AI systems that prevent hallucinations and factual errors, aiming for accuracy comparable to deterministic systems.

Jun 16

KPMG pulls report on AI usage due to apparent hallucinations

Learn what AI hallucination means, how it happens, and why it matters for users and professionals. This beginner-friendly explainer covers the key concept behind recent AI reliability concerns.

Jun 13

Claude’s new model is more ‘honest’ when it messes up

Anthropic has released Claude Opus 4.8, an updated AI model focused on improved honesty and accuracy. The new version aims to reduce AI hallucinations by better acknowledging uncertainty and avoiding unsupported claims.

May 28

Anthropic’s Claude Opus 4.8 is its most honest AI model yet, and Mythos is coming in weeks

Anthropic's Claude Opus 4.8 is its most honest and reliable AI model yet, with enhanced self-correction and agentic performance. The company also announced that its next AI system, Mythos, will launch in weeks.

May 28

From LLMs to hallucinations, here’s a simple guide to common AI terms

This article explains the technical mechanisms behind hallucinations in large language models, why they occur, and their implications for AI reliability and trustworthiness.

Apr 12

AI analytics agents need guardrails, not more model size

AI analytics agents are delivering wrong answers due to lack of governance, not because models are too small. Organizations must implement better oversight to ensure accuracy.

Mar 19

When language models hallucinate, they leave "spilled energy" in their own math

Researchers at Sapienza University of Rome have found that hallucinations in large language models leave measurable traces in their computations, offering a new method for detecting false outputs.

Mar 7

Which Agent Causes Task Failures and When? Researchers from PSU and Duke explore automated failure attribution of LLM Multi-Agent Systems

Researchers from PSU and Duke University develop a framework to automatically identify which agent in an LLM multi-agent system causes task failures and when the failure occurs.

Feb 26

Retraction: After a routine code rejection, an AI agent published a hit piece on someone by name

This article explains the concept of AI content generation and the critical challenges of accountability and accuracy when AI systems publish information about real people.

Feb 25