Systematic debugging for AI agents: Introducing the Agent Rx framework - Microsoft Research
Systematic debugging for AI agents: Introducing the Agent Rx framework
By Shraddha Barke, Senior Researcher; Arnav Goyal, Research Fellow; Alind Khare, Senior Researcher; and Chetan Bansal, Senior Principal Research Manager
Key Takeaways
Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
Solution: Agent Rx pinpoints the first unrecoverable ("critical failure") step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
Benchmark + taxonomy: We release the Agent Rx Benchmark with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
Results + release: Agent Rx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.
As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.
When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.
Today, we are excited to announce the open-source release of Agent Rx, an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the Agent Rx Benchmark, a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.
Why is diagnosis so difficult? Modern agent trajectories are:
Long-horizon: Agents perform dozens of actions over extended periods.
Probabilistic: The same input can produce different outputs, making failures hard to reproduce.
Multi-agent: Failures can be “passed” between agents, masking the original root cause.
Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.
Introducing Agent Rx: An automated diagnostic “prescription”
Agent Rx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, Agent Rx uses a structured, multi-stage pipeline:
Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
Guarded evaluation: Agent Rx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.
LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
The Agent Rx workflow: Given a failed trajectory, tool schemas, and domain policy, Agent Rx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the critical failure step and root-cause category.
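The workflow above can be sketched in a few lines of Python. The class names, the example `lookup_order` tool, and the stand-in `critical_failure_step` heuristic are all illustrative, not Agent Rx’s actual API; in particular, Agent Rx uses an LLM judge rather than simply taking the earliest violation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    """One normalized step in a common intermediate representation."""
    index: int
    tool: str
    args: dict
    output: object

@dataclass
class Constraint:
    """A guarded constraint: `check` runs only when `guard` applies."""
    name: str
    guard: Callable[[Step], bool]
    check: Callable[[Step], bool]   # True means the constraint is satisfied

def validate(trajectory: list[Step], constraints: list[Constraint]) -> list[dict]:
    """Evaluate every applicable constraint at every step; log violations with evidence."""
    log = []
    for step in trajectory:
        for c in constraints:
            if c.guard(step) and not c.check(step):
                log.append({"step": step.index, "constraint": c.name,
                            "evidence": step.output})
    return log

def critical_failure_step(log: list[dict]) -> Optional[int]:
    """Stand-in for the LLM judge: pick the earliest logged violation."""
    return min((v["step"] for v in log), default=None)

# Tool-schema constraint: a hypothetical `lookup_order` tool must return a
# JSON object (here represented as a Python dict).
returns_json = Constraint(
    name="lookup_order returns JSON",
    guard=lambda s: s.tool == "lookup_order",
    check=lambda s: isinstance(s.output, dict),
)

trajectory = [
    Step(0, "lookup_order", {"id": "A1"}, {"status": "shipped"}),
    Step(1, "lookup_order", {"id": "A2"}, "ERROR: timeout"),
]
print(critical_failure_step(validate(trajectory, [returns_json])))  # → 1
```

Because every violation carries the step index and the raw evidence that triggered it, the resulting log is auditable: a developer (or the LLM judge) can inspect exactly which check failed and on what output.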
To evaluate Agent Rx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:
τ-bench: Structured API workflows for retail and service tasks.
Flash: Real-world incident management and system troubleshooting.
Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.
Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).
Example failure descriptions from the taxonomy include:
Ignored required steps, or performed extra, unplanned actions.
Malformed tool call: missing arguments or schema-invalid.
Misread tool output and acted on wrong assumptions.
Could not proceed because required information wasn’t available.
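As a toy illustration of how violation evidence might map onto taxonomy categories: only the two category names below appear in this post, and the keyword rules and `categorize` helper are hypothetical, not part of Agent Rx (which uses an LLM judge over the full nine-category taxonomy).

```python
# Partial view of the taxonomy: just the two category names mentioned in
# this post, with made-up keyword triggers for illustration.
TAXONOMY = {
    "Plan Adherence Failure": ["ignored required step", "unplanned action"],
    "Invention of New Information": ["hallucinat", "fabricat", "invented"],
}

def categorize(evidence: str) -> str:
    """Map a violation's evidence string to a taxonomy category."""
    text = evidence.lower()
    for category, keywords in TAXONOMY.items():
        if any(k in text for k in keywords):
            return category
    return "Uncategorized"

print(categorize("Agent hallucinated a tracking number absent from tool output"))
# → Invention of New Information
```

A real classifier would, of course, reason over the full validation log rather than match keywords; the point is that each category is grounded in concrete, observable evidence.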
Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but Agent Rx focuses on identifying the first critical breach.
In our experiments, Agent Rx demonstrated significant improvements over existing LLM-based prompting baselines:
+23.6% absolute improvement in failure localization accuracy.
+22.9% absolute improvement in root-cause attribution accuracy.
By providing the “why” behind a failure through an auditable log, Agent Rx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.
We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the Agent Rx framework and the complete annotated benchmark.
Read the Paper: Agent Rx: Diagnosing AI Agent Failures from Execution Trajectories
Explore the Code & Data: https://aka.ms/Agent Rx/Code
We invite researchers and developers to use Agent Rx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.
We would like to thank Avaljot Singh and Suman Nath for contributing to this project.