Systematic debugging for AI agents: Introducing the AgentRx framework - Microsoft Research



By Shraddha Barke, Senior Researcher; Arnav Goyal, Research Fellow; Alind Khare, Senior Researcher; and Chetan Bansal, Senior Principal Research Manager

Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.

Solution: AgentRx pinpoints the first unrecoverable ("critical failure") step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step by step.

Benchmark + taxonomy: We release the AgentRx Benchmark, with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.

Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.

When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.

Today, we are excited to announce the open-source release of AgentRx, an automated, domain-agnostic framework designed to pinpoint the "critical failure step" in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark, a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.

Why is this so hard? Unlike traditional software, agent trajectories are:

Long-horizon: They perform dozens of actions over extended periods.

Probabilistic: The same input might lead to different outputs, making reproduction difficult.

Multi-agent: Failures can be “passed” between agents, masking the original root cause.

Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.
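To make "unrecoverable" concrete: if each step carries a log of constraint violations, the critical failure step is simply the earliest step with a violation the agent cannot recover from. A minimal sketch, assuming a hypothetical violation-log structure with a `recoverable` flag (this is illustrative, not AgentRx's actual format):

```python
# Hypothetical violation log: step index -> list of violations found at that step.
# The "recoverable" flag marks whether the trajectory could still succeed afterward.
violations = {
    3: [{"constraint": "response_is_valid_json", "recoverable": True}],
    7: [{"constraint": "no_delete_without_confirmation", "recoverable": False}],
    9: [{"constraint": "plan_step_skipped", "recoverable": False}],
}

def critical_failure_step(violations):
    """Return the earliest step with an unrecoverable violation, or None."""
    unrecoverable = [
        step
        for step, vs in sorted(violations.items())
        if any(not v["recoverable"] for v in vs)
    ]
    return unrecoverable[0] if unrecoverable else None

print(critical_failure_step(violations))  # -> 7
```

Note that step 9 also fails; localization means attributing the failure to step 7, the first breach, rather than to later symptoms.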

Introducing AgentRx: An automated diagnostic "prescription"

AgentRx (short for "Agent Diagnosis") treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to "guess" the error, AgentRx uses a structured, multi-stage pipeline:

Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.

Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).

Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.

LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
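Concretely, a guarded constraint pairs a guard (when does the check apply?) with a check (what must hold at that step?). The sketch below illustrates stages 2–3 under assumed interfaces: the trajectory format, constraint names, and `evaluate` function are toy stand-ins, not AgentRx's actual API:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    guard: Callable[[dict], bool]  # does this constraint apply to the step?
    check: Callable[[dict], bool]  # does the step satisfy it?

def _is_json(text):
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False

# Illustrative constraints: one derived from a tool schema, one from a domain policy.
constraints = [
    Constraint(
        "response_is_valid_json",
        guard=lambda s: s.get("action") == "tool_call",
        check=lambda s: _is_json(s.get("observation", "")),
    ),
    Constraint(
        "no_delete_without_confirmation",
        guard=lambda s: s.get("tool") == "delete_record",
        check=lambda s: s.get("user_confirmed", False),
    ),
]

def evaluate(trajectory, constraints):
    """Step-by-step guarded evaluation -> auditable violation log with evidence."""
    log = []
    for i, step in enumerate(trajectory):
        for c in constraints:
            if c.guard(step) and not c.check(step):  # check only when the guard fires
                log.append({"step": i, "constraint": c.name, "evidence": step})
    return log

trajectory = [
    {"action": "think"},
    {"action": "tool_call", "tool": "get_user", "observation": '{"id": 1}'},
    {"action": "tool_call", "tool": "delete_record", "observation": "ok",
     "user_confirmed": False},
]
print(evaluate(trajectory, constraints))
```

Because each check fires only when its guard applies, the log can distinguish a constraint that never applied from one that applied and passed, and each recorded violation carries the offending step as evidence.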

The AgentRx workflow: Given a failed trajectory, tool schemas, and a domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the critical failure step and root-cause category.

To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:

τ-bench: Structured API workflows for retail and service tasks.

Flash: Real-world incident management and system troubleshooting.

Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.

Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).

Other categories cover failure modes such as:

Ignored required steps / did extra unplanned actions

Tool call malformed / missing args / schema-invalid

Read tool output incorrectly; acted on wrong assumptions

Could not proceed because required info wasn't available
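A constraint for the "malformed tool call" category, for instance, can be derived mechanically from the tool's schema. A toy sketch, where the schema format and `check_tool_call` helper are hypothetical illustrations rather than AgentRx's actual representation:

```python
# Illustrative tool schema: required argument names mapped to expected types.
schema = {"refund_order": {"order_id": str, "amount": float}}

def check_tool_call(tool, args, schema):
    """Return a list of schema violations for one tool call (empty if valid)."""
    problems = []
    spec = schema.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    for name, typ in spec.items():
        if name not in args:
            problems.append(f"missing arg: {name}")
        elif not isinstance(args[name], typ):
            problems.append(f"wrong type for {name}: expected {typ.__name__}")
    return problems

print(check_tool_call("refund_order", {"order_id": "A17"}, schema))
# -> ['missing arg: amount']
```

Each returned string doubles as evidence for the violation log, so a judge can cite exactly which argument was missing or mistyped.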

Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.

In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:

+23.6% absolute improvement in failure localization accuracy.

+22.9% absolute improvement in root-cause attribution accuracy.

By providing the "why" behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.

We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open-sourcing the AgentRx framework and the complete annotated benchmark.

Read the paper: AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Explore the code & data: https://aka.ms/AgentRx/Code

We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.

We would like to thank Avaljot Singh and Suman Nath for contributing to this project.

Related

AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Multimodal reinforcement learning with agentic verifier for AI agents

Agent Lightning: Adding reinforcement learning to AI agents without code rewrites

Debug-gym: an environment for AI coding tools to learn how to debug code like programmers

AIOps Lab: Building AI agents for autonomous clouds

