AsgardBench: A benchmark for visually grounded interactive planning - Microsoft Research

Andrea Tupini, Research Software Engineer

Lars Liden, Principal Research Software Engineer Manager

Reuben Tan, Researcher

Yu Wang, Principal RSDE

Jianfeng Gao, Technical Fellow & Corporate Vice President

To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.

AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.

Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.

Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper, “AsgardBench: Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation: the question is not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.

For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.
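
To make the idea concrete, here is a minimal Python sketch of how one instruction ("wash the mug") can map to different action sequences depending on what the agent observes. The function name `plan_wash_mug`, the action strings, and the assumption that cleaning involves toggling a faucet are all hypothetical and chosen for illustration; they are not taken from the AsgardBench code.

```python
from typing import List


def plan_wash_mug(mug_is_dirty: bool, mug_in_sink: bool) -> List[str]:
    """Return an illustrative plan for 'wash the mug' given the observed state.

    The action strings loosely follow the action names mentioned in the post
    (find, pickup, put, clean, toggle_on/off) but are hypothetical.
    """
    if not mug_is_dirty:
        # The mug is already clean, so there is nothing left to do.
        return []

    plan = ["find mug"]
    if not mug_in_sink:
        # The mug has to be moved into the sink before it can be cleaned.
        plan += ["pickup mug", "find sink", "put mug in sink"]
    # Assumed cleaning routine: run the faucet, clean, then turn it off.
    plan += ["toggle_on faucet", "clean mug", "toggle_off faucet"]
    return plan


if __name__ == "__main__":
    for dirty, in_sink in [(False, False), (True, True), (True, False)]:
        print(dirty, in_sink, plan_wash_mug(dirty, in_sink))
```

The same instruction produces an empty plan, a short plan, or a longer plan, which is why an agent that cannot read the object state from the image tends to follow the wrong sequence.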

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step is executed. It then receives new images and a simple signal indicating whether that action succeeded or failed. This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must be able to notice what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
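
The turn structure described above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the benchmark's implementation: `Observation`, `run_episode`, the status strings, and the default limits (`max_steps=20`, `max_repeats=3`) are hypothetical, and the real benchmark decides task success in the simulator rather than by an empty plan.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """What the agent sees after a step (hypothetical container)."""
    image: bytes          # stand-in for the color image(s) of the scene
    last_action_ok: bool  # the simple success/failure signal


History = List[Tuple[str, bool]]  # (attempted action, did it succeed?)
Planner = Callable[[str, Observation, History], List[str]]
Executor = Callable[[str], Observation]


def run_episode(
    task: str,
    planner: Planner,
    execute: Executor,
    first_obs: Observation,
    max_steps: int = 20,   # assumed limit on total steps
    max_repeats: int = 3,  # assumed limit on consecutive identical actions
) -> Tuple[str, History]:
    """Replan every turn, but execute only the first step of each plan."""
    history: History = []
    obs = first_obs
    for _ in range(max_steps):
        # The agent proposes a full plan from the latest observation.
        plan = planner(task, obs, history)
        if not plan:
            return "plan_exhausted", history  # the agent believes it is done
        action = plan[0]  # only the first step is executed

        # Count how many times this action was just attempted in a row.
        repeats = 0
        for past_action, _ok in reversed(history):
            if past_action != action:
                break
            repeats += 1
        if repeats >= max_repeats:
            return "repeat_limit", history

        obs = execute(action)  # new image plus success/failure flag
        history.append((action, obs.last_action_ok))
    return "step_limit", history
```

Here `planner` would wrap a vision-capable model and `execute` would wrap the simulator. Because the rest of each plan is discarded after one step, the only way to make progress is to read the new observation and replan.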

We tested several leading vision-capable models on AsgardBench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled their success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. Model performance on AsgardBench is illustrated in Figure 2.

Figure 2: Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted actions that could not be executed (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of their progress through the task from one step to the next. These failures point to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, they show where the next generation of embodied agents will need to improve.

AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.
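
As a rough illustration of the three feedback conditions, the sketch below shows one way the text returned to an agent after each step could vary. The names `FeedbackLevel` and `render_feedback`, the message format, and the example failure reason are assumptions made for this example and do not reflect the benchmark's actual interface.

```python
from enum import Enum
from typing import Optional


class FeedbackLevel(Enum):
    """Hypothetical labels for the three feedback conditions."""
    NONE = "none"
    MINIMAL = "minimal"
    DETAILED = "detailed"


def render_feedback(
    level: FeedbackLevel,
    action: str,
    succeeded: bool,
    reason: Optional[str] = None,
) -> str:
    """Build the text feedback shown to the agent after one action."""
    if level is FeedbackLevel.NONE:
        # No textual signal: the agent must rely on the new image alone.
        return ""

    outcome = "succeeded" if succeeded else "failed"
    message = f"{action}: {outcome}"

    # Only the detailed condition explains why an action failed.
    if level is FeedbackLevel.DETAILED and not succeeded and reason:
        message += f" ({reason})"
    return message


if __name__ == "__main__":
    for level in FeedbackLevel:
        print(level.value, repr(render_feedback(
            level, "clean mug", False, "mug is not in the sink")))
```

Running the same agent under each level and comparing success rates is one way to separate what it gains from reading the scene from what it gains from being told what went wrong.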

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub, providing a foundation for advancing research in visually grounded planning.

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.
