
AsgardBench: A benchmark for visually grounded interactive planning - Microsoft Research

AsgardBench evaluates whether embodied agents can revise their plans based on visual observations as tasks unfold. By focusing on perception-driven planning,...



By

Andrea Tupini, Research Software Engineer
Lars Liden, Principal Research Software Engineer Manager
Reuben Tan, Researcher
Yu Wang, Principal RSDE
Jianfeng Gao, Technical Fellow & Corporate Vice President

To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.

AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.

Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.

Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper, “AsgardBench: Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice whether the mug it needs to clean is already in the sink, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation: the question is not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.

For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.
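As a rough illustration of how one instruction can lead to different action sequences, the sketch below maps an observed mug state to the remaining steps for a "clean the mug" task. The `MugObservation` class, the `plan_clean_mug` function, and the action string format (for example `"put Sink"`) are assumptions made for this example and are not taken from the AsgardBench code; only the action names find, pickup, put, and clean come from the text above.

```python
from dataclasses import dataclass


@dataclass
class MugObservation:
    """Hypothetical summary of what the agent sees (not an AsgardBench type)."""
    is_dirty: bool
    in_sink: bool


def plan_clean_mug(obs: MugObservation) -> list[str]:
    """Return the remaining steps for 'clean the mug' given the observed state."""
    if not obs.is_dirty:
        return []  # the mug is already clean, so nothing is left to do
    steps: list[str] = []
    if not obs.in_sink:
        # Assumed prerequisite: the mug must be in the sink before cleaning.
        steps += ["find Mug", "pickup Mug", "find Sink", "put Sink"]
    steps.append("clean Mug")
    return steps


if __name__ == "__main__":
    for obs in (MugObservation(True, False),
                MugObservation(True, True),
                MugObservation(False, False)):
        print(obs, "->", plan_clean_mug(obs))
```

The same instruction yields a five-step plan, a one-step plan, or an empty plan, depending only on what is observed.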

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.
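The three inputs listed above could be bundled roughly as follows. This is a minimal sketch: the class names, field names, and the text rendering of the history are hypothetical and do not reflect the benchmark's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class AttemptedAction:
    action: str      # e.g. "pickup Mug" (string format is an assumption)
    succeeded: bool  # the simple success or failure signal


@dataclass
class AgentInput:
    images: list[bytes]  # encoded color frames from the current turn
    history: list[AttemptedAction] = field(default_factory=list)
    previous_plan: list[str] = field(default_factory=list)  # agent's own record


def render_history(history: list[AttemptedAction]) -> str:
    """Format the action history as numbered lines for a model prompt."""
    return "\n".join(
        f"{i}. {item.action} -> {'success' if item.succeeded else 'failure'}"
        for i, item in enumerate(history, start=1)
    )


if __name__ == "__main__":
    turn = AgentInput(
        images=[b""],  # placeholder bytes; a real frame would go here
        history=[AttemptedAction("find Mug", True),
                 AttemptedAction("clean Mug", False)],
        previous_plan=["clean Mug"],
    )
    print(render_history(turn.history))
```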

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step is executed. It then receives new images and a simple signal indicating whether that action succeeded or failed. This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must notice the relevant details in what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
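The turn structure described here can be sketched as a short loop. Everything in the example is an assumption for illustration: `ToyKitchen` is a two-variable stand-in for the simulator, its `observe` method returns a dictionary in place of images, `scripted_agent` replaces a vision model, and the limits of 10 steps and 3 repeats are placeholder values rather than the benchmark's settings.

```python
from collections import Counter


class ToyKitchen:
    """Hypothetical stand-in for the simulator; not the AsgardBench API."""

    def __init__(self):
        self.mug_in_sink = False
        self.mug_dirty = True

    def observe(self):
        # Stands in for the color images a real agent would receive.
        return {"mug_in_sink": self.mug_in_sink, "mug_dirty": self.mug_dirty}

    def execute(self, action):
        """Apply one action and return only a success/failure flag."""
        if action == "put Sink":
            self.mug_in_sink = True
            return True
        if action == "clean Mug" and self.mug_in_sink:
            self.mug_dirty = False
            return True
        return False

    def done(self):
        return not self.mug_dirty


def scripted_agent(observation, history):
    """Propose a full plan from the current observation."""
    plan = []
    if not observation["mug_in_sink"]:
        plan.append("put Sink")
    if observation["mug_dirty"]:
        plan.append("clean Mug")
    return plan


def run_episode(env, agent, max_steps=10, max_repeats=3):
    history = []
    repeats = Counter()
    for _ in range(max_steps):
        if env.done():
            return "success", history
        plan = agent(env.observe(), history)  # full plan is proposed...
        if not plan:
            return "empty_plan", history
        action = plan[0]                      # ...but only step one runs
        repeats[action] += 1
        if repeats[action] > max_repeats:
            return "repeat_limit", history
        history.append((action, env.execute(action)))
    return ("success" if env.done() else "step_limit"), history


if __name__ == "__main__":
    print(run_episode(ToyKitchen(), scripted_agent))
```

Because the plan is rebuilt from a fresh observation on every turn, an agent that ignores its observations (for example, one that always proposes `["clean Mug"]`) runs into the repeat limit instead of finishing the task.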

We tested several leading vision-capable models on AsgardBench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. Model performance on AsgardBench is illustrated in Figure 2.

Figure 2. Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.
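A comparison like the one in Figure 2 reduces to per-condition success rates and their ratio. The helper below is a generic sketch of that arithmetic; the function names and the condition labels `"image"` and `"text"` are assumptions, and no benchmark results are included.

```python
from collections import defaultdict


def success_rates(outcomes):
    """Map each condition to its success rate.

    outcomes: iterable of (condition, succeeded) pairs, one per episode.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for condition, succeeded in outcomes:
        totals[condition] += 1
        wins[condition] += int(succeeded)
    return {condition: wins[condition] / totals[condition] for condition in totals}


def improvement_ratio(rates, with_images="image", without_images="text"):
    """Ratio of image-based to text-only success; above 2.0 means 'more than doubled'."""
    base = rates[without_images]
    if base == 0:
        return float("inf")
    return rates[with_images] / base
```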

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted actions that could not be executed (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of their progress through the task from one step to the next. This points to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, these indicate where the next generation of embodied agents will need to improve.
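One of these failure patterns, repeated action loops, is straightforward to flag from an action log. The function below is one possible heuristic, assuming a log of `(action, succeeded)` pairs; it is not the analysis method used in the paper.

```python
def longest_failed_repeat(history):
    """Length of the longest run of the same action failing back to back.

    history: list of (action, succeeded) pairs in execution order.
    """
    best = 0
    run = 0
    previous_failed_action = None
    for action, succeeded in history:
        if succeeded:
            run = 0
            previous_failed_action = None
        elif action == previous_failed_action:
            run += 1
        else:
            run = 1
            previous_failed_action = action
        best = max(best, run)
    return best


if __name__ == "__main__":
    log = [("clean Mug", False), ("clean Mug", False),
           ("find Mug", True), ("clean Mug", False)]
    print(longest_failed_repeat(log))
```

A long run suggests the agent is not using the failure signal or the new images to change its plan.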


AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.
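The three feedback settings could be modeled as an enumeration that controls how much of each action result reaches the agent. This is a sketch under assumptions: the `FeedbackLevel` enum, the `format_feedback` function, and the message format are hypothetical, not AsgardBench's configuration interface.

```python
from enum import Enum


class FeedbackLevel(Enum):
    NONE = "none"          # agent sees only the action it attempted
    MINIMAL = "minimal"    # adds a success/failure flag
    DETAILED = "detailed"  # adds a textual reason on failure


def format_feedback(level, action, succeeded, reason=""):
    """Render one history entry at the requested feedback level."""
    if level is FeedbackLevel.NONE:
        return action
    status = "success" if succeeded else "failure"
    if level is FeedbackLevel.MINIMAL or succeeded or not reason:
        return f"{action}: {status}"
    return f"{action}: {status} ({reason})"


if __name__ == "__main__":
    for level in FeedbackLevel:
        print(level.value, "->",
              format_feedback(level, "clean Mug", False, "mug is not in the sink"))
```

Running the same agent under each level and comparing success rates is one way to separate gains that come from perception from gains that come from being told what went wrong.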

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub, providing a foundation for advancing research in visually grounded planning.

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.

AsgardBench: Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

