Asgard Bench: A benchmark for visually grounded interactive planning - Microsoft Research
Asgard Bench: A benchmark for visually grounded interactive planning
By Andrea Tupini, Research Software Engineer; Lars Liden, Principal Research Software Engineer Manager; Reuben Tan, Researcher; Yu Wang, Principal RSDE; Jianfeng Gao, Technical Fellow & Corporate Vice President
To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
Asgard Bench isolates whether agents can use visual observations to revise their plans as tasks unfold.
Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.
Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.
The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.
To address this, we created Asgard Bench. In the paper, “Asgard Bench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question Asgard Bench is designed to answer.
Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, Asgard Bench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation: the question is not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.
For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.
Figure 1: Agent observations and corresponding action plans in Asgard Bench. Each image is paired with the plan generated from that observation. This illustrates how Asgard Bench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.
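To make this state dependence concrete, here is a minimal sketch of how a single instruction ("clean the mug") could map to different action sequences. Only the action names (find, pickup, put, clean, toggle_on/off) come from the description above; the `MugState` fields, the `plan_clean_mug` helper, and the exact ordering of steps are illustrative assumptions, not the benchmark's API or task logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MugState:
    # Hypothetical summary of what an agent inferred from the images.
    # The benchmark itself supplies images, not structured state.
    dirty: bool
    in_sink: bool


def plan_clean_mug(state: MugState) -> list[str]:
    """Return an illustrative plan for 'clean the mug' given the observed state."""
    if not state.dirty:
        return []  # assumption: an already-clean mug needs no further steps
    steps = ["find mug"]
    if not state.in_sink:
        # The mug must be moved to the sink before it can be cleaned.
        steps += ["pickup mug", "find sink", "put mug sink"]
    # Assumed ordering: run the faucet, clean, then turn the faucet off.
    steps += ["toggle_on faucet", "clean mug", "toggle_off faucet"]
    return steps
```

The same instruction yields an empty plan, a four-step plan, or a seven-step plan depending on what the agent observes, which is why a fixed script cannot solve the task.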
Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.
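The three inputs listed above can be pictured as a small record passed to the agent at each turn. This is a sketch under stated assumptions: the class names, field names, and the `history_text` helper are hypothetical and chosen for illustration, not taken from the Asgard Bench code.

```python
from dataclasses import dataclass, field


@dataclass
class ActionRecord:
    action: str      # e.g. "pickup mug"
    succeeded: bool  # the simple success/failure signal


@dataclass
class AgentInput:
    images: list[bytes]  # encoded color frames (hypothetical representation)
    history: list[ActionRecord] = field(default_factory=list)
    remaining_plan: list[str] = field(default_factory=list)  # agent's own record

    def history_text(self) -> str:
        """Render the action history as numbered lines for a model prompt."""
        return "\n".join(
            f"{i + 1}. {r.action}: {'success' if r.succeeded else 'failure'}"
            for i, r in enumerate(self.history)
        )
```

Note that nothing in this record says why an action failed or what state an object is in; that information has to come from the images.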
At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step executes. It then receives new images and a simple signal: did that action succeed or fail? This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must notice the relevant details in what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
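The turn structure described above can be sketched as a short loop. This is an illustrative sketch, not the benchmark's implementation: `run_episode`, its parameters, the default limits, and the convention that an empty plan means the agent considers the task finished are all assumptions. Image observations are omitted for brevity; in practice they would be passed to the planner alongside the history.

```python
from typing import Callable

History = list[tuple[str, bool]]  # (action, succeeded) pairs


def run_episode(
    plan_fn: Callable[[History], list[str]],  # agent: history -> full plan
    execute_fn: Callable[[str], bool],        # environment: action -> success?
    max_steps: int = 20,    # placeholder limit on total steps
    max_repeats: int = 3,   # placeholder limit on consecutive identical actions
) -> tuple[str, History]:
    history: History = []
    last_action, repeats = None, 0
    for _ in range(max_steps):
        plan = plan_fn(history)  # the agent re-plans from scratch every turn
        if not plan:
            return "done", history  # assumption: empty plan signals completion
        action = plan[0]  # only the first step of the plan is executed
        repeats = repeats + 1 if action == last_action else 1
        if repeats > max_repeats:
            return "repeat_limit", history
        last_action = action
        history.append((action, execute_fn(action)))
    return "step_limit", history
```

Because the rest of each proposed plan is discarded, any benefit from planning ahead has to survive re-evaluation against the next observation.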
We tested several leading vision-capable models on Asgard Bench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.
Providing that kind of detailed failure information raises performance for all models in Asgard Bench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. Success rates for each condition are shown in Figure 2.
Figure 2: Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that Asgard Bench requires perception-based reasoning.
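For readers who want to run this kind of comparison on their own results, the following sketch shows how a per-condition success rate and the "more than doubled" check could be computed. The function names are hypothetical, and the sketch assumes each episode is recorded as a single pass/fail boolean; no benchmark numbers are reproduced here.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of episodes that succeeded (one boolean per episode)."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return sum(outcomes) / len(outcomes)


def more_than_doubled(with_images: list[bool], text_only: list[bool]) -> bool:
    """True if the image condition's success rate exceeds twice the text-only rate."""
    return success_rate(with_images) > 2 * success_rate(text_only)
```

With made-up placeholder outcomes such as three successes out of four with images and one out of four without, `more_than_doubled` returns `True`, since 0.75 exceeds twice 0.25.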
The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted infeasible actions (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of task progress from one step to the next. This points to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, these weaknesses show where the next generation of embodied agents will need to improve.
Asgard Bench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.
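One way to picture the feedback ablation is as a switch that controls how much of each action's outcome is shown to the agent. The three level names follow the text above; the `FeedbackLevel` enum, the `render_feedback` helper, and the output format are illustrative assumptions rather than the benchmark's interface.

```python
from enum import Enum


class FeedbackLevel(Enum):
    NONE = "none"          # agent sees no outcome signal
    MINIMAL = "minimal"    # agent sees success/failure only
    DETAILED = "detailed"  # agent also sees why an action failed


def render_feedback(level: FeedbackLevel, succeeded: bool, reason: str = "") -> str:
    """Return the feedback string the agent would see for one executed action."""
    if level is FeedbackLevel.NONE:
        return ""
    status = "success" if succeeded else "failure"
    if level is FeedbackLevel.DETAILED and not succeeded and reason:
        return f"{status}: {reason}"
    return status
```

Running the same agent under each level and comparing outcomes helps indicate whether it depends on being told what went wrong or can recover that information from the images.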
The failure patterns Asgard Bench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.
Asgard Bench is open source and available on GitHub, providing a foundation for advancing research in visually grounded planning.
We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.
Asgard Bench— Evaluating Visually Grounded Interactive Planning Under Minimal Feedback