ADeLe: Predicting and explaining AI performance across tasks - Microsoft Research
By Lexin Zhou, former Research Assistant, and Xing Xie, Assistant Managing Director
Key Takeaways
- AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
- Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
- It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
- By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.
AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.
In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.
ADeLe scores tasks across 18 core abilities, such as attention, reasoning, and domain knowledge, assigning each task a value from 0 to 5 based on how much it requires each ability. For example, a basic arithmetic problem might score low on quantitative reasoning, while an Olympiad-level proof would score much higher.
Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.
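The profile-versus-demand comparison described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual rubric: the ability names and the 0–5 values below are made up, and only the idea of flagging abilities where demand exceeds the model's measured level follows the article.

```python
# Hypothetical sketch of ADeLe-style gap diagnosis. Profiles map ability
# names to scores on the article's 0-5 scale; the specific abilities and
# numbers here are illustrative only.

def find_gaps(model_profile, task_demands):
    """Return abilities where the task demands more than the model provides,
    as {ability: (model_score, task_demand)}."""
    return {
        ability: (model_profile.get(ability, 0), demand)
        for ability, demand in task_demands.items()
        if model_profile.get(ability, 0) < demand
    }

model_profile = {"quantitative_reasoning": 3.2, "domain_knowledge": 4.1, "attention": 2.5}
task_demands = {"quantitative_reasoning": 4.0, "domain_knowledge": 2.0}

print(find_gaps(model_profile, task_demands))
# {'quantitative_reasoning': (3.2, 4.0)}
```

Any ability returned by `find_gaps` is a candidate explanation for failure on that task: the model's measured level falls short of what the task requires.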
Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.
Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.
ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.
Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.
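The 50%-success rule above can be made concrete. The sketch below finds the difficulty level where a model's success rate crosses 50% by linear interpolation between observed levels; the paper fits a curve to do this, and the observed success rates here are invented for illustration.

```python
# Minimal sketch of turning success-vs-difficulty data into an ability
# score: the score is the difficulty level at which success rate crosses
# 50%. Linear interpolation stands in for the paper's curve fit, and the
# data below is made up.

def ability_score(success_by_level):
    """success_by_level: list of (difficulty, success_rate), sorted by difficulty."""
    for (d0, p0), (d1, p1) in zip(success_by_level, success_by_level[1:]):
        if p0 >= 0.5 > p1:  # success rate falls through 50% in this interval
            return d0 + (p0 - 0.5) * (d1 - d0) / (p0 - p1)
    # never drops below 50%: credit the hardest level attempted
    return success_by_level[-1][0]

observed = [(0, 0.98), (1, 0.95), (2, 0.80), (3, 0.55), (4, 0.30), (5, 0.10)]
print(round(ability_score(observed), 2))  # 3.2
```

Repeating this per ability yields the 18 numbers that make up one radial plot in Figure 2.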
Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.
This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.
ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and Llama-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.
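One simple way to turn a profile-versus-demand comparison into a forecast is to model each per-ability outcome as a logistic function of the margin between ability and demand, and combine them. The sketch below is our own toy predictor in that spirit, not the paper's method: the slope `k`, the independence assumption across abilities, and all the numbers are ours.

```python
import math

# Toy predictor in the spirit of ADeLe's demand-vs-ability comparison.
# Per ability, the chance of meeting a demand is modeled as a logistic
# function of (ability - demand); the task succeeds only if every demand
# is met, treated here as independent. k and the profiles are assumptions.

def p_success(model_profile, task_demands, k=1.5):
    p = 1.0
    for ability, demand in task_demands.items():
        margin = model_profile.get(ability, 0.0) - demand
        p *= 1.0 / (1.0 + math.exp(-k * margin))
    return p

profile = {"reasoning": 3.5, "domain_knowledge": 4.0}
print(round(p_success(profile, {"reasoning": 2.0, "domain_knowledge": 2.0}), 3))  # easy task: high
print(round(p_success(profile, {"reasoning": 5.0, "domain_knowledge": 5.0}), 3))  # hard task: low
```

Thresholding this probability at 0.5 gives a binary succeed/fail prediction of the kind evaluated at ~88% accuracy in the paper.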
Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show reasoning breaking down as tasks grow harder. These results reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.
Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models, not only in logic and mathematics but also in interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.
ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.
More broadly, it advances a more systematic approach to AI evaluation, one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, underscoring the importance of AI evaluation.
As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub.