Xiaomi's Harness X rewrites its own AI scaffolding mid-task, and smaller models gain the most | VentureBeat
Overview
As enterprise AI agents take on increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone LLM to its environment.
Details
Today, most harnesses are static and hand-crafted. Improving them takes manual engineering work, and they do not learn from the execution data they collect from their environment.
To address this engineering bottleneck, researchers at Xiaomi introduced Harness X, a framework that treats the AI harness as a composable object and autonomously applies improvements to its code.
In real-world enterprise applications, this automated adaptation enables AI systems to dynamically adjust to application-specific requirements. Practical tests showed Harness X delivering substantial performance gains across domains like software engineering and web interaction.
The results demonstrate that scaling the foundation model is not the only path to more capable AI — and for smaller models, it may not even be the best one. Harness X's harness evolution yielded an average +14.5% performance gain across 15 model-benchmark combinations; for the open-weight Qwen 3.5-9B, gains reached +44% on embodied planning tasks.
In AI applications, a foundation model's capability relies heavily on its surrounding harness. The harness acts as the operational layer that converts raw model outputs into structured, executable agent behaviors. It comprises the prompts, external tool integrations, memory management, and control flows that dictate how an AI system observes its environment, reasons through a problem, and takes action.
As enterprise agents take on more complex, long-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development remains far from a mature engineering discipline and presents three key challenges.
First, harnesses are static and hand-engineered. Any shift in the underlying foundation model, the introduction of new tools, or a pivot to a different operational domain requires bespoke, manual code rewrites. Traditional harnesses lack mechanisms to autonomously learn and improve from past execution experiences.
Second, most existing harnesses suffer from architectural entanglement. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that tweaking one component can silently break others. Attempting to reuse a harness across different business domains often devolves into raw code copying rather than clean, modular composition.
Third, the harness and foundation model are optimized in isolation. When engineers run tests to improve the harness, the execution traces generated are typically discarded rather than used as training data to improve the model. Consequently, model upgrades do not naturally lead to harness improvements, creating a bottleneck where teams fail to capture the full value of their agent's operational data.
Harness X solves the engineering bottlenecks of manual harness development with what the researchers call a “unified harness foundry.”
The core innovation of Harness X is treating the harness as a "first-class object". In software engineering terms, this means the harness is an independently serializable, modular, and substitutable entity. By separating the model configuration (i.e., which AI model is operating) from the harness configuration, engineers can seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model.
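To make the idea of a serializable, substitutable harness concrete, here is a minimal Python sketch. It is not Harness X's actual data model: the class names (`ModelConfig`, `HarnessConfig`, `AgentSpec`) and every field are assumptions chosen for illustration. It only shows that when the two configurations live in separate objects, the harness can be versioned, saved, and swapped while the model entry stays untouched.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Which backbone model runs the agent (hypothetical fields)."""
    name: str
    temperature: float = 0.0


@dataclass(frozen=True)
class HarnessConfig:
    """The scaffolding, described independently of any model (hypothetical fields)."""
    version: int
    system_prompt: str
    tools: Tuple[str, ...] = ()
    max_steps: int = 20

    def to_json(self) -> str:
        # Tuples are written out as JSON arrays.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "HarnessConfig":
        data = json.loads(text)
        data["tools"] = tuple(data["tools"])  # restore the tuple type
        return cls(**data)


@dataclass(frozen=True)
class AgentSpec:
    """An agent is a pairing of one model config and one harness config."""
    model: ModelConfig
    harness: HarnessConfig


harness_v1 = HarnessConfig(version=1, system_prompt="You are a careful agent.", tools=("browser",))
agent_v1 = AgentSpec(model=ModelConfig(name="small-open-weight-model"), harness=harness_v1)

# Evolve only the harness; the model configuration is reused as-is.
harness_v2 = replace(harness_v1, version=2, tools=("browser", "wiki_api"))
agent_v2 = replace(agent_v1, harness=harness_v2)

# The harness survives a serialization round trip on its own.
restored = HarnessConfig.from_json(harness_v2.to_json())
```

Frozen dataclasses are used here so that each harness version is an immutable value, which makes it easy to keep older versions around for comparison.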
Harness X breaks agent behavior down into different components, such as context assembly, memory management, tool ecosystems, control flow, and observability. Every specific behavior is implemented as a "processor" that plugs into precise lifecycle hooks of the harness. This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline.
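The processor-and-hook structure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the framework's API: the hook names, the `Harness` class, and the two example processors are all hypothetical. The point is that each behavior is a small function registered at a named hook, so one can be added or removed without editing the others.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Processor = Callable[[Context], Context]


class Harness:
    """Minimal hook registry; the hook names are illustrative assumptions."""

    HOOKS = ("before_model_call", "after_model_call", "after_tool_call")

    def __init__(self) -> None:
        self._processors: Dict[str, List[Processor]] = defaultdict(list)

    def add(self, hook: str, processor: Processor) -> None:
        if hook not in self.HOOKS:
            raise ValueError(f"unknown hook: {hook}")
        self._processors[hook].append(processor)

    def remove(self, hook: str, processor: Processor) -> None:
        self._processors[hook].remove(processor)

    def run(self, hook: str, context: Context) -> Context:
        # Processors run in registration order; each returns a new context.
        for processor in self._processors[hook]:
            context = processor(context)
        return context


def add_memory(context: Context) -> Context:
    """Hypothetical memory processor: attaches a summary of earlier steps."""
    return {**context, "memory": ["previous step summary"]}


def truncate_prompt(context: Context) -> Context:
    """Hypothetical context-assembly processor: caps the prompt length."""
    return {**context, "prompt": context["prompt"][:40]}


harness = Harness()
harness.add("before_model_call", add_memory)
harness.add("before_model_call", truncate_prompt)
with_memory = harness.run("before_model_call", {"prompt": "x" * 100})

# Removing one processor leaves the rest of the pipeline working.
harness.remove("before_model_call", add_memory)
without_memory = harness.run("before_model_call", {"prompt": "x" * 100})
```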
To automate the optimization of this modular structure, Harness X introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning (RL) problem over the different symbolic components of the harness.
Framing harness optimization as a reinforcement learning problem introduces three pathologies the researchers had to explicitly engineer against:
Reward hacking: The system might exploit shortcuts to the solution instead of genuinely solving the task.
Catastrophic forgetting: An edit that fixes a failure pattern in one domain might silently break a previously solved workflow in another.
Under-exploration: The system might iterate on minor prompt tweaks rather than exploring new, structurally superior tool configurations.
To prevent these problems, AEGIS relies on full trace observability and a four-stage pipeline:
Digester: Compresses execution traces into structured summaries to identify where the agent failed.
Planner: Analyzes these summaries to enable the system to explore structural changes rather than just local prompt tweaks.
Evolver: Generates code-level harness edits and tests to ensure they run correctly before deployment.
Critic and gate: A Critic assesses the edits to detect reward hacking, while a deterministic gate rejects any update that regresses a previously solved task to prevent catastrophic forgetting.
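Of the four stages, the deterministic gate is the simplest to illustrate. The sketch below assumes task results are recorded as a mapping from task ID to a solved/unsolved flag. That representation and the function name `gate_accepts` are assumptions, since the article does not describe how the gate is implemented. The rule it encodes is the one described above: an edit is rejected if any task that was solved before the edit is no longer solved after it.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return the tasks that were solved before the edit but not after it."""
    return [
        task
        for task, solved in before.items()
        if solved and not after.get(task, False)  # a missing result counts as unsolved
    ]


def gate_accepts(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Deterministic gate: accept an edit only if it causes no regressions."""
    return not find_regressions(before, after)


baseline = {"task_1": True, "task_2": False, "task_3": True}

# This candidate edit fixes task_2 and keeps everything else solved.
clean_edit = {"task_1": True, "task_2": True, "task_3": True}

# This candidate edit also fixes task_2 but silently breaks task_3.
forgetful_edit = {"task_1": True, "task_2": True, "task_3": False}

accepted_clean = gate_accepts(baseline, clean_edit)
accepted_forgetful = gate_accepts(baseline, forgetful_edit)
```

Because the check is a plain comparison rather than a model judgment, the same inputs always produce the same decision, which is what makes it a dependable guard against catastrophic forgetting.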
Harness X enters a growing field of self-improving harness research — but what separates it is harness-model co-evolution.
The researchers highlight that optimizing either component in isolation eventually hits a wall. Evolving only the harness hits a scaffolding ceiling if the underlying model lacks the reasoning capacity to use the new tools. Training only the model hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.
Harness X interleaves harness evolution with model training. The execution traces generated while the harness attempts to adapt to tasks are converted into reinforcement learning signals for the foundation model. Every time the harness improves its strategy, the model simultaneously learns to better exploit that new strategy, breaking the capability ceilings of traditional AI agent development.
Harness X makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization). GRPO is the popular RL algorithm used to train reasoning models such as DeepSeek-R1.
When fine-tuning the model, cross-harness GRPO pools an agent's execution trajectories for the same task across entirely different versions of the application's harnesses. This allows the underlying model to internalize high-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than just learning minor prompt-phrasing variations.
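The pooling step can be sketched as follows. In standard GRPO, each trajectory's advantage is its reward minus the mean reward of its group, divided by the group's standard deviation. The sketch assumes the cross-harness variant simply defines the group as all trajectories for the same task, whichever harness version produced them. The article does not give the exact formulation, so the `Trajectory` fields and the function name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    harness_version: int
    reward: float


def cross_harness_advantages(trajectories: List[Trajectory], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages, grouping by task across all harness versions."""
    rewards_by_task: Dict[str, List[float]] = defaultdict(list)
    for traj in trajectories:
        rewards_by_task[traj.task_id].append(traj.reward)

    # One (mean, standard deviation) pair per task, shared by every harness version.
    stats = {task: (mean(r), pstdev(r)) for task, r in rewards_by_task.items()}

    return [
        (traj.reward - stats[traj.task_id][0]) / (stats[traj.task_id][1] + eps)
        for traj in trajectories
    ]


# Same task, two harness versions: the old harness fails, the new one succeeds.
pooled = [
    Trajectory("lookup_task", harness_version=1, reward=0.0),
    Trajectory("lookup_task", harness_version=1, reward=0.0),
    Trajectory("lookup_task", harness_version=2, reward=1.0),
    Trajectory("lookup_task", harness_version=2, reward=1.0),
]
advantages = cross_harness_advantages(pooled)
```

If the groups were formed per harness version instead, every trajectory in this example would match its group mean and receive an advantage of zero. Pooling across versions is what lets the difference between the old and new strategy show up in the training signal.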
To validate the practical utility of Harness X, the researchers tested it across five benchmarks comprising software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning.
They separated the AI into two roles. The “meta-agent,” powered by Claude Opus 4.6, analyzed logs and wrote the code to evolve the harnesses. The “task agents” ran the actual workflows. To prove the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen 3.5-9B.
Harness X improves agent performance on key industry benchmarks without changing the underlying model (source: arXiv)
Harness X was compared against two primary baselines. The first was a static harness, representing how most enterprises deploy AI today: a hand-crafted, frozen setup with benchmark-specific prompts and tools. The second was the Claude Code SDK, used as a single-agent evolver baseline to test whether the four-stage AEGIS pipeline outperforms simply asking a single language model to iterate on the code.
Dynamically evolving the harness yields significant gains on the same base model. Harness X improved performance in 14 out of 15 model-benchmark combinations. Across all tests, evolving the harness yielded an average absolute performance gain of +14.5%.
The weakest models benefited the most from dynamic harness improvement. The open-weight Qwen 3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark, and an +18.2% jump on SWE-bench Verified for software engineering.
Co-evolution also proved highly effective. When the researchers trained the foundation model on the data generated while evolving the harness, they saw an additional +4.7% average performance boost, which suggests that improving the harness and the model together yields the highest ceiling. This gain applies only to open-weight models, because fine-tuning requires access to the model's weights.
Anecdotal evidence from the experiments shows how Harness X solves pernicious problems when creating agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool it used to scrape Wikipedia timed out on the site's JavaScript-heavy frontend. Harness X analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It swapped this tool into the harness and instantly unlocked the failing tasks.
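The article does not show the tool Harness X generated, but the general approach of asking the MediaWiki API for plain text instead of rendering the page can be sketched. The example below only builds the request URL, so it runs without network access. It uses the `extracts` property with `explaintext`, which Wikipedia exposes through its TextExtracts extension. Whether the generated tool used these exact parameters is an assumption, and `build_extract_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# English Wikipedia's MediaWiki API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build a MediaWiki API URL that requests a page's text without HTML."""
    params = {
        "action": "query",
        "prop": "extracts",   # provided by the TextExtracts extension
        "explaintext": 1,     # plain text instead of rendered HTML
        "titles": title,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_extract_url("Alan Turing")
query = parse_qs(urlsplit(url).query)
```

A tool wrapper would then fetch this URL with an ordinary HTTP client and read the `extract` field from the JSON response, which avoids the headless browser and its JavaScript rendering entirely.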
During the WebShop e-commerce tests, the AI agent often got stuck in pagination loops, endlessly clicking "next page" and reformulating searches without ever committing to buying a product. Rather than just tweaking the prompt, Harness X built an advisory processor that detected when the agent was repeating navigation actions. It injected a warning into the context to force a decision, curing the looping behavior and raising performance.
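An advisory processor of this kind might look like the sketch below. It is an illustration, not the processor Harness X produced: the action names, the window size, the context keys (`last_action`, `advisories`), and the warning text are all assumptions. The processor keeps a short history of recent actions and, once every action in the window is a navigation action, appends a warning to the context that the model will see on its next turn.

```python
from collections import deque
from typing import Any, Deque, Dict

# Hypothetical action names; a real environment defines its own.
NAVIGATION_ACTIONS = {"next_page", "prev_page", "search"}

ADVISORY = "You have only navigated for several steps. Choose a product and buy it."


class LoopAdvisor:
    """Adds a warning when the last `window` actions were all navigation."""

    def __init__(self, window: int = 4) -> None:
        self.window = window
        self._recent: Deque[str] = deque(maxlen=window)

    def __call__(self, context: Dict[str, Any]) -> Dict[str, Any]:
        self._recent.append(context["last_action"])
        stuck = len(self._recent) == self.window and all(
            action in NAVIGATION_ACTIONS for action in self._recent
        )
        if not stuck:
            return context
        # Advisory only: the agent's action is not overridden, just annotated.
        advisories = list(context.get("advisories", [])) + [ADVISORY]
        return {**context, "advisories": advisories}


advisor = LoopAdvisor(window=3)
actions = ["search", "next_page", "next_page", "click_item"]
outputs = [advisor({"last_action": action}) for action in actions]
```

In this run, the warning appears only on the third step, after three navigation actions in a row, and disappears again once the agent clicks an item.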
One important caveat is that the system currently relies on powerful models to act as the meta-agent that rewrites the harness code. In their experiments, the researchers relied on closed frontier models like Claude Opus. Open-weight models are quickly improving, but their ability to serve as the meta-agent remains untested.
Another limitation is the intrinsic capability of the task model. If the underlying task model is fundamentally too weak to execute the complex workflows the new harness proposes, Harness X will not be able to improve the agent’s overall abilities (the researchers observed this with the Qwen 3.5-9B model on the SWE-bench coding tests).
Despite these limitations, Harness X makes a concrete case that harness engineering — not just model scaling — is a lever practitioners can pull now. For teams running smaller open-weight models on complex workflows, the gains here are large enough to justify evaluating harness evolution as a first step before reaching for a more expensive frontier model. The researchers plan to release the code in a future update.



