30 May 2026 Research Deep-Dive

EvoSkills: When AI Agents Write Better Playbooks Than Humans

A paper published this month by researchers at UIC, MBZUAI, McGill, Columbia, and UBC challenges a fundamental assumption in enterprise AI: that the best way to make an agent more capable is to have a human expert write its instructions. The results are worth paying attention to.

The Problem: Human-Machine Cognitive Misalignment

Anthropic's concept of "agent skills" has become a standard building block for production AI systems. A skill is not a single tool call — it is a structured bundle of instructions, scripts, and reference materials that teach an agent how to handle a multi-step professional task. Think of it as an operating manual: not "call this API" but "here is how to handle a clinical adverse event report from start to finish."

The problem, as the EvoSkills authors demonstrate through SkillsBench evaluations, is that human-written skills produce uneven results. Some domains benefit substantially. Others — including Natural Science — actually perform worse with human-curated skills than with no skills at all. The authors call this "human-machine cognitive misalignment": workflows and abstractions designed to be intuitive for human experts do not naturally match how LLM agents process context, reason, and act under execution constraints.

This is not a theoretical concern. If you are deploying AI agents in a regulated environment — clinical trials, financial compliance, safety-critical operations — a skill that works in testing but degrades in production is a liability, not an asset.


What EvoSkills Does

EvoSkills introduces a co-evolutionary framework with two components:

The Skill Generator — an LLM agent that creates an initial skill package (a SKILL.md file plus supporting scripts and references) and iteratively refines it based on structured feedback.

The Surrogate Verifier — an independent LLM agent that evaluates the skill's outputs against the task requirements, generates diagnostic feedback (failed test cases, root-cause analysis, actionable revision suggestions), and co-evolves alongside the generator.

The key insight is information isolation. The generator never sees the ground-truth test criteria — it only sees the verifier's feedback. The verifier only strengthens its tests when the ground-truth oracle exposes a gap between its assessment and reality. This creates an adversarial dynamic that drives both components to improve without either having access to the full answer.

The process runs for a small number of iterations (typically five), after which the evolved skill is finalised.
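The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the requirement names, the `ToyVerifier` class, and the `revise`, `oracle_passes`, and `co_evolve` functions are all hypothetical, and a real system would use LLM agents where this sketch uses set operations. What it does preserve is the information isolation: the generator sees only the verifier's failed checks, and the oracle returns nothing but pass or fail.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical ground truth. Only oracle_passes() may read it.
_HIDDEN_REQUIREMENTS: Set[str] = {"parse_input", "validate_units", "write_report"}


def oracle_passes(skill: Set[str]) -> bool:
    """Ground-truth oracle: returns a bare pass/fail and leaks nothing else."""
    return _HIDDEN_REQUIREMENTS <= skill


class ToyVerifier:
    """Surrogate verifier that starts with weak tests and can strengthen them."""

    def __init__(self, candidate_checks: List[str]) -> None:
        # Assumption: the verifier derives candidate checks from the task
        # description on its own, never from the hidden requirements.
        self._candidates = list(candidate_checks)
        self._active = self._candidates[:1]  # begin with a single check

    def review(self, skill: Set[str]) -> List[str]:
        """Return the names of active checks that the skill fails."""
        return [c for c in self._active if c not in skill]

    def strengthen(self) -> bool:
        """Activate one more check; return False when none are left."""
        if len(self._active) == len(self._candidates):
            return False
        self._active.append(self._candidates[len(self._active)])
        return True


def revise(skill: Set[str], failed_checks: List[str]) -> Set[str]:
    """Toy generator: patch the skill using only the verifier's feedback."""
    return skill | set(failed_checks)


def co_evolve(verifier: ToyVerifier, max_iters: int = 5) -> Tuple[Set[str], Optional[int]]:
    """Run the loop; return the skill and the round it passed the oracle (or None)."""
    skill: Set[str] = set()
    for round_no in range(1, max_iters + 1):
        skill = revise(skill, verifier.review(skill))
        if verifier.review(skill):
            continue  # still failing surrogate tests; revise again next round
        if oracle_passes(skill):
            return skill, round_no
        if not verifier.strengthen():
            break  # verifier has no stronger tests to offer
    return skill, None


skill, rounds = co_evolve(ToyVerifier(["parse_input", "validate_units", "write_report"]))
print(sorted(skill), rounds)
```

In this toy run the verifier strengthens its tests twice, each time because the oracle rejected a skill that the surrogate tests had accepted, and the skill passes in the third round. A verifier with nothing stronger to offer stops early and returns `None` for the round.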


The Results: 20 Points Over Human-Curated Skills

On SkillsBench — 87 tasks across roughly 20 professional domains — the results are decisive:

Approach                                   Pass Rate   vs Baseline
No skill baseline                          30.6%
Self-generated skills (no co-evolution)    32.0%       +1.4pp
CoT-guided self-generation                 30.7%       +0.1pp
Skill-creator (Anthropic meta-skill)       34.1%       +3.5pp
Human-curated skills (SkillsBench)         34.1%       +3.5pp
EvoSkills (co-evolved)                     71.1%       +40.5pp

Self-generated skills without co-evolution, where an agent simply writes a skill and uses it without iterative verification, barely improve on the baseline (+1.4pp). The gains come from the verification loop, not from the skill-writing prompt.
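For readers who want to check the "vs Baseline" column, the short sketch below recomputes each delta in percentage points. The pass rates are copied from the table above; the variable names are illustrative only.

```python
# Pass rates (%) as reported in the table above.
pass_rates = {
    "No skill baseline": 30.6,
    "Self-generated skills (no co-evolution)": 32.0,
    "CoT-guided self-generation": 30.7,
    "Skill-creator (Anthropic meta-skill)": 34.1,
    "Human-curated skills (SkillsBench)": 34.1,
    "EvoSkills (co-evolved)": 71.1,
}

baseline = pass_rates["No skill baseline"]

# Percentage-point difference from the baseline, rounded to one decimal
# place to absorb floating-point noise.
deltas = {name: round(rate - baseline, 1) for name, rate in pass_rates.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}pp")
```

Percentage points are simple differences between pass rates, so a move from 30.6% to 71.1% is +40.5pp even though the pass rate itself more than doubles.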

More striking is the cross-model transfer result. Skills evolved by Claude Opus 4.6 were transferred to six additional models from five providers (GPT-5.2, Sonnet 4.5, Haiku 4.5, Qwen3 Coder, DeepSeek V3, Mistral Large 3). Every model benefited substantially, with gains of 35–44 percentage points over its own no-skill baseline. This suggests that the evolved skills encode reusable task structure rather than model-specific artefacts.


Why This Matters for Enterprise AI

Three implications for enterprise leaders

Verification infrastructure matters more than prompt engineering. Skills transfer across models, reducing vendor lock-in. And the role of domain experts shifts from writing instructions to defining success criteria.

First, the way we build agent capabilities needs to change. The current practice — subject matter experts writing skill documents, reviewing them, deploying them — is labour-intensive and produces inconsistent results. EvoSkills suggests that the skill authoring process itself can be automated, with human experts shifting to a verification and oversight role rather than a writing role. This does not eliminate the need for domain expertise. It changes where that expertise is applied: from writing instructions to defining success criteria.

Second, verification is the bottleneck, not generation. The paper's ablation study makes this clear. Without co-evolutionary verification, self-generated skills perform almost identically to the no-skill baseline. The value is not in the agent's ability to write a skill — it is in the iterative loop that tests, diagnoses, and refines. For enterprise deployments, this means investment should go into evaluation infrastructure (test suites, deterministic verifiers, ground-truth benchmarks) rather than into prompt engineering for skill authoring.

Third, cross-model portability changes the economics. If skills evolved on one model transfer effectively to others, organisations are not locked into a single provider's ecosystem. In the paper's transfer experiments, skills evolved on Claude delivered large gains on GPT, Qwen, and Mistral models as well. This reduces vendor lock-in risk and makes skill investment more durable across model generations.


The Limitations Worth Noting

The evaluation is on SkillsBench, which covers 87 tasks — a meaningful benchmark but not a production environment. Real-world tasks have fuzzier success criteria, longer horizons, and more ambiguous failure modes. The paper does not address how EvoSkills performs when the ground-truth oracle is imperfect or unavailable — which is the common case in enterprise settings.

The co-evolutionary loop also assumes that tasks can be evaluated with deterministic pass/fail criteria. Many enterprise workflows — client communication, regulatory interpretation, clinical judgment — do not have binary success metrics. Extending this approach to those domains will require new verification strategies.

Finally, the five-iteration evolution window is small. The authors note that context overflow becomes a constraint beyond this point. Scaling to more complex skill bundles (dozens of files, multi-system integrations) will require architectural changes to the evolution process.


What This Means Practically

If you are building AI agent systems today, three actions are worth considering:

  1. Invest in verification infrastructure before investing in skill authoring. Build test suites, deterministic verifiers, and evaluation benchmarks for your key workflows. The paper shows that this is where the performance gains come from.
  2. Design skills for machine consumption, not human readability. The human-machine cognitive misalignment finding suggests that the optimal skill format for an LLM agent may look very different from what a human expert would write. Let the agent evolve its own structure.
  3. Build skill libraries with portability in mind. If skills transfer across models, invest in skill development as a durable asset rather than a model-specific configuration.
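As a starting point for the first action, a deterministic verifier can be as simple as a named set of pass/fail checks run against an agent's structured output. The sketch below is a minimal illustration, not something taken from the paper: the `run_checks` helper, the check names, and the adverse-event fields (`case_id`, `severity`, `onset_date`) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[dict], bool]

# Hypothetical checks for a structured adverse-event report.
CHECKS: Dict[str, Check] = {
    "has_case_id": lambda o: bool(o.get("case_id")),
    "severity_is_valid": lambda o: o.get("severity") in {"mild", "moderate", "severe"},
    "has_onset_date": lambda o: "onset_date" in o,
}


def run_checks(output: dict, checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check; return overall pass/fail plus the names of failed checks."""
    failed = [name for name, check in checks.items() if not check(output)]
    return (not failed, failed)


good = {"case_id": "A-1", "severity": "severe", "onset_date": "2026-01-01"}
bad = {"case_id": "", "severity": "high"}

print(run_checks(good, CHECKS))
print(run_checks(bad, CHECKS))
```

Returning the names of the failed checks, rather than a bare boolean, gives a refinement loop something concrete to act on, which mirrors the diagnostic feedback role the verifier plays in the paper.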

Paper Details

Title: EvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Authors: Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue (Steve) Liu, Xiaoxiao Li, Philip S. Yu
Institutions: University of Illinois Chicago, MBZUAI, McGill University, Columbia University, Zhejiang University, University of British Columbia
Status: Preprint, under review (2026)
ArXiv: 2605.23478
