SkillsBench is a new benchmark that measures how much curated agent skills improve LLM agents across diverse tasks. Learn the key results, what worked best, and why self-generated skills often fail.
Agent skills are supposed to make LLM agents better at real work. But how much do they actually help? A new paper, "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks," builds a benchmark to test exactly that: measuring whether giving agents structured skills actually improves task success, and spotting cases where it makes things worse.
What is SkillsBench?
SkillsBench is a benchmark that tests how well LLM agents use structured skills at inference time. The interesting part: it doesn’t just ask whether the agent can solve a task. It compares performance with skills, without skills, and with skills the model wrote itself.
The benchmark covers 84 tasks across 11 domains. Each task gets paired with curated skill modules, and results are scored by programmatic pass/fail tests — no subjective judging involved. Every task runs under three conditions: no skills, curated skills, and self-generated skills.

What counts as an “agent skill” here?
In SkillsBench, an agent skill isn’t just a prompt. It’s a structured package that can include a SKILL.md file with step-by-step instructions, reusable templates, scripts, examples, and optional reference materials.
Think of it as a how-to playbook. A lot of real tasks fail not because the model lacks knowledge, but because it doesn’t know the right procedure. Skills fill that gap.
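To make that concrete, here is a minimal sketch of what a skill package along these lines could look like. Only the SKILL.md file name comes from the paper; the folder layout, file names, and the healthcare-flavored content are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical skill package layout: instructions, a helper script, an example.
# Everything except the SKILL.md convention is an assumption for illustration.
from pathlib import Path

SKILL_MD = """\
# Skill: Export patient records to HL7 (illustrative)

## When to use
The task asks for HL7 v2 output from tabular patient data.

## Steps
1. Validate required columns (id, name, dob) before converting.
2. Run scripts/convert.py on the cleaned CSV.
3. Compare the output against examples/expected.hl7.
"""

def write_skill_package(root: str) -> None:
    """Lay out a minimal skill package: step-by-step instructions,
    a reusable script, and a worked example."""
    base = Path(root)
    (base / "scripts").mkdir(parents=True, exist_ok=True)
    (base / "examples").mkdir(exist_ok=True)
    (base / "SKILL.md").write_text(SKILL_MD)
    (base / "scripts" / "convert.py").write_text("# conversion helper goes here\n")
    (base / "examples" / "expected.hl7").write_text("MSH|^~\\&|...\n")

write_skill_package("skills/hl7-export")
```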
How SkillsBench is evaluated
Each task runs in a containerized environment and gets scored with pass/fail tests. The benchmark tests across multiple commercial agent platforms and frontier models, all run at temperature 0 for repeatable results.
The three test conditions: no skills (just the task instruction), curated skills (pre-built skill modules included in the environment), and self-generated skills (the model writes its own skills first, then tackles the task).
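In outline, the evaluation loop amounts to running each task once per condition and applying its deterministic checker. The sketch below assumes hypothetical run_agent and check helpers standing in for the containerized agent run and the programmatic test; the benchmark's real harness and API will differ.

```python
# Sketch of the three-condition comparison. run_agent() and check() are
# hypothetical placeholders, not the benchmark's actual interface.
from statistics import mean

CONDITIONS = ["no_skills", "curated_skills", "self_generated_skills"]

def run_agent(task_id: str, condition: str) -> str:
    """Placeholder: launch the agent in a container at temperature 0
    and return the path to its output artifact."""
    raise NotImplementedError

def check(task_id: str, artifact: str) -> bool:
    """Placeholder: run the task's programmatic pass/fail test."""
    raise NotImplementedError

def evaluate(task_ids: list[str]) -> dict[str, float]:
    """Pass rate per condition across all tasks."""
    results: dict[str, list[bool]] = {c: [] for c in CONDITIONS}
    for task_id in task_ids:
        for condition in CONDITIONS:
            artifact = run_agent(task_id, condition)
            results[condition].append(check(task_id, artifact))
    return {c: mean(passed) for c, passed in results.items()}
```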

Curated skills boost pass rates — but it’s not automatic
Across 7 agent-model configurations and 7,308 trajectories, the data is clear: curated skills improve performance by an average of 16.2 percentage points.
But that average hides a lot of variation. Gains depend heavily on the model, the agent platform, the domain, and the specific task. Skill augmentation works, but it isn't free performance: how well you build and integrate the skills matters just as much.
The best-performing setup
The top-performing setup was Gemini CLI + Gemini 3 Flash, hitting a 48.7% pass rate with skills. Other configurations also improved, but the benchmark surfaced something worth noting: the agent platform matters a lot.
Some platforms are better at actually using skills than others. Claude Code showed strong, consistent gains. Other agents would acknowledge the skills in their reasoning but then ignore them during execution.

Self-generated skills mostly don’t work
Here’s the finding that matters most for teams building agents: self-generated skills barely help. Compared to the no-skills baseline, self-generated skills averaged around -1.3 percentage points. They actually made things slightly worse.
Models clearly benefit from good procedural knowledge. But they can’t reliably write that knowledge themselves. The self-generated skills tended to be vague, miss critical steps, or skip domain-specific workflows the model didn’t realize were needed.
Domain makes a big difference
Skill impact varied a lot by domain. Areas with specialized workflows saw the biggest gains: healthcare tasks improved by an average of 51.9 percentage points, and manufacturing tasks by 41.9. Software engineering and mathematics saw smaller bumps.
Not every task benefited. Skills sometimes hurt performance, especially when they added unnecessary overhead or conflicted with what the model already knew how to do well. Sometimes a simpler approach was just better, and the skill got in the way.
What makes a good skill?
The benchmark also looked at what makes a skill actually work. Two design patterns emerged: the sweet spot is 2 to 3 skills per task (more hit diminishing returns), and moderate-length skills outperform comprehensive documentation. Long, manual-style skills seem to overload the context and make it harder for the model to find what it needs.
One more thing worth flagging: smaller models with good curated skills sometimes beat larger models with no skills. That makes skills a practical lever for teams watching costs.
Practical takeaways for teams building LLM agents
If you’re deploying LLM agents in production, here’s what the data suggests: use curated, tested skills for specialized workflows rather than relying on the model to figure out procedures on its own. Keep skills focused and modular — step-by-step instructions with a working example beat sprawling documentation. Always measure skill impact with paired evaluations (with skills vs. without), not vibes. And don’t count on self-generated skills to cover your procedural gaps.
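On the paired-evaluation point, here is a minimal sketch of the bookkeeping, assuming you record one pass/fail flag per (task, condition) run; the function and field names are illustrative, not from the paper.

```python
# Paired with-vs-without comparison: mean per-task change in pass rate.
# Field names ("task", "condition", "passed") are illustrative assumptions.
def paired_delta(runs: list[dict]) -> float:
    """Mean per-task difference in pass rate: with_skills minus no_skills."""
    by_task: dict[str, dict[str, list[bool]]] = {}
    for run in runs:
        by_task.setdefault(run["task"], {}).setdefault(run["condition"], []).append(run["passed"])
    deltas = []
    for conditions in by_task.values():
        if "with_skills" in conditions and "no_skills" in conditions:
            with_rate = sum(conditions["with_skills"]) / len(conditions["with_skills"])
            base_rate = sum(conditions["no_skills"]) / len(conditions["no_skills"])
            deltas.append(with_rate - base_rate)
    return sum(deltas) / len(deltas) if deltas else 0.0

# Example: +0.5 means skills lifted the paired pass rate by 50 percentage points.
runs = [
    {"task": "t1", "condition": "no_skills", "passed": False},
    {"task": "t1", "condition": "with_skills", "passed": True},
    {"task": "t2", "condition": "no_skills", "passed": True},
    {"task": "t2", "condition": "with_skills", "passed": True},
]
print(paired_delta(runs))  # 0.5
```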
The bottom line
SkillsBench gives the agent-building community something it needed: a repeatable way to measure whether skills actually work. The takeaway is straightforward — curated skills help, self-generated skills mostly don’t, and both the quality of the skills and the agent platform shape the outcome.
For researchers, it’s a solid benchmark with clear methodology. For teams building agents, it’s evidence that investing in good skill design pays off, especially in domains where knowing the right procedure is half the battle.