Measure Your Agent Skills or Delete Them

AI · 12 June 2026 · 4 min read

Nick Nisi discovered that one of his skills dropped accuracy from 97% to 77%. The lesson: an unmeasured skill can actively hurt. Evaluate or delete.

Nick Nisi from WorkOS ran a simple comparison. He gave a coding agent a task with a skill loaded. The agent got it right 77% of the time. Then he ran the same task without the skill. The agent got it right 97% of the time.

The skill made things worse. He deleted 95% of his skills and replaced them with 553 lines of handwritten gotchas. His eval time dropped from 68 minutes to 6 minutes.

This is the most important lesson about agent skills right now. You have no idea if your skills are helping or hurting because you are not measuring them.

Skills are not free

Every skill you add consumes context and provides alternative instructions. Most people think of skills as additive. More knowledge, better results. But each skill is a gamble. It might help. It might hurt. Without measurement, you cannot tell which.

The model already knows how to code. What it does not know is where your specific codebase has landmines. The conventions that violate common practice. The gotchas in your deployment pipeline.

When you add a skill, the model does not just add that knowledge. It may override or conflict with what it already knows. Nisi's finding shows this directly: the skill was not neutral, it was destructive.

The gotcha principle

Nisi replaced his comprehensive skills with gotchas. His insight: do not write a textbook on your stack. Write the things the model will get wrong about your specific setup.

A good skill captures landmines. What is the unusual database constraint? What is the non-standard directory layout? Which patterns look correct but silently break in this codebase?

A bad skill repeats what the model already knows. General framework patterns, language idioms, best practices you can find in any documentation. The model has seen all of this before. Adding it as a skill wastes tokens and creates noise.

If you look at your skills and see things the model could learn from reading the codebase, delete them. If you see things the model would never discover by reading the code, keep them.

Evidence over trust

Nisi's deeper contribution is the enforcement mechanism. He moved past better prompts. He built a state-machine harness that computes a SHA-256 hash of the actual test output and verifies that hash before accepting a result. The principle: make it easier to do the real work than to fake it, and enforce through code, not prompts.

This is the difference between hope and evidence. A prompt says "please run the tests." A harness runs the tests, hashes the output, and rejects anything that does not match. One is a request. The other is a gate.
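A gate of this kind can be sketched in a few lines of Python. This is a minimal illustration, not Nisi's actual harness: the function names, the idea that the agent submits the output it claims to have seen, and the stand-in FAKE_TESTS command are all assumptions made for the example. It also assumes the test output is deterministic, which real test runners that print timings are not.

```python
import hashlib
import hmac
import subprocess
import sys


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def run_tests(command: list[str]) -> tuple[bool, str]:
    """Run the test command inside the harness and hash what it really printed."""
    result = subprocess.run(command, capture_output=True, check=False)
    return result.returncode == 0, sha256_hex(result.stdout + result.stderr)


def gate(claimed_output: bytes, command: list[str]) -> bool:
    """Accept a claim only if the tests pass AND the claimed output
    hashes to the same digest as the output the harness captured."""
    passed, actual_digest = run_tests(command)
    claimed_digest = sha256_hex(claimed_output)
    return passed and hmac.compare_digest(claimed_digest, actual_digest)


# Hypothetical stand-in for a real test command (deterministic on purpose).
FAKE_TESTS = [sys.executable, "-c", "import sys; sys.stdout.write('2 passed')"]

if __name__ == "__main__":
    print(gate(b"2 passed", FAKE_TESTS))      # claim matches the real output
    print(gate(b"all tests ok", FAKE_TESTS))  # claim was never produced by the tests
```

The point of the design is that the agent never gets to assert success. It can only hand over bytes, and the harness decides whether those bytes correspond to a real, passing run.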

The same principle applies to skill evaluation. Running an eval once tells you something. Running it continuously tells you when a model update breaks your skill, when a new skill conflicts with an existing one, and when a skill has drifted from relevance.
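One way to make "continuously" concrete is to keep a small history of eval results and flag any run where the skill stops beating the baseline. The sketch below assumes a hypothetical EvalResult record and a min_lift threshold; neither comes from Nisi's setup, and the numbers in the usage example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    label: str            # hypothetical tag, e.g. a model version or a date
    with_skill: float     # accuracy with the skill loaded, 0.0 to 1.0
    without_skill: float  # accuracy on the same task with no skill


def regressions(history: list[EvalResult], min_lift: float = 0.0) -> list[str]:
    """Return the labels of runs where the skill's lift over the
    baseline is at or below min_lift."""
    return [
        r.label
        for r in history
        if r.with_skill - r.without_skill <= min_lift
    ]


if __name__ == "__main__":
    # Invented numbers, purely for illustration.
    history = [
        EvalResult("model-a", with_skill=0.9, without_skill=0.7),
        EvalResult("model-b", with_skill=0.6, without_skill=0.9),
    ]
    print(regressions(history))
```

Run on a schedule or on every model upgrade, a check like this turns "the skill has drifted" from a hunch into a failing job.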

The skills guide covers how to build skills. This article covers how to know whether they work.

What to measure

If you are not measuring your skills, start with the minimum viable eval:

1. Pick one task your skill is supposed to improve
2. Run it 10 times without the skill, 10 times with the skill
3. Compare accuracy, run time, and token consumption
4. If the skill does not improve accuracy, delete it
5. If the skill degrades accuracy, delete it and rethink your approach
6. If the skill increases token consumption without improving results, it is noise
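The checklist above can be sketched as a small comparison function. This is a minimal example under stated assumptions: the Run record, the summarize and verdict helpers, and the verdict strings are hypothetical names chosen for illustration, and how the runs are actually collected from your agent is left out.

```python
from dataclasses import dataclass


@dataclass
class Run:
    passed: bool    # did the agent get the task right?
    seconds: float  # wall-clock run time
    tokens: int     # total tokens consumed


def summarize(runs: list[Run]) -> dict[str, float]:
    """Average accuracy, run time, and token use over a batch of runs."""
    n = len(runs)
    return {
        "accuracy": sum(r.passed for r in runs) / n,
        "seconds": sum(r.seconds for r in runs) / n,
        "tokens": sum(r.tokens for r in runs) / n,
    }


def verdict(without_skill: list[Run], with_skill: list[Run]) -> str:
    """Apply the checklist: keep the skill only if accuracy improves."""
    base, skill = summarize(without_skill), summarize(with_skill)
    if skill["accuracy"] < base["accuracy"]:
        return "delete and rethink"
    if skill["accuracy"] == base["accuracy"]:
        # No accuracy gain; extra tokens make it pure noise.
        return "delete (noise)" if skill["tokens"] > base["tokens"] else "delete"
    return "keep"
```

With only 10 runs per side, a difference of a single run is well within chance, so treat a narrow "keep" as a reason to run more trials rather than as proof.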

Claude has a built-in eval-creator skill that sets up these comparisons. It generates side-by-side HTML output. Use it.

Nisi's 77% vs 97% result should not be dismissed as an outlier. It is what can happen when you add a skill blindly. The only way to know if your skills work is to test them. If you cannot measure a skill, you cannot justify keeping it.

Delete what you cannot defend

This applies at every scale. A 10,000-line skill corpus can be worse than no skills at all. A single carefully written gotcha can be better than a comprehensive reference. The question is whether the model performs better with your instructions than without them.

The hardest thing about skills is not writing them. It is deleting the ones that do not work.
