Industry roundup #11
Vercel testing AGENTS.md vs Skills
Last week I stumbled upon this post by Vercel: “AGENTS.md outperforms skills in our agent evals”.
Vercel compared two concrete ways of giving an agent knowledge that was part of neither pre-training nor post-training: AGENTS.md vs Skills. AGENTS.md is a markdown file that is always present in the agent’s context. It is the closest thing we have to a standard for agentic programming, although Claude Code uses a different but similar convention (CLAUDE.md).
A skill is a folder containing at least a SKILL.md file with metadata and instructions that tell an agent how to perform a specific task. A skill folder usually contains multiple markdown files for domain knowledge (prompts, tools, documentation) that the agent can invoke on demand.
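To make the structure concrete, here is a hypothetical skill folder. The file names, frontmatter fields, and contents are illustrative, not taken from Vercel's setup:

```markdown
<!-- skills/pdf-tools/SKILL.md (hypothetical example) -->
---
name: pdf-tools
description: Extract text and tables from PDF files
---

# PDF tools

1. Read `extraction-guide.md` for the recommended library and flags.
2. For tables, follow the prompt template in `table-prompt.md`.
```

The `name` and `description` in the frontmatter are what the agent sees up front; everything below, plus the referenced files, is only loaded when the skill is used.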
The difference between AGENTS.md and Skills is that AGENTS.md is persistent context for coding agents. The content of the file is available to the agent at all times; there is no decision to load it. For this reason, the file must be kept small, to prevent it from consuming the entire context.
Skills instead are used in a progressive way. At the start, agents load only the name and description of each available skill, just enough to know when it might be relevant (Discovery). If a task matches a skill’s description, the agent loads the full content of SKILL.md in the context (Activation). The agent then follows the instructions in SKILL.md, loading referenced files or executing code as needed (Execution).
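The three phases above can be sketched as a small context-assembly loop. This is a minimal illustration, not Vercel's or Anthropic's implementation; the `matches` callback stands in for the model's own (probabilistic) relevance decision:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str   # always visible to the agent (Discovery)
    instructions: str  # full SKILL.md body, loaded only on Activation
    resources: dict = field(default_factory=dict)  # files read during Execution

def build_context(skills, task, matches):
    """Assemble the context the agent sees for a task.

    Discovery: every skill contributes only name + description.
    Activation: if the skill looks relevant, its full instructions
    are pulled into context as well.
    """
    context = []
    for skill in skills:
        context.append(f"{skill.name}: {skill.description}")
        if matches(skill, task):
            context.append(skill.instructions)
    return context
```

The key point the sketch makes visible: Activation is a branch the agent has to take correctly. If `matches` never fires, the skill's instructions simply never enter the context, which is exactly the 53%-with-Skills failure mode below.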
In a sense, AGENTS.md is a push-context mode, where you stuff guidance into a file that the agent sees at every turn, while a skill is pull-context mode, where the agent has a menu available and hopefully decides to fetch the right thing at the right time.
The results were brutal
With nothing available, the agent pass rate was 53%. With Skills available, the agent pass rate was … 53%. The Skills were there, but never invoked. Vercel found that adding instructions about the existence of Skills to AGENTS.md helped, raising the pass rate to 79%. But when they put a compressed index of all the docs directly into AGENTS.md, the pass rate hit 100%.
What stands out here is the level of non-determinism of LLMs and the wizardry required to deal with it. For example, when they added instructions to use the Skills in AGENTS.md, Vercel found that different wordings produced dramatically different results. Something like “You MUST invoke the skill” produced worse results than “Explore project first, then invoke skill”. Why? Nobody knows.
Programming agents is not like configuring a deterministic system. It’s more like herding cats: nudging a probabilistic policy with fragile text instructions. In fact, if you think about it, what Vercel did with AGENTS.md as a compressed index (not the whole docs, just a pointer map to version-matched files) is basically rebuilding the skill system, but entirely in context. In doing so they removed the Activation step, so the agent no longer has to decide whether to retrieve context, eliminating a source of non-determinism that is hard to debug.
Do you like this post? Of course you do. Share it on Twitter/X, LinkedIn and HackerNews



Really insightful breakdown of the Vercel test. The fact that "You MUST invoke" performed worse than "Explore project first" shows how much prompt engineering is still voodoo even in 2026. The pull vs push context trade-off is interesting too, basically trading token efficiency for reliability. Seems like we're learning that reducing agent decision points (even if it means bloating context) beats hoping the model makes the right retrieval call.