
When AI Learns to Test Its Own Skills — So Do I
Anthropic published an article a couple of days ago, titled “Improving Skill Creator: Test, Measure, and Refine Agent Skills.”
I laughed when I finished reading it.
Not because it was funny — but because I’d been doing the exact same thing over the past few months.
They’re doing it at the platform level. I’m doing it at the personal level — evolving my own “lazy factory.”
First, What Did Anthropic Actually Say?
The core message is simple:
AI Skills that “seem to work” aren’t enough. You need to test them, measure them, and keep improving them.
They break Skills into two categories:
| Type | Description | Examples |
|---|---|---|
| Capability-enhancing | Makes AI do things it couldn’t do well before | PDF form filling, document generation |
| Preference-encoding | “Writes” your workflow into the AI | NDA review, weekly report summarization |
Then they introduced an eval system — basically a way to write “exam questions” for your AI skills.
Define your input, describe the expected output, and see if the AI passes.
Sounds like unit testing in software engineering, right?
Exactly — and they say so themselves.
Bringing the rigor of software development (tests, benchmarks, iterative improvement) into skill writing — without needing to write code.
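To make the "exam question" idea concrete, here is a minimal Python sketch. It is not Anthropic's eval format (their tooling is described as needing no code); it only illustrates the define-input, describe-expected-output, check-the-result loop. `EvalCase`, `run_evals`, and `fake_commit_skill` are hypothetical names, and the "expected output" is reduced to a simple phrase check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str                # what this "exam question" checks
    prompt: str              # the input handed to the skill
    must_contain: list[str]  # phrases the output is expected to include


def run_evals(skill: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the skill and record pass/fail per case."""
    results = {}
    for case in cases:
        output = skill(case.prompt).lower()
        results[case.name] = all(p.lower() in output for p in case.must_contain)
    return results


# Hypothetical stand-in for a real skill call (e.g. a commit-message skill).
def fake_commit_skill(prompt: str) -> str:
    return "fix: correct typo in README"


cases = [
    EvalCase("uses a conventional prefix", "Typo fixed in README", ["fix:"]),
    EvalCase("mentions the file", "Typo fixed in README", ["readme"]),
]
results = run_evals(fake_commit_skill, cases)
```

In practice the stand-in function would be replaced by whatever actually invokes the skill; the cases themselves stay plain data, which is what makes them easy for a non-engineer to write.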
Funny, I’ve Been Doing the Same Thing
During the Lunar New Year, I did a big overhaul of my workflow and rebuilt my entire AI collaboration system from scratch.
Right now, inside my Claude Code environment, I have 19 custom-built Skills, 12 specialized Agents, and a full cross-tool knowledge sync architecture.
Sounds intense?
It’s really just what happens when a lazy person refuses to do the same thing twice.
My Skills Fall Into Two Categories Too
Looking back at Anthropic’s framework, my Skills map perfectly onto both types:
Capability-enhancing:
- book-cover-automation: Auto-download and remove backgrounds from book covers
- translate-blog: Auto-translate Chinese posts to English
- seo-analysis: SEO data analysis and strategy generation
Preference-encoding (writing my workflow into AI):
- hugo-content-guide: My writing style and formatting rules
- commit: Auto-generate Git commit messages
- daily-review: Daily review → auto-write to Anytype
- session-end: Auto status check and knowledge extraction when a session ends
The second category is where I’ve invested the most effort.
Because these aren’t about “making AI smarter” — they’re about “making AI become me.”
Testing Isn’t Optional
Anthropic’s article calls out a very real problem:
Most skill authors are domain experts, not engineers. They know their workflows, but lack the tools to verify whether a skill still works correctly.
That hit home.
I’m a CEO from an insurance background, not an engineer. But I’m now managing 19 AI Skills, and every single one affects my content production pipeline.
If translate-blog breaks,
my English posts will be wrong.
If the tone rules in hugo-content-guide drift,
the AI’s writing won’t sound like me.
So I started doing something similar to eval — just in a more lo-fi way:
- check-skills: A dedicated Skill for checking the health of all other Skills
- sync-skills: Keeps knowledge in sync across Claude Code, Copilot, and Codex
- promote-lessons: Reviews knowledge suggestions to prevent config files from bloating indefinitely
Not elegant, but it works.
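For a sense of what a lo-fi health check can look like, here is a rough Python sketch. It is not the actual check-skills implementation. It assumes each Skill lives in its own folder with a SKILL.md that opens with a `---` front-matter block containing `name` and `description`; the function names and the list of required fields are illustrative assumptions.

```python
from pathlib import Path

REQUIRED_FIELDS = ("name", "description")  # assumed minimum front-matter fields


def check_skill_text(text: str) -> list[str]:
    """Return the problems found in one SKILL.md (an empty list means healthy)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing front matter"]
    try:
        # Index of the line that closes the front-matter block.
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return ["front matter is never closed"]
    keys = {line.split(":", 1)[0].strip() for line in lines[1:end] if ":" in line}
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in keys]
    if not "\n".join(lines[end + 1:]).strip():
        problems.append("empty body")
    return problems


def check_all(skills_dir: Path) -> dict[str, list[str]]:
    """Check every sub-folder of skills_dir (a hypothetical layout) for a valid SKILL.md."""
    report = {}
    for skill_dir in sorted(p for p in skills_dir.iterdir() if p.is_dir()):
        skill_file = skill_dir / "SKILL.md"
        if not skill_file.is_file():
            report[skill_dir.name] = ["SKILL.md not found"]
        else:
            report[skill_dir.name] = check_skill_text(skill_file.read_text(encoding="utf-8"))
    return report
```

A structural check like this only catches broken files, not drifting behavior, which is why it complements an eval instead of replacing one.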
When a Model Improves, Should Your Skill Retire?
One part of the article really resonated with me:
If the base model starts passing your eval without the Skill, that means the technique has been absorbed by the model. The Skill isn’t broken — it’s just no longer needed.
This matches my experience exactly.
I’ve already retired 3 Skills:
- canva-cover-update
- code-simplifier
- content-writing
Not because they were poorly written — but because the model itself improved and no longer needed the extra prompting.
And honestly? That’s a good thing.
It means your automation system is alive — it self-simplifies as AI evolves.
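The retirement rule quoted above can be written down as a small Python check. This is a sketch under my own assumptions: the eval results are plain lists of pass/fail booleans, and `threshold` (defaulting to a perfect score) is a knob I added, since the article only says the base model "starts passing" without the Skill.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


def should_retire(with_skill: list[bool], without_skill: list[bool],
                  threshold: float = 1.0) -> bool:
    """Suggest retirement when the bare model clears the threshold on its own
    and the Skill no longer improves the pass rate."""
    base = pass_rate(without_skill)
    return base >= threshold and base >= pass_rate(with_skill)


# Same eval cases, run once with the Skill loaded and once without it.
absorbed = should_retire(with_skill=[True, True, True], without_skill=[True, True, True])
still_needed = should_retire(with_skill=[True, True, True], without_skill=[True, False, True])
```

Here `absorbed` is `True` and `still_needed` is `False`: the Skill is only flagged once the model matches it without help.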
The Full Picture of the Lazy Factory
Since we’re here, let me sketch out what the full system looks like right now:
Content Pipeline:
Notion post → Hugo blog → English translation → SEO optimization
→ Auto-generate social posts → Auto-schedule to FB / IG
Knowledge Pipeline:
Reading notes → Zettelkasten cards → Anytype
Daily review → Conversational diary → Anytype
Operations Pipeline:
GA4 data → Growth strategy → CTR optimization
Newsletter → Auto-send via ConvertKit
Podcast → Auto-integration and promotion
All of this is wired together by AI Skills + Python scripts, with three AI tools (Claude Code, GitHub Copilot, Codex) sharing the same knowledge base.
Yes — I’m the kind of person who spends an entire Lunar New Year building systems just to avoid doing things manually.
Very lazy. But my laziness is highly systematic.
The Future: The Line Between Skill and Spec Will Blur
The article closes with a fascinating observation:
As models improve, the line between “Skill” and “Specification” may blur. Today’s SKILL.md is an implementation plan — telling AI exactly how to do something. In the future, a natural language description of what to do might be enough.
I think they’re right.
Right now, each of my SKILL.md files runs hundreds of lines — full of formatting rules, banned phrases, sentence patterns, and example code.
But maybe someday, all I’ll need to write is:
“Write a lifestyle post in Lazy Da’s voice.”
And the AI will just know what to do.
When that day comes, my Skills won’t disappear — they’ll have become the AI’s memory.

Lazy Conclusion
What Anthropic is doing and what I’m doing are fundamentally the same thing:
Turning AI skills from “seems like it works” to “confirmed to work.”
The only difference is scale — they’re building platform-level tools, I’m building a personal lazy factory.
But the core logic is identical:
- Define your workflow (write it as a Skill)
- Test whether it works as expected (via eval or lo-fi methods)
- Trim as the model improves (retire what’s no longer needed)
AI won’t replace you — but people who use AI will move faster.
And people who test their own AI skills will move a little more steadily, too.


