2026 / 03 / 05

When AI Learns to Test Its Own Skills — So Do I

Q: What's an AI Skill? How is it different from a regular prompt?

A: A Skill is a structured set of instructions that tells an AI how to behave in a specific context. Unlike a one-off prompt, a Skill is reusable, testable, and iterable. Think of it as the AI’s “Standard Operating Procedure (SOP).”

Q: Can I build an AI Skill without knowing how to code?

A: Yes. Anthropic’s Skill Creator doesn’t require coding. I’m not from an engineering background either, but using Claude Code’s SKILL.md format, I can describe my workflows in plain language. The key isn’t coding ability; it’s knowing your own process well enough to write it down.

Q: How do I know when a Skill should retire?

A: The simplest test: run your test cases without loading the Skill, and see if the AI performs just as well. If it does, the Skill is probably no longer needed. That’s essentially what Anthropic’s eval system does.

Using Anthropic's Skill Creator eval feature as a starting point, I share my hands-on experience building 19 Claude Code Skills and show how test-driven thinking can continuously improve AI workflows.

Lazy Da

| 3 min read | Updated: 2026-03-05

Anthropic published an article a couple of days ago, titled “Improving Skill Creator: Test, Measure, and Refine Agent Skills.”

I laughed when I finished reading it.

Not because it was funny — but because I’d been doing the exact same thing over the past few months.

They’re doing it at the platform level. I’m doing it at the personal level — evolving my own “lazy factory.”

First, What Did Anthropic Actually Say?

The core message is simple:

AI Skills that “seem to work” aren’t enough. You need to test them, measure them, and keep improving them.

They break Skills into two categories:

Type	Description	Examples
Capability-enhancing	Makes AI do things it couldn’t do well before	PDF form filling, document generation
Preference-encoding	“Writes” your workflow into the AI	NDA review, weekly report summarization

Then they introduced an eval system — basically a way to write “exam questions” for your AI skills.

Define your input, describe the expected output, and see if the AI passes.

Sounds like unit testing in software engineering, right?

Exactly — and they say so themselves.

Bringing the rigor of software development (tests, benchmarks, iterative improvement) into skill writing — without needing to write code.
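The "exam question" idea above can be sketched in a few lines of Python. This is a minimal illustration, not Skill Creator's actual eval format, which the article does not show. The `EvalCase` class, the `must_include` / `must_not_include` checks, and `fake_model` are all hypothetical names; `fake_model` stands in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One 'exam question': an input plus plain checks on the output."""
    name: str
    prompt: str
    must_include: list[str] = field(default_factory=list)
    must_not_include: list[str] = field(default_factory=list)


def passes(case: EvalCase, output: str) -> bool:
    """Return True if the output satisfies every check in the case."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_include)
    has_banned = any(s.lower() in text for s in case.must_not_include)
    return has_required and not has_banned


def run_evals(cases: list[EvalCase], run_model: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through run_model and record pass/fail by case name."""
    return {case.name: passes(case, run_model(case.prompt)) for case in cases}


# Hypothetical stand-in for a real model call (assumption, not a real API).
def fake_model(prompt: str) -> str:
    return "feat: add weekly report summary" if "commit" in prompt else "Hello!"


cases = [
    EvalCase("commit-prefix", "Write a commit message for the new report feature",
             must_include=["feat:"], must_not_include=["as an ai"]),
    EvalCase("greeting", "Say hi", must_include=["good morning"]),
]
results = run_evals(cases, fake_model)
print(results)
```

Substring checks are the crudest possible grader, but they are enough to catch a banned phrase or a missing required element, which is most of what a preference-encoding Skill needs.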

Funny, I’ve Been Doing the Same Thing

During the Lunar New Year, I did a big overhaul of my workflow and rebuilt my entire AI collaboration system from scratch.

Right now, inside my Claude Code environment, I have 19 custom-built Skills, 12 specialized Agents, and a full cross-tool knowledge sync architecture.

Sounds intense?

It’s really just what happens when a lazy person refuses to do the same thing twice.

My Skills Fall Into Two Categories Too

Looking back at Anthropic’s framework, my Skills map perfectly onto both types:

Capability-enhancing:

book-cover-automation: Auto-download and remove backgrounds from book covers
translate-blog: Auto-translate Chinese posts to English
seo-analysis: SEO data analysis and strategy generation

Preference-encoding (writing my workflow into AI):

hugo-content-guide: My writing style and formatting rules
commit: Auto-generate Git commit messages
daily-review: Daily review → auto-write to Anytype
session-end: Auto status check and knowledge extraction when a session ends

The second category is where I’ve invested the most effort.

Because these aren’t about “making AI smarter” — they’re about “making AI become me.”

Testing Isn’t Optional

Anthropic’s article calls out a very real problem:

Most skill authors are domain experts, not engineers. They know their workflows, but lack the tools to verify whether a skill still works correctly.

That hit home.

I’m a CEO from an insurance background, not an engineer. But I’m now managing 19 AI Skills, and every single one affects my content production pipeline.

If translate-blog breaks, my English posts will be wrong.

If the tone rules in hugo-content-guide drift, the AI’s writing won’t sound like me.

So I started doing something similar to eval — just in a more lo-fi way:

check-skills: A dedicated Skill for checking the health of all other Skills
sync-skills: Keeps knowledge in sync across Claude Code, Copilot, and Codex
promote-lessons: Reviews knowledge suggestions to prevent config files from bloating indefinitely

Not elegant, but it works.
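A lo-fi health check of this kind could look something like the sketch below. It is not the actual check-skills implementation. It assumes each Skill lives in its own folder containing a SKILL.md that opens with a `---` frontmatter block holding `name` and `description` fields, and the function names `check_skill` and `check_all` are made up for illustration.

```python
import tempfile
from pathlib import Path

# Assumed required frontmatter fields for a SKILL.md file.
REQUIRED_FIELDS = ("name", "description")


def check_skill(skill_file: Path) -> list[str]:
    """Return a list of problems found in one SKILL.md (empty list = healthy)."""
    lines = skill_file.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing frontmatter block"]
    closing = [i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"]
    if not closing:
        return ["frontmatter block is never closed"]
    header = lines[1:closing[0]]
    keys = {line.split(":", 1)[0].strip() for line in header if ":" in line}
    problems = [f"missing '{key}' field" for key in REQUIRED_FIELDS if key not in keys]
    body = "\n".join(lines[closing[0] + 1:]).strip()
    if not body:
        problems.append("empty instructions body")
    return problems


def check_all(skills_root: Path) -> dict[str, list[str]]:
    """Check every <skill-folder>/SKILL.md under skills_root."""
    return {p.parent.name: check_skill(p) for p in sorted(skills_root.glob("*/SKILL.md"))}


# Demo on a throwaway directory: one healthy Skill, one broken one.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "commit").mkdir()
    (root / "commit" / "SKILL.md").write_text(
        "---\nname: commit\ndescription: Generate Git commit messages\n---\n"
        "Use conventional commits.\n", encoding="utf-8")
    (root / "translate-blog").mkdir()
    (root / "translate-blog" / "SKILL.md").write_text(
        "---\nname: translate-blog\n---\n", encoding="utf-8")
    report = check_all(root)

print(report)
```

A structural check like this only proves a Skill is well-formed, not that it still behaves correctly. Behavior needs eval-style test cases.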

When a Model Improves, Should Your Skill Retire?

One part of the article really resonated with me:

If the base model starts passing your eval without the Skill, that means the technique has been absorbed by the model. The Skill isn’t broken — it’s just no longer needed.

This matches my experience exactly.

I’ve already retired 3 Skills:

canva-cover-update
code-simplifier
content-writing

Not because they were poorly written — but because the model itself improved and no longer needed the extra prompting.

And honestly? That’s a good thing.

It means your automation system is alive — it self-simplifies as AI evolves.
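The retirement test described above (run the same cases with and without the Skill, then compare) can be sketched as follows. This is an illustrative sketch under stated assumptions: `pass_rate`, `should_retire`, the `tolerance` parameter, and the three stand-in model functions are hypothetical, and the two test cases are invented examples, not cases from any retired Skill.

```python
from typing import Callable

Check = Callable[[str], bool]
Model = Callable[[str], str]


def pass_rate(cases: list[tuple[str, Check]], run_model: Model) -> float:
    """Fraction of (prompt, check) cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(run_model(prompt)))
    return passed / len(cases)


def should_retire(cases: list[tuple[str, Check]], with_skill: Model,
                  without_skill: Model, tolerance: float = 0.0) -> bool:
    """Suggest retirement when the bare model scores at least as well as
    the Skill-loaded model, minus an optional tolerance."""
    return pass_rate(cases, without_skill) >= pass_rate(cases, with_skill) - tolerance


# Invented test cases for a hypothetical code-cleanup Skill.
cases = [
    ("Simplify: x = x + 1", lambda out: "+=" in out),
    ("Simplify: if flag == True:", lambda out: "== True" not in out),
]


# Hypothetical stand-ins for real model calls.
def model_with_skill(prompt: str) -> str:
    return "x += 1" if "x + 1" in prompt else "if flag:"


def old_base_model(prompt: str) -> str:
    return prompt  # echoes the input unchanged, so it fails both checks


def new_base_model(prompt: str) -> str:
    return model_with_skill(prompt)  # now matches the Skill's behavior


print(should_retire(cases, model_with_skill, old_base_model))
print(should_retire(cases, model_with_skill, new_base_model))
```

With real models the outputs vary from run to run, so in practice each case would be run several times and a small `tolerance` allowed before calling a Skill redundant.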

The Full Picture of the Lazy Factory

Since we’re here, let me sketch out what the full system looks like right now:

Content Pipeline:
  Notion post → Hugo blog → English translation → SEO optimization
  → Auto-generate social posts → Auto-schedule to FB / IG

Knowledge Pipeline:
  Reading notes → Zettelkasten cards → Anytype
  Daily review → Conversational diary → Anytype

Operations Pipeline:
  GA4 data → Growth strategy → CTR optimization
  Newsletter → Auto-send via ConvertKit
  Podcast → Auto-integration and promotion

All of this is wired together by AI Skills + Python scripts, with three AI tools (Claude Code, GitHub Copilot, Codex) sharing the same knowledge base.
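One simple way to picture this wiring is a list of stage functions that each take a post and return an updated copy. The sketch below is purely illustrative: the stage names, the dictionary fields, and `run_pipeline` are assumptions, and the real stages would call Skills or scripts rather than set placeholder values.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def to_hugo(post: dict) -> dict:
    """Hypothetical stage: mark the post as converted to a Hugo page."""
    return {**post, "format": "hugo"}


def translate_to_english(post: dict) -> dict:
    """Hypothetical stage: record that an English version now exists."""
    return {**post, "languages": post["languages"] + ["en"]}


def add_seo(post: dict) -> dict:
    """Hypothetical stage: attach a placeholder SEO title."""
    return {**post, "seo_title": post["title"][:60]}


def run_pipeline(post: dict, stages: list[Stage]) -> dict:
    """Pass the post through each stage in order and return the final result."""
    for stage in stages:
        post = stage(post)
    return post


content_pipeline = [to_hugo, translate_to_english, add_seo]
draft = {"title": "When AI Learns to Test Its Own Skills",
         "languages": ["zh"], "format": "notion"}
published = run_pipeline(draft, content_pipeline)
print(published)
```

Because every stage returns a new dictionary instead of mutating its input, a single stage can be tested or swapped out on its own, which is the same property that makes individual Skills testable.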

Yes — I’m the kind of person who spends an entire Lunar New Year building systems just to avoid doing things manually.

Very lazy. But my laziness is highly systematic.

The Future: The Line Between Skill and Spec Will Blur

The article closes with a fascinating observation:

As models improve, the line between “Skill” and “Specification” may blur. Today’s SKILL.md is an implementation plan — telling AI exactly how to do something. In the future, a natural language description of what to do might be enough.

I think they’re right.

Right now, each of my SKILL.md files runs hundreds of lines — full of formatting rules, banned phrases, sentence patterns, and example code.

But maybe someday, all I’ll need to write is:

“Write a lifestyle post in Lazy Da’s voice.”

And the AI will just know what to do.

When that day comes, my Skills won’t disappear — they’ll have become the AI’s memory.

[Figure: AI workflow automation diagram]

Lazy Conclusion

Lazy Da

Fintech startup CEO | Independent financial advisor

What Anthropic is doing and what I’m doing are fundamentally the same thing:

Turning AI skills from “seems like it works” to “confirmed to work.”

The only difference is scale — they’re building platform-level tools, I’m building a personal lazy factory.

But the core logic is identical:

Define your workflow (write it as a Skill)
Test whether it works as expected (via eval or lo-fi methods)
Trim as the model improves (retire what’s no longer needed)

AI won’t replace you — but people who use AI will move faster.

And people who test their own AI skills will move a little more steadily, too.


Tags: AI, Automation, Claude Code, Productivity, Lifestyle Tools
