
2026 / 03 / 05
Career & Life

When AI Learns to Test Its Own Skills — So Do I

Starting from Anthropic's Skill Creator eval feature, I share hands-on experience building 19 Claude Code Skills and how to apply test-driven thinking to continuously improve AI workflows.

Lazy Da
3 min read | Updated: 2026-03-05

Anthropic published an article a couple of days ago, titled “Improving Skill Creator: Test, Measure, and Refine Agent Skills.”

I laughed when I finished reading it.

Not because it was funny — but because I’d been doing the exact same thing over the past few months.

They’re doing it at the platform level. I’m doing it at the personal level — evolving my own “lazy factory.”

First, What Did Anthropic Actually Say?

The core message is simple:

AI Skills that “seem to work” aren’t enough. You need to test them, measure them, and keep improving them.

They break Skills into two categories:

| Type | Description | Examples |
| --- | --- | --- |
| Capability-enhancing | Makes AI do things it couldn’t do well before | PDF form filling, document generation |
| Preference-encoding | “Writes” your workflow into the AI | NDA review, weekly report summarization |

Then they introduced an eval system — basically a way to write “exam questions” for your AI skills.

Define your input, describe the expected output, and see if the AI passes.

Sounds like unit testing in software engineering, right?

Exactly — and they say so themselves.

Bringing the rigor of software development (tests, benchmarks, iterative improvement) into skill writing — without needing to write code.
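
To make the "exam question" idea concrete, here is a minimal Python sketch of what an eval boils down to: an input, a check on the output, and a pass/fail result. This is not Anthropic's eval format or API. `EvalCase`, `run_evals`, and the stand-in `fake_weekly_report` skill are hypothetical names made up for illustration, and a real setup would call the model instead of returning a canned string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str                    # the input handed to the skill
    check: Callable[[str], bool]   # True if the output is acceptable


def run_evals(skill: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the skill and record pass/fail per case."""
    return {case.name: case.check(skill(case.prompt)) for case in cases}


# Hypothetical stand-in for a real skill call; it returns a canned answer
# so the sketch runs without any model access.
def fake_weekly_report(prompt: str) -> str:
    return "Summary:\n- Shipped feature A\n- Fixed bug B"


PROMPT = "Summarize this week: shipped A, fixed B"
cases = [
    EvalCase("has_summary_heading", PROMPT,
             lambda out: out.startswith("Summary:")),
    EvalCase("uses_bullets", PROMPT,
             lambda out: all(line.startswith("- ") for line in out.splitlines()[1:])),
]

results = run_evals(fake_weekly_report, cases)
print(results)  # {'has_summary_heading': True, 'uses_bullets': True}
```

The checks here are plain string tests. For a preference-encoding Skill, a check could just as well look for banned phrases or a required section order.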

Funny, I’ve Been Doing the Same Thing

During the Lunar New Year, I did a big overhaul of my workflow and rebuilt my entire AI collaboration system from scratch.

Right now, inside my Claude Code environment, I have 19 custom-built Skills, 12 specialized Agents, and a full cross-tool knowledge sync architecture.

Sounds intense?

It’s really just what happens when a lazy person refuses to do the same thing twice.

My Skills Fall Into Two Categories Too

Looking back at Anthropic’s framework, my Skills map perfectly onto both types:

Capability-enhancing:

  • book-cover-automation: Auto-download and remove backgrounds from book covers
  • translate-blog: Auto-translate Chinese posts to English
  • seo-analysis: SEO data analysis and strategy generation

Preference-encoding (writing my workflow into AI):

  • hugo-content-guide: My writing style and formatting rules
  • commit: Auto-generate Git commit messages
  • daily-review: Daily review → auto-write to Anytype
  • session-end: Auto status check and knowledge extraction when a session ends

The second category is where I’ve invested the most effort.

Because these aren’t about “making AI smarter” — they’re about “making AI become me.”

Testing Isn’t Optional

Anthropic’s article calls out a very real problem:

Most skill authors are domain experts, not engineers. They know their workflows, but lack the tools to verify whether a skill still works correctly.

That hit home.

I’m a CEO from an insurance background, not an engineer. But I’m now managing 19 AI Skills, and every single one affects my content production pipeline.

If translate-blog breaks, my English posts go out with translation errors.

If the tone rules in hugo-content-guide drift, the AI’s writing won’t sound like me.

So I started doing something similar to eval — just in a more lo-fi way:

  1. check-skills: A dedicated Skill for checking the health of all other Skills
  2. sync-skills: Keeps knowledge in sync across Claude Code, Copilot, and Codex
  3. promote-lessons: Reviews knowledge suggestions to prevent config files from bloating indefinitely

Not elegant, but it works.
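
For a sense of what a lo-fi health check can look like, here is a rough Python sketch. It is not the actual check-skills implementation, which this post doesn't show. It assumes each Skill lives in its own folder containing a SKILL.md that opens with a `---` frontmatter block holding `name` and `description`. The `check_skills` and `parse_frontmatter` helpers and the demo folders are hypothetical.

```python
from pathlib import Path
import tempfile

REQUIRED_FIELDS = ("name", "description")  # assumed frontmatter fields


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse simple 'key: value' lines between a leading pair of '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing '---': treat the file as malformed


def check_skills(root: Path) -> dict[str, list[str]]:
    """Return {skill_folder: [problems]} for every subfolder of root."""
    report: dict[str, list[str]] = {}
    for skill_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        problems: list[str] = []
        skill_file = skill_dir / "SKILL.md"
        if not skill_file.is_file():
            problems.append("missing SKILL.md")
        else:
            fields = parse_frontmatter(skill_file.read_text(encoding="utf-8"))
            problems += [f"missing field: {f}" for f in REQUIRED_FIELDS
                         if not fields.get(f)]
        report[skill_dir.name] = problems
    return report


# Demo on a throwaway folder: one healthy Skill, one missing a field,
# and one folder with no SKILL.md at all.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "commit").mkdir()
    (root / "commit" / "SKILL.md").write_text(
        "---\nname: commit\ndescription: Write commit messages\n---\nBody\n",
        encoding="utf-8")
    (root / "daily-review").mkdir()
    (root / "daily-review" / "SKILL.md").write_text(
        "---\nname: daily-review\n---\nBody\n", encoding="utf-8")
    (root / "broken").mkdir()
    report = check_skills(root)

print(report)
```

A structural check like this only catches Skills that are missing or malformed. It says nothing about whether the output still sounds right, which is where eval-style cases come in.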

When a Model Improves, Should Your Skill Retire?

One part of the article really resonated with me:

If the base model starts passing your eval without the Skill, that means the technique has been absorbed by the model. The Skill isn’t broken — it’s just no longer needed.

This matches my experience exactly.

I’ve already retired 3 Skills:

  • canva-cover-update
  • code-simplifier
  • content-writing

Not because they were poorly written — but because the model itself improved and no longer needed the extra prompting.

And honestly? That’s a good thing.

It means your automation system is alive — it self-simplifies as AI evolves.
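
The retirement rule can be sketched as a comparison of two pass rates: the same eval cases run once with the Skill and once without it. This is a minimal illustration, not a method from Anthropic's article. The `should_retire` helper and its `margin` tolerance are assumptions made up for this example.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return sum(results) / len(results)


def should_retire(with_skill: list[bool], without_skill: list[bool],
                  margin: float = 0.05) -> bool:
    """Suggest retiring when the bare model scores within `margin` of the
    Skill-assisted pass rate (margin is an assumed tolerance)."""
    if not with_skill or not without_skill:
        raise ValueError("need at least one eval result for each run")
    return pass_rate(without_skill) >= pass_rate(with_skill) - margin


# Bare model passes everything the Skill-assisted run passes: retire.
print(should_retire([True] * 10, [True] * 10))                 # True

# Bare model passes only 6 of 10: the Skill is still earning its keep.
print(should_retire([True] * 10, [True] * 6 + [False] * 4))    # False
```

The margin exists because eval results on a model are noisy, so a Skill that adds only a point or two may not be worth maintaining.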

The Full Picture of the Lazy Factory

Since we’re here, let me sketch out what the full system looks like right now:

Content Pipeline:
  Notion post → Hugo blog → English translation → SEO optimization
  → Auto-generate social posts → Auto-schedule to FB / IG

Knowledge Pipeline:
  Reading notes → Zettelkasten cards → Anytype
  Daily review → Conversational diary → Anytype

Operations Pipeline:
  GA4 data → Growth strategy → CTR optimization
  Newsletter → Auto-send via ConvertKit
  Podcast → Auto-integration and promotion

All of this is wired together by AI Skills + Python scripts, with three AI tools (Claude Code, GitHub Copilot, Codex) sharing the same knowledge base.

Yes — I’m the kind of person who spends an entire Lunar New Year building systems just to avoid doing things manually.

Very lazy. But my laziness is highly systematic.

The Future: The Line Between Skill and Spec Will Blur

The article closes with a fascinating observation:

As models improve, the line between “Skill” and “Specification” may blur. Today’s SKILL.md is an implementation plan — telling AI exactly how to do something. In the future, a natural language description of what to do might be enough.

I think they’re right.

Right now, each of my SKILL.md files runs hundreds of lines — full of formatting rules, banned phrases, sentence patterns, and example code.

But maybe someday, all I’ll need to write is:

“Write a lifestyle post in Lazy Da’s voice.”

And the AI will just know what to do.

When that day comes, my Skills won’t disappear — they’ll have become the AI’s memory.


[Figure: AI workflow automation diagram]




Lazy Conclusion


What Anthropic is doing and what I’m doing are fundamentally the same thing:

Turning AI skills from “seems like it works” to “confirmed to work.”

The only difference is scale: they’re building platform-level tools; I’m building a personal lazy factory.

But the core logic is identical:

  1. Define your workflow (write it as a Skill)
  2. Test whether it works as expected (via eval or lo-fi methods)
  3. Trim as the model improves (retire what’s no longer needed)

AI won’t replace you — but people who use AI will move faster.

And people who test their own AI skills will move a little more steadily, too.

Tags: AI, Automation, Claude Code, Productivity, Life Tools
