<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.albertsikkema.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.albertsikkema.com/" rel="alternate" type="text/html" hreflang="en-US" /><updated>2026-05-12T11:46:44+00:00</updated><id>https://www.albertsikkema.com/feed.xml</id><title type="html">Albert Sikkema - Building Production AI Systems</title><subtitle>Production-ready AI implementation, software engineering best practices, and enterprise AI systems development. Building scalable AI solutions with Claude, OpenAI, and engineering discipline for enterprise and government.</subtitle><author><name>Albert Sikkema</name></author><entry><title type="html">Why I Shrunk Claude Code’s Context Window Back to 200k</title><link href="https://www.albertsikkema.com/ai/development/tools/2026/04/23/smaller-context-window-better-claude-code.html" rel="alternate" type="text/html" title="Why I Shrunk Claude Code’s Context Window Back to 200k" /><published>2026-04-23T00:00:00+00:00</published><updated>2026-04-23T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/tools/2026/04/23/smaller-context-window-better-claude-code</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/tools/2026/04/23/smaller-context-window-better-claude-code.html"><![CDATA[<figure>
  <img src="/assets/images/context-window-rain-glass.jpg" alt="Rain-covered glass with blurred warm lights behind, signal obscured by noise" width="1920" height="1078" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>Signal obscured by noise. Photo by <a href="https://unsplash.com/@c_g_">c g</a> on <a href="https://unsplash.com">Unsplash</a></figcaption>
</figure>

<p>This morning I watched a video on context window management in Claude Code as part of my daily “keep up with what is happening in the LLM space” routine. Good content, solid diagnosis of the problem. But everything in it was about manual interventions: trigger compaction at the right moment, use structured handoffs, rewind instead of correcting. All valid techniques. But there is a deeper issue, and two settings buried in the documentation may solve most of it.</p>

<h2 id="the-problem-with-more-room">The Problem With More Room</h2>

<p>Since Opus 4.6, Claude Code ships with a default <a href="https://claude.com/blog/1m-context-ga">1M token context window</a>. Five times the 200k window of earlier models. Sounds like a pure upgrade. Great!</p>

<p>It is not. The single most important thing for working with LLMs is context management: keep it as small as possible with as relevant info as possible and nothing more than that. Anthropic’s <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">engineering team</a> says you should be “striving for the minimal set of information that fully outlines your expected behavior.” Their <a href="https://platform.claude.com/docs/en/build-with-claude/context-windows">documentation</a> acknowledges that “as token count grows, accuracy and recall degrade, a phenomenon known as context rot.” The root cause is the n-squared attention mechanism: double the context, quadruple the number of pairwise relationships the model has to track.</p>
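<p>To make that scaling concrete, here is a back-of-the-envelope sketch in Python. It assumes plain full self-attention, where every token is compared against every other token; production models use optimizations, so treat the numbers as an illustration, not a measurement. The function name is mine.</p>

```python
def attention_pairs(tokens: int) -> int:
    """Token-to-token comparisons in plain full self-attention: n * n."""
    return tokens * tokens


small = attention_pairs(200_000)
large = attention_pairs(1_000_000)

# Doubling the context quadruples the pair count.
print(attention_pairs(400_000) // small)  # 4
# Five times the context means twenty-five times the pairs.
print(large // small)  # 25
```

<p>So the jump from 200k to 1M is not five times more work for the model to keep straight, but roughly twenty-five times under this simplified view.</p>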

<p>I am not alone in experiencing this. Some quick research finds similar sentiment: <a href="https://simonwillison.net/2025/Jan/26/paul-gauthier/">Paul Gauthier</a>, the creator of Aider, found that “every model seems to get confused when you feed them more than ~25-30k tokens.” He calls it the number one problem his users report. <a href="https://blog.jetbrains.com/research/2025/12/efficient-context-management/">JetBrains Research</a> tested observation masking (hiding old tool outputs) and found a 52% cost reduction while <em>boosting</em> solve rates by 2.6%. Less context, better results. The <a href="https://eval.16x.engineer/blog/llm-context-management-guide">NoLiMa benchmark</a> found that 11 of 12 tested models dropped below 50% of their short-context performance at just 32k tokens. Not 200k. Not 1M. 32 thousand.</p>

<p>In practice this means: more hallucinations, forgotten instructions, goal drift, inconsistent decisions. Not at 900k tokens. Much, much earlier.</p>

<h2 id="what-i-keep-seeing">What I Keep Seeing</h2>

<p>A lot of the advice I come across focuses on manual interventions. Trigger compaction yourself at the right moment. Use a new session or <code class="language-plaintext highlighter-rouge">/clear</code> when switching tasks. Save state to a JSON file before clearing. Ask Claude for periodic summaries. Use sub-agents to keep intermediate work out of your main context.</p>

<p>These are all valid. I use sub-agents heavily (they get their own fresh context window, which is <a href="https://www.morphllm.com/context-rot">the single most effective architectural pattern</a> for avoiding context rot) and <code class="language-plaintext highlighter-rouge">/clear</code> between unrelated tasks. But manual interventions are workarounds for a window that is too large, not fixes for the underlying problem. They require you to watch your context usage while trying to get work done. That is overhead the tooling should handle.</p>

<p><a href="https://tessl.io/blog/amp-retires-compaction-for-a-cleaner-handoff-in-the-coding-agent-context-race">Amp</a> went further: they dropped compaction entirely and designed around short threads with clean handoffs. Their senior engineer Dan Mac put it bluntly: “You should basically never use compaction.”</p>

<h2 id="the-simpler-fix">The Simpler Fix</h2>

<p>Two <a href="https://code.claude.com/docs/en/env-vars">environment variables</a> solve this without ongoing attention:</p>

<p><strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_1M_CONTEXT</code></strong> set to <code class="language-plaintext highlighter-rouge">1</code> caps the context window back to 200k tokens. It removes the 1M model variants from the <a href="https://code.claude.com/docs/en/model-config#extended-context">model picker</a> entirely.</p>

<p><strong><code class="language-plaintext highlighter-rouge">CLAUDE_AUTOCOMPACT_PCT_OVERRIDE</code></strong> set to a value between 1 and 100 controls when auto-compaction triggers, as a percentage of context capacity. The <a href="https://code.claude.com/docs/en/how-claude-code-works">default is around 95%</a>, which means on a 1M window, compaction does not kick in until you are at 950k tokens. That is way past the point where quality has degraded.</p>

<p>Set both in your project or user settings:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_DISABLE_1M_CONTEXT"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_AUTOCOMPACT_PCT_OVERRIDE"</span><span class="p">:</span><span class="w"> </span><span class="s2">"70"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>200k window at 70% threshold means compaction triggers around 140k tokens. Well before quality drops off. Compaction runs more frequently but with less context to summarize, which means better summaries. <a href="https://x.com/karpathy/status/1937902205765607626">Andrej Karpathy</a> described context engineering as “filling the context window with just the right information for the next step.” Too much, and “performance might come down.” A constrained window forces that discipline automatically.</p>

<h2 id="the-cost-angle">The Cost Angle</h2>

<p>Every turn in Claude Code sends the full conversation context to Anthropic’s servers. If your context window is sitting at 600k tokens of accumulated tool output, file reads, and old conversation, all of that gets re-sent and re-billed on the next message.</p>

<p>With <a href="https://platform.claude.com/docs/en/about-claude/pricing">Opus 4.6 at $5 per million input tokens</a>, a 600k context costs close to $3 per turn for input alone (not exactly $3, because caching lowers the bill). A 140k context (right before compaction) theoretically costs $0.70. Over a long session with dozens of turns, that difference adds up. One user on <a href="https://news.ycombinator.com/item?id=47580395">Hacker News</a> described burning through $100 in credit when Opus 4.6 “got stuck in a bullshit reasoning loop.” So a smaller window is both better for quality and cheaper.</p>
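<p>The arithmetic behind those numbers, as a small Python sketch. It deliberately ignores prompt caching and output tokens, so it is an upper bound on input cost rather than a prediction of your bill. The 40-turn session length and the function names are my own assumptions.</p>

```python
PRICE_PER_MILLION_INPUT_USD = 5.00  # Opus 4.6 input price quoted above


def input_cost_usd(context_tokens: int) -> float:
    """Uncached input cost of sending the whole context once."""
    return context_tokens * PRICE_PER_MILLION_INPUT_USD / 1_000_000


def session_cost_usd(context_tokens: int, turns: int) -> float:
    """Upper bound for a session that sits at this context size every turn."""
    return input_cost_usd(context_tokens) * turns


for tokens in (600_000, 140_000):
    per_turn = input_cost_usd(tokens)
    per_session = session_cost_usd(tokens, turns=40)  # 40 turns is an assumption
    print(f"{tokens:,} tokens: ${per_turn:.2f} per turn, ${per_session:.2f} per 40 turns")
```

<p>Without caching that works out to $3.00 versus $0.70 per turn. Caching narrows the gap in practice, but the ratio between the two context sizes stays the same.</p>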

<h2 id="compaction-is-not-a-safety-net">Compaction Is Not a Safety Net</h2>

<p>The standard advice is to rely on compaction to keep your context clean. And compaction works, sort of and sometimes. It summarizes the conversation to free up space. But it is lossy. The model decides what matters and what gets dropped, and its judgment is not always yours. Often I feel like I have to start over again with the nuances of the problem we were working on.</p>

<p>Another problem is what happens <em>after</em> compaction. Yesterday I was in a session, working on changes across a repo. Auto-compaction kicked in mid-task. First thing Claude did after compacting: committed everything. Without being asked. It lost enough context to forget that I had not asked for a commit, saw uncommitted files and went ahead.</p>

<p><a href="https://claude.com/blog/using-claude-code-session-management-and-1m-context">Thariq Shihipar</a> from the Claude Code team recommends compacting proactively at 50-60% capacity instead of waiting for auto-compaction. Good advice. But if you constrain your window to 200k and set the threshold to 70%, you get roughly the same effect automatically. No need to watch your token count and manually trigger <code class="language-plaintext highlighter-rouge">/compact</code> at the right moment.</p>

<p>There is another upside to the smaller window: compaction itself gets better. If compaction triggers at 70% and reduces back to roughly 30%, the 1M window has to summarize away 400,000 tokens of conversation. The 200k window only discards 80,000. Five times less information to lose. Smaller contexts lead to more accurate summaries. (For the information theory purists out there: I am aware that it is more complicated than this, please forgive my shortcuts.)</p>
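<p>For anyone who wants to plug in their own numbers, here is the same calculation as a Python sketch. The 30% post-compaction level is the rough figure used above, not a documented value, and both function names are hypothetical.</p>

```python
def compaction_trigger(window: int, trigger_pct: int) -> int:
    """Token count at which auto-compaction kicks in."""
    return window * trigger_pct // 100


def tokens_discarded(window: int, trigger_pct: int, after_pct: int) -> int:
    """Tokens summarized away when compaction shrinks the context."""
    return window * (trigger_pct - after_pct) // 100


print(compaction_trigger(200_000, 70))    # 140000
print(compaction_trigger(1_000_000, 95))  # 950000

big = tokens_discarded(1_000_000, trigger_pct=70, after_pct=30)
small = tokens_discarded(200_000, trigger_pct=70, after_pct=30)
print(big, small, big // small)  # 400000 80000 5
```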

<h2 id="fresh-sessions-not-long-ones">Fresh Sessions, Not Long Ones</h2>

<p>The env vars help within a session. But the bigger win is avoiding compaction by not having long sessions in the first place.</p>

<p>My workflow separates every phase into its own session. Research runs in one session, planning in another, building in a third. Never chained together in the same conversation. I <a href="/AI/LLM/development/productivity/2025/11/21/orchestrator-automating-claude-code-workflows.html">built an orchestrator</a> that does this automatically: each step launches a separate Claude Code instance. The whole rationale is context isolation. Each phase starts clean with only the information needed to start that session.</p>
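<p>A minimal sketch of that idea in Python, not the orchestrator itself. It assumes the <code class="language-plaintext highlighter-rouge">claude</code> CLI is installed and that its <code class="language-plaintext highlighter-rouge">-p</code> flag runs a single non-interactive prompt. The phase prompts and file names are hypothetical. The runner is passed in as an argument so the phase logic can be tried without launching anything.</p>

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical phase prompts; each one points at a handoff file, not at chat history.
PHASES = [
    ("research", "Research the task in TASK.md and write findings to research.md"),
    ("plan", "Read research.md and write an implementation plan to plan.md"),
    ("build", "Implement plan.md step by step"),
]


def build_command(prompt: str) -> list[str]:
    """One non-interactive Claude Code run; assumes the claude CLI and its -p flag."""
    return ["claude", "-p", prompt]


def run_phases(
    phases: Sequence[tuple[str, str]],
    runner: Callable[[list[str]], int],
) -> list[str]:
    """Run each phase as its own process and stop at the first failure."""
    completed = []
    for name, prompt in phases:
        if runner(build_command(prompt)) != 0:
            break
        completed.append(name)
    return completed


def subprocess_runner(command: list[str]) -> int:
    """Real runner: a fresh process means a fresh context window."""
    return subprocess.run(command, check=False).returncode
```

<p>Calling <code class="language-plaintext highlighter-rouge">run_phases(PHASES, subprocess_runner)</code> would launch three separate instances in sequence. The only thing that crosses a phase boundary is what the previous phase wrote to disk.</p>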

<p>This is the same principle behind sub-agents, just at a larger scale. When Claude Code spawns a sub-agent, that agent gets its own fresh context window. All intermediate work (file reads, grep output, failed attempts) stays in the sub-agent’s context. Only the final result comes back. <a href="https://www.morphllm.com/context-rot">Morph’s research</a> found a 90% performance gain using sub-agent architecture over single-agent. The reason is straightforward: every file read and tool call that stays out of your main context is noise that never competes for attention.</p>

<h2 id="the-counterintuitive-takeaway">The Counterintuitive Takeaway</h2>

<p>The 1M context window is a capacity increase, not a quality increase. More room means more space for noise, higher bills, and worse output once you cross the degradation threshold. Steve Smith <a href="https://blog.nimblepros.com/blogs/context-windows-wont-grow-forever/">calls it</a> “a huge junk drawer.” Glen Rhodes <a href="https://glenrhodes.com/context-window-management-treating-llm-context-as-working-memory-not-unlimited-storage/">describes context</a> as working memory, not storage, and argues you should treat it like RAM on a constrained system: deliberate about what gets loaded, suspicious of anything that lingers.</p>

<p>The best results I get from Claude Code come from keeping the window small, avoiding compaction altogether, and never letting one phase pollute the next. Two environment variables and a habit of starting fresh. That is the whole trick.</p>

<hr />

<p><em>Running into context issues or managing your own Claude Code setup? <a href="#" onclick="task1(); return false;">Get in touch</a> to compare notes.</em></p>

<h2 id="resources">Resources</h2>

<h3 id="claude-code-documentation">Claude Code documentation</h3>

<ul>
  <li><a href="https://claude.com/blog/using-claude-code-session-management-and-1m-context">Using Claude Code: session management and 1M context</a> – Thariq Shihipar’s practical guide to context management</li>
  <li><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">Effective context engineering for AI agents</a> – Anthropic’s engineering blog on context rot and mitigation</li>
  <li><a href="https://code.claude.com/docs/en/env-vars">Claude Code environment variables</a> – Official docs including the two env vars discussed here</li>
  <li><a href="https://code.claude.com/docs/en/context-window">Explore the context window</a> – Interactive simulation of how context fills during a session</li>
  <li><a href="https://platform.claude.com/docs/en/build-with-claude/context-windows">Context windows API docs</a> – Model-by-model context sizes and Anthropic’s acknowledgment of context rot</li>
</ul>

<h3 id="research-and-analysis">Research and analysis</h3>

<ul>
  <li><a href="https://arxiv.org/abs/2307.03172">Lost in the Middle (Liu et al., 2023)</a> – The foundational research on performance degradation in long contexts</li>
  <li><a href="https://www.trychroma.com/research/context-rot">Context Rot research by Chroma</a> – Evaluation of 18 LLMs showing universal degradation with input length</li>
  <li><a href="https://blog.jetbrains.com/research/2025/12/efficient-context-management/">JetBrains: Smarter Context Management for Agents</a> – Observation masking: less context, better results</li>
  <li><a href="https://www.morphllm.com/context-rot">Morph: Context Rot complete guide</a> – Sub-agent architecture and agent-specific context data</li>
  <li><a href="https://gist.github.com/badlogic/cd2ef65b0697c4dbe2d13fbecb0a0a5f">Compaction research across coding tools</a> – Claude Code, Codex CLI, OpenCode, Amp compared</li>
</ul>

<h3 id="developer-perspectives">Developer perspectives</h3>

<ul>
  <li><a href="https://simonwillison.net/2025/Jan/26/paul-gauthier/">Paul Gauthier on practical context limits</a> – Aider creator: models get confused above 25-30k tokens</li>
  <li><a href="https://x.com/karpathy/status/1937902205765607626">Karpathy on context engineering</a> – “Too much or too irrelevant, and performance might come down”</li>
  <li><a href="https://tessl.io/blog/amp-retires-compaction-for-a-cleaner-handoff-in-the-coding-agent-context-race">Amp drops compaction for handoff</a> – Why one coding tool designed around short threads</li>
  <li><a href="https://blog.nimblepros.com/blogs/context-windows-wont-grow-forever/">Why Context Windows Won’t Keep Growing Forever</a> – Steve Smith on diminishing returns and the junk drawer effect</li>
  <li><a href="https://glenrhodes.com/context-window-management-treating-llm-context-as-working-memory-not-unlimited-storage/">Context as working memory, not storage</a> – Glen Rhodes on treating context like constrained RAM</li>
</ul>

<h3 id="related-posts">Related posts</h3>

<ul>
  <li><a href="/AI/LLM/development/productivity/2025/11/21/orchestrator-automating-claude-code-workflows.html">The Orchestrator: Automating Full Claude Code Workflows</a> – Each phase in its own Claude Code instance</li>
  <li><a href="/ai/development/automation/2026/04/17/automated-builds-cost-fatigue-ceiling.html">Fully Automated LLM Builds: Where It Actually Stops</a> – Token costs as a ceiling on automation</li>
  <li><a href="/ai/development/tools/2026/04/09/gtk-cutting-llm-token-costs-cli-output.html">gtk: Filtering CLI Noise to Save Tokens</a> – Reducing what goes into context in the first place</li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="tools" /><summary type="html"><![CDATA[The 1M context window in Claude Code sounds like an upgrade. In practice, constraining it to 200k with early compaction produces better results and lower costs.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/smaller-context-window-better-claude-code-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/smaller-context-window-better-claude-code-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">What MCP’s Future Means for API Design</title><link href="https://www.albertsikkema.com/ai/development/mcp/2026/04/20/from-talk-to-practice-mcp-future-api-design.html" rel="alternate" type="text/html" title="What MCP’s Future Means for API Design" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/mcp/2026/04/20/from-talk-to-practice-mcp-future-api-design</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/mcp/2026/04/20/from-talk-to-practice-mcp-future-api-design.html"><![CDATA[<figure>
  <img src="/assets/images/mcp-future-api-design.jpg" alt="Misty Scottish highland landscape with winding path through moorland" />
  <figcaption>Photo by <a href="https://unsplash.com/@martinbennie">Martin Bennie</a> on <a href="https://unsplash.com">Unsplash</a></figcaption>
</figure>

<p>This weekend I built a small CLI tool that pulls transcripts, comments, and metadata from YouTube videos. The first thing I fed it was David Soria Parra’s keynote <a href="https://www.youtube.com/watch?v=v3Fr2JR47KA">“The Future of MCP”</a> at the AI Engineer conference. David wrote the original Python MCP SDK at Anthropic, so he knows where the protocol is heading. I dove into MCP about 9 months ago as part of the exploratory phase of a government job. A lot has happened since, and there is a lot on the roadmap. I discussed the talk with an LLM, asking about REST’s role, about playbooks that instruct models how to compose tools, and about the similarities with what I have been building.</p>

<h2 id="what-david-laid-out">What David Laid Out</h2>

<p>The short version: 2025 was about coding agents (local, sandboxed, verifiable). 2026 is about general knowledge workers who need connectivity to five SaaS apps and a shared drive, not a local compiler. He sees three layers for this: skills (domain knowledge in files), <a href="https://modelcontextprotocol.io/">MCP</a> (rich semantics, auth, governance, long-running tasks), and CLI/computer use (great when the tool is already in pre-training data like git or gh). The best agents will use all three.</p>

<p>Three things he wants the ecosystem to fix: <strong>progressive discovery</strong> (stop dumping all tools into context, load them on demand), <strong>programmatic tool calling</strong> (give the model an execution environment to compose multiple calls in one script instead of round-tripping one by one, like <a href="https://blog.cloudflare.com/code-mode-mcp/">Cloudflare’s Code Mode</a> does), and <strong>designing for agents, not REST</strong> (stop mapping REST endpoints 1:1 into MCP servers, he called conversion tools “cringe”).</p>
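<p>To illustrate the first of those points, here is a toy sketch of progressive discovery in Python. It is not the mechanism MCP specifies; the catalog, the tool names, and both functions are hypothetical. The idea is only that the model sees one short line per tool up front and pulls the full description when it actually needs it.</p>

```python
# Hypothetical tool catalog: short summary up front, full description on demand.
CATALOG = {
    "search_logs": {
        "summary": "Search log lines by query",
        "details": "Arguments: query (str), start (ISO time), end (ISO time).",
    },
    "error_breakdown": {
        "summary": "Group errors by category",
        "details": "Arguments: dataset (str), window_minutes (int).",
    },
}


def list_tools() -> list[str]:
    """What goes into the context at the start: one short line per tool."""
    return [f"{name}: {entry['summary']}" for name, entry in CATALOG.items()]


def describe_tool(name: str) -> str:
    """Loaded only when the model decides it needs this tool."""
    return CATALOG[name]["details"]


print(list_tools())
print(describe_tool("error_breakdown"))
```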

<p>The upcoming features add a lot, but the new feature I care about most is skills over MCP: servers shipping domain knowledge alongside their tools. More on that below. The protocol is barely 18 months old and already has 110 million monthly downloads (reaching that number roughly twice as fast as React did), which shows how popular it is.</p>

<h2 id="i-already-built-this">I Already Built This</h2>

<p>The part that clicked with me was “skills over MCP.” David described it as an upcoming protocol primitive: servers should ship playbooks alongside their tools, instructing the model how to combine them for specific tasks. The server author maintains the playbooks, not the user. When workflows change, the server updates its skills and every connected agent gets the new instructions automatically.</p>

<p>I have been doing exactly this in <a href="/ai/development/operations/2026/04/16/when-llms-actually-deliver.html">logbench</a>, the MCP server I built for querying our Axiom logs. Tools like <code class="language-plaintext highlighter-rouge">explore_dataset</code> don’t return raw data. They return step-by-step instructions: “first get the schema, then run an error breakdown, then drill into the top categories.” The model picks the right workflow tool, gets the recipe, follows it. (Not entirely my idea: I did something similar a long time ago, but this step was inspired by Axiom’s official MCP code.)</p>

<p>It works well. No context bloat because the playbook only loads when the model calls that specific tool. Progressive discovery is built in for free. And the instructions are scoped to the task at hand, not a generic “here are all the things you could do.”</p>
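<p>Stripped of the MCP wiring, the pattern is roughly the following Python sketch. This is not the logbench code: the step wording and the tool names it refers to (<code class="language-plaintext highlighter-rouge">get_schema</code>, <code class="language-plaintext highlighter-rouge">error_breakdown</code>, <code class="language-plaintext highlighter-rouge">search_logs</code>) are made up for illustration. In a real server this function would be registered as a tool with whichever MCP SDK you use.</p>

```python
def explore_dataset(dataset: str) -> str:
    """Workflow tool: returns a recipe for the model instead of raw log data."""
    steps = [
        f"Call get_schema for dataset '{dataset}'.",
        f"Run error_breakdown on '{dataset}'.",
        "Drill into the top error categories with search_logs.",
    ]
    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(numbered)


print(explore_dataset("api-logs"))
```

<p>The tool result is plain text the model reads as its next instructions, which is why it costs nothing in context until the tool is actually called.</p>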

<h2 id="skills-over-mcp-the-distribution-angle">Skills Over MCP: The Distribution Angle</h2>

<p>Before I get to why formalization worries me, there is one part of skills over MCP that is genuinely exciting: distribution.</p>

<p>Right now, if I want my colleagues to use the logbench playbooks, they need my exact MCP server setup. If I want to share a highly specialized tool behind an auth wall or a paywall, there is no standard way to do that. Skills over MCP solves this. Your team connects to the same MCP server and everyone gets the same playbooks, updated by the server author, no local configuration needed. A specialized log analysis skill, a compliance checking workflow, a financial reporting recipe: all distributed through the same protocol, access controlled at the server level.</p>

<p>That is a real improvement over “copy this markdown file into your project.” It means you can build tools that are genuinely sharable across teams, organizations, even commercially. The distribution story is strong.</p>

<h2 id="the-boring-toolset-problem">The Boring Toolset Problem</h2>

<p>But here is where I am less enthusiastic. What I see happening with MCP is the same thing that happens to every successful protocol: it moves from “wow, that is cool” to the inevitable boring enterprise toolset, the same kind of <a href="/ai/development/2026/02/13/let-the-ai-pick-react.html">standardization convergence</a> I wrote about with React. Mediocre-good-for-all, mostly optimized for large organizations with compliance requirements, not for developers who want to push boundaries.</p>

<p>My concern is not that the formalization itself will limit what I can do. It probably will not. My concern is what happens to developers along the way. When I built the playbook pattern in logbench, I understood exactly what was happening: a tool returns instructions, the model follows them. I learned how to engage with the model, how to structure instructions it would follow reliably, what worked and what did not. That understanding came from building it myself, from prodding and experimenting. Once that becomes a protocol primitive you just consume, the experimentation stops. You get a standard way to do it, and most developers will never look underneath.</p>

<p>That is how we lose the skill of working with LLMs directly. Not because the abstractions are bad, but because they are comfortable. People stop experimenting with how to instruct models, how to structure tool interactions, how to design playbooks that actually work. They use the MCP skills primitive because it is there, and they never discover new patterns that only emerge when you build from scratch.</p>

<p>MCP itself is open source now (Anthropic <a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation">donated it to the Linux Foundation</a> as part of the Agentic AI Foundation), which is good. But the ecosystem it lives inside is moving in a direction I like less. Claude Code started as a developer-focused tool, the kind of thing where you could <a href="/AI/development/productivity/python/2026/01/13/rethinking-claude-flow-from-per-repo-chaos-to-global-app.html">wire up your own workflows</a> and push the boundaries. Increasingly it is becoming a fits-all product, and the pricing reflects that. The whole Claude Code environment is powerful, but it is also an ecosystem that wants you to stay inside it.</p>

<p>Which is why I think it is worth looking at what exists outside. <a href="https://www.pi.dev/">Pi</a> is one framework worth trying for agentic use, approaching connectivity differently from MCP. There are more options than the one path Anthropic is paving, and the best time to explore them is now, while the patterns are still forming and nothing is locked in.</p>

<h2 id="what-happens-to-rest">What Happens to REST?</h2>

<p>This is the question I kept coming back to. As more and more interaction moves through agents, and agents interact through MCP, what happens to the classic API?</p>

<p>REST is not going anywhere as plumbing. You cannot have MCP without REST (or something like it) underneath. The comments on the talk pushed back hard on this point, and they are right: MCP is essentially “discoverable REST,” and the move toward stateless transport is literally re-converging toward REST patterns.</p>

<p>But I wonder about the trajectory: right now we have API-centered systems and we are moving toward API + MCP. Will that become MCP-centered? Eventually MCP-only for some use cases? And if so, what happens to the APIs that remain?</p>

<p>I think they change character. The classic REST API is a developer’s tool: full CRUD, every resource exposed, every operation available, everything must be in there. The MCP model is different: only those operations that add value, combined into higher-level actions when that makes sense. Less “here are all the building blocks” and more “here is what you can actually do.”</p>

<p>That is not a developer-centered design: it is a human-centered design, or an LLM-centered design, which turn out to be surprisingly similar. And if agents become the primary consumers of APIs, the APIs that stick around will probably start looking more like MCP tools than like the CRUD interfaces we build today. Fewer granular endpoints, more intent-oriented operations. Domain knowledge shipped alongside the API, not buried in documentation.</p>

<p>Or put differently: if your API is so granular that you need a playbook to use it, perhaps the API itself should be the playbook.</p>

<h2 id="what-i-took-away">What I Took Away</h2>

<p>Watch the <a href="https://www.youtube.com/watch?v=v3Fr2JR47KA">full talk</a> if you work with MCP or build agent tooling. It is 18 minutes and dense with where the protocol is heading.</p>

<p>My takeaways, for what they are worth:</p>

<ul>
  <li>The playbook-as-a-tool pattern works today, no protocol extension needed. If you are building MCP servers, try it before waiting for the formalized version.</li>
  <li>Progressive discovery is not optional at scale. If you dump 50 tools into the context window, you are doing it wrong.</li>
  <li>MCP is a good protocol, but keep building things yourself too. The understanding you get from direct experimentation with LLMs is worth more than any abstraction.</li>
  <li>Look beyond Anthropic’s ecosystem. Try <a href="https://www.pi.dev/">Pi</a>, try building without MCP, see what works. The best patterns come from exploration, not from consuming frameworks.</li>
  <li>As agents become primary data consumers, the APIs themselves will lose importance and start looking more like MCP tools: fewer CRUD endpoints, more intent-oriented operations.</li>
</ul>

<hr />

<p><em>Building MCP servers or thinking about API design for agents? <a href="#" onclick="task1(); return false;">Get in touch</a> to compare notes.</em></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://www.youtube.com/watch?v=v3Fr2JR47KA">The Future of MCP – David Soria Parra keynote</a> at AI Engineer conference</li>
  <li><a href="https://modelcontextprotocol.io/">Model Context Protocol specification</a> – the official MCP spec and documentation</li>
  <li><a href="https://github.com/jlowin/fastmcp">FastMCP</a> – the Python SDK David called “way better” than the official one</li>
  <li><a href="https://github.com/cloudflare/mcp">Cloudflare MCP server</a> and their <a href="https://blog.cloudflare.com/code-mode-mcp/">Code Mode blog post</a> – example of exposing an execution environment instead of individual tools</li>
  <li><a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation">Agentic AI Foundation announcement</a> – Anthropic donating MCP to the Linux Foundation</li>
  <li><a href="https://github.com/dsp">David Soria Parra on GitHub</a></li>
  <li><a href="/ai/development/operations/2026/04/16/when-llms-actually-deliver.html">When LLMs Actually Deliver</a> – my earlier post on logbench and the playbook pattern</li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="mcp" /><summary type="html"><![CDATA[Watching a talk on MCP's future, I realized I already built the pattern they are formalizing. And it raises a bigger question about how we design APIs.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/from-talk-to-practice-mcp-future-api-design-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/from-talk-to-practice-mcp-future-api-design-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Fully Automated LLM Builds: Where It Actually Stops</title><link href="https://www.albertsikkema.com/ai/development/automation/2026/04/17/automated-builds-cost-fatigue-ceiling.html" rel="alternate" type="text/html" title="Fully Automated LLM Builds: Where It Actually Stops" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/automation/2026/04/17/automated-builds-cost-fatigue-ceiling</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/automation/2026/04/17/automated-builds-cost-fatigue-ceiling.html"><![CDATA[<figure>
  <img src="/assets/images/automated-builds-machinery.jpg" alt="Close-up of industrial machinery with interlocking gears, chains, belts, and pulleys" width="1920" height="1280" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>A lot of moving parts that have to mesh. Photo by <a href="https://unsplash.com/@kiwihug">Kiwihug</a> on <a href="https://unsplash.com/photos/Hld-BtdRdPU">Unsplash</a>.</figcaption>
</figure>

<p>For the last year or so I have been automating my software build workflow, handing more and more of the actual development work over to LLMs and watching where it breaks. The builds mostly run themselves now with good quality code, good enough that I can hand off that work and check back an hour later on the progress (or react to the Telegram message telling me a PR is waiting).</p>

<p>This is the point I was aiming for when I started iterating on automated LLM-driven development a year ago. There were quite a few steps in between: different levels of automation, different tools, starting all over again, improving, and so on.</p>

<h2 id="what-i-found-to-be-bottlenecks">What I Found to Be Bottlenecks</h2>

<p>Three things I suspected but could only experience and prove by running the loop:</p>

<p><strong>You cannot skip the human review.</strong> Left alone long enough the agent will drift. Possibly not in the first couple of PRs. But somewhere between 5 and 10 it will make a decision that looks locally correct and is globally wrong, and every PR after that builds on the drift. No amount of prompt engineering fixes this (and I tried a lot of methods). I wrote about the <a href="/ai/llm/development/best-practices/2025/11/14/human-in-the-loop-ai-code-review.html">same pattern with meta-tests</a> last November and it is still true, just with a bigger blast radius now that the loop is tighter.</p>

<p><strong>Work item size matters.</strong> Too small and you burn tokens spinning up six agents to change a button color. Too large and the model cannot hold the whole thing in its usable context, so it either produces something that only seems plausible or does not get anywhere at all (one word of advice: set a maximum number of turns, because some processes can spin in a logical loop for hours without progress). The trap is assuming “size” maps to the human version of the word. It does not. Eight hours of copy-paste work is boring, not complex, and the LLM will do it in seconds. A 15-minute architectural decision can be too much for the model because it needs judgment the model does not have. The right granularity for an agent queue is not a human time estimate, it is “how much context and judgment does this require,” and you have to learn that shape by running the loop and watching where it breaks. Getting the backlog granularity right is now a separate skill. And automating that part is the challenge: I am starting to conclude that this is not possible with the current models. Perhaps we can even say that with LLMs as we know them it will most likely never be possible (not until we get different kinds of intelligence). So there is the human role again.</p>
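
<p>A minimal sketch of such a turn cap in Go. The <code class="language-plaintext highlighter-rouge">step</code> type and <code class="language-plaintext highlighter-rouge">runWithTurnCap</code> are hypothetical names for illustration, not part of any agent SDK; a real loop would call the model inside the step function.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMaxTurns signals that the agent used up its turn budget without finishing.
var ErrMaxTurns = errors.New("max turns reached without completion")

// step is a hypothetical stand-in for one agent turn.
// It reports whether the work item is done.
type step func(turn int) (done bool, err error)

// runWithTurnCap calls next until it reports done, fails, or the cap is hit.
// It returns the number of turns used.
func runWithTurnCap(maxTurns int, next step) (int, error) {
	for turn := 1; turn <= maxTurns; turn++ {
		done, err := next(turn)
		if err != nil {
			return turn, err
		}
		if done {
			return turn, nil
		}
	}
	return maxTurns, ErrMaxTurns
}

func main() {
	// An agent that never finishes: the cap stops it instead of letting it spin.
	turns, err := runWithTurnCap(50, func(int) (bool, error) { return false, nil })
	fmt.Println(turns, err)
}
```

<p>Returning a distinct error for the cap lets the caller tell "gave up" apart from "failed", so a stuck work item can be split or escalated to a human instead of retried.</p>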

<p><strong>Reviewing LLM code all day is boring.</strong> This one is mostly about me, but I do not think I am alone. When you do not write the code yourself, the mental connection is gone. I have to search for everything. When I wrote (parts of) the codebase myself, I knew my way (that data model contains this, we have a helper function for that, etc). You are reading prose somebody else wrote in a codebase you do not know by heart, with the added complication that the entity (in the broadest sense) that wrote the code does not learn from your feedback across sessions (you can use memories to persist, but that is not learning). After a few hours your attention drops and you start approving things you would not have approved in the morning. And the idea that the future holds endless PR reviews every day for the rest of my life is not really motivating, especially since code will be produced so fast that the pipeline will always be full and waiting for you.</p>

<h2 id="the-review-triage-fix-loop">The Review-Triage-Fix Loop</h2>

<p>The first two problems I can partly engineer around. The third one I can only manage.</p>

<p>What I experimented with, and works better than anything else I tried, is a review loop with three stages before a human sees it:</p>

<ol>
  <li>A <strong>very critical review</strong>, with explicit checklists, run by a strong model. Not “looks fine” but “hunt for everything.” Complete, pedantic, slightly paranoid. Use multiple agents all focused on certain angles to look at the codebase.</li>
  <li>A <strong>triage</strong> pass over the findings. What actually needs to be fixed now? What goes on the backlog? What is wrong but not wrong enough to matter?</li>
  <li>A <strong>fix</strong> pass that automatically fixes the must-fix items.</li>
</ol>
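
<p>The triage stage above can be sketched as a simple bucketing step. This is a rule-based stand-in under stated assumptions: in the loop described here triage is a model pass that weighs each finding, and the <code class="language-plaintext highlighter-rouge">Severity</code> scale, <code class="language-plaintext highlighter-rouge">Finding</code> and <code class="language-plaintext highlighter-rouge">Triage</code> types are hypothetical.</p>

```go
package main

import "fmt"

// Severity is a hypothetical scale the review stage assigns to each finding.
type Severity int

const (
	Low Severity = iota
	Medium
	High
)

// Finding is one issue reported by the critical review.
type Finding struct {
	Title    string
	Severity Severity
}

// Triage holds the three outcomes: fix now, put on the backlog, or let go.
type Triage struct {
	FixNow  []Finding
	Backlog []Finding
	Ignored []Finding
}

// triage sorts findings into buckets by severity.
func triage(findings []Finding) Triage {
	var res Triage
	for _, f := range findings {
		switch f.Severity {
		case High:
			res.FixNow = append(res.FixNow, f)
		case Medium:
			res.Backlog = append(res.Backlog, f)
		default:
			res.Ignored = append(res.Ignored, f)
		}
	}
	return res
}

func main() {
	res := triage([]Finding{
		{"SQL built by string concatenation", High},
		{"Helper duplicated in two packages", Medium},
		{"Typo in a comment", Low},
	})
	fmt.Printf("fix now: %d, backlog: %d, ignored: %d\n",
		len(res.FixNow), len(res.Backlog), len(res.Ignored))
}
```

<p>Only the <code class="language-plaintext highlighter-rouge">FixNow</code> bucket is handed to the fix pass; the rest is recorded so the human reviewer can see what was deliberately left alone.</p>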

<figure>
  <img src="/assets/images/automated-builds-dashboard.jpg" alt="Project Server jobs dashboard showing a succeeded build with Setup, Build, Review, Triage and Fix stages ticked off" />
  <figcaption>One run of the loop: plan, build, review, triage, fix, all green. 51 minutes, 407 turns, 111k tokens, one PR out the other end.</figcaption>
</figure>

<p>Then and only then the human looks. By this point the easy stuff is handled, the backlog has captured the medium stuff, and the human is adding judgment.</p>

<p>This works. Problems get caught early, the agent stays on track, and I can intervene at the point where my time is worth the most. But two things crack under load.</p>

<h2 id="cost">Cost</h2>

<p>Each loop eats a lot of tokens. A critical review is not a two-line “LGTM” prompt, it is pages of context, guidelines, and targeted checks. The triage step needs the full review output. The fix step needs the triage output plus the original code plus the repo context. Do this on every PR and the bill adds up.</p>

<p>And having the human only intervene at the PR stage means a lot of work has been done before. If the PR is turned down or changes (possibly big ones) need to be made, this adds to the costs.</p>

<p>Opus 4.6 runs <a href="https://platform.claude.com/docs/en/about-claude/pricing">$5 per million input tokens and $25 per million output tokens</a>. That makes it one of the most expensive models currently available. It is even more expensive at scale with this approach: every feature goes through several expensive passes before a human is even involved. <a href="https://www.finout.io/blog/anthropic-api-pricing">Caching helps</a>, batching helps, but the floor is still non-trivial. And of course it is the most fun to have several processes run at once, which burns through my Claude Code token limits in no time. Then the extra usage starts, and most of the time I am at 1 euro per hour. That is not sustainable in the long run unless you are deeply funded and have money to burn (almost literally). Needless to say, I turned that off for most projects I work on.</p>
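
<p>To make the arithmetic concrete, here is a small sketch that prices one run of the loop at the rates quoted above. The per-stage token counts are made up for illustration, and caching and batching discounts are ignored.</p>

```go
package main

import "fmt"

// Prices quoted in the post for Opus 4.6, in USD per million tokens.
const (
	inputPerMTok  = 5.0
	outputPerMTok = 25.0
)

// Stage holds the token counts for one pass of the loop.
type Stage struct {
	Name         string
	InputTokens  int
	OutputTokens int
}

// costUSD returns the uncached, unbatched API cost of a stage.
func costUSD(s Stage) float64 {
	return float64(s.InputTokens)/1e6*inputPerMTok +
		float64(s.OutputTokens)/1e6*outputPerMTok
}

func main() {
	// Hypothetical token counts, not measurements.
	stages := []Stage{
		{"review", 400_000, 20_000},
		{"triage", 60_000, 5_000},
		{"fix", 300_000, 30_000},
	}
	total := 0.0
	for _, s := range stages {
		c := costUSD(s)
		total += c
		fmt.Printf("%-7s $%.2f\n", s.Name, c)
	}
	fmt.Printf("total   $%.2f per PR\n", total)
}
```

<p>Because output tokens cost five times as much as input tokens, a stage that writes a lot (the fix pass) can cost as much as a stage that reads far more (the review).</p>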

<p>The obvious question is whether you can drop to a cheaper model for some of the stages. I have tried, and the answer is no for most work. What works for me is Opus on planning and review, Sonnet on building. Results are better with Opus on everything, but token usage is a factor. I have never found a use for Haiku in this loop, and anything smaller than Sonnet on the build stage just breaks. The reviews from a weaker model miss the subtle stuff or flag perfectly fine code as breaking issues. A mediocre review is worse than no review because it gives you false confidence: at that point the human has to be able to trust the automated review. That is not the place to save tokens. Every time I have tried to save tokens by running a cheaper model somewhere in the chain, the quality drop caused rework, and the rework cost more tokens than I saved. The cheap option is, in the end, the expensive option. Given the hourly rate of the developer and the cost of tokens you can do a nice calculation of what it costs: in the end the human is more expensive. So making it easy for the human is key: less time spent per feature is cheaper. But rework costs extra time.</p>
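
<p>That calculation can be sketched as follows. Every number is hypothetical, and the model is deliberately simple: it assumes each rework is a full new attempt, so the expected number of attempts is 1 / (1 - rework rate).</p>

```go
package main

import "fmt"

// Option describes one model choice for a stage. All values used in main are hypothetical.
type Option struct {
	Name          string
	TokenCost     float64 // model spend per attempt, in euros
	ReviewMinutes float64 // human time per attempt
	ReworkRate    float64 // chance (0 up to, but not including, 1) that an attempt is redone
}

// expectedCost returns the expected cost per accepted feature.
// It assumes every rework repeats the whole attempt, tokens and review included.
func expectedCost(o Option, hourlyRate float64) float64 {
	perAttempt := o.TokenCost + o.ReviewMinutes/60*hourlyRate
	return perAttempt / (1 - o.ReworkRate)
}

func main() {
	const hourlyRate = 80.0 // hypothetical developer rate in euros
	for _, o := range []Option{
		{"strong model", 6, 15, 0.10},
		{"cheap model", 2, 15, 0.40},
	} {
		fmt.Printf("%-12s %.2f per feature\n", o.Name, expectedCost(o, hourlyRate))
	}
}
```

<p>With numbers like these the human time dominates each attempt, so a higher rework rate outweighs the token savings quickly.</p>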

<h2 id="fatigue">Fatigue</h2>

<p>The other problem is me (or developers in general). I can absorb code and automated reviews in a PR for a while, but only a while. Staring at well-structured PR summaries and deciding “yes, agree, ship it” over and over is draining in a way that writing code is not. You are in evaluation mode all day and never in creation mode, and evaluation mode runs on a smaller battery.</p>

<p>The obvious fix is to take the human out of every review, but that runs straight back into lesson one. Automated reviews alone miss things an experienced developer would catch in thirty seconds. A well-rested, motivated, experienced developer is still a better reviewer than any model I have tried, even with careful prompts and all the checklists in the world.</p>

<p>An interesting sub-observation: the automated review is probably better than a significant fraction of what developers do in practice. I would guesstimate 40%, with no hard numbers to back that up, based on the PR reviews I have seen from myself and colleagues over the years. Tired reviewer on a Friday afternoon versus an LLM run on a critical-review prompt? LLM wins, most of the time.</p>

<h2 id="the-ceiling">The Ceiling</h2>

<p>So what is blocking the next step, where I can trust the loop without babysitting it every few PRs?</p>

<p>Two things, and they are linked.</p>

<p><strong>Tokens.</strong> Cost and availability. I need to run more review passes, with more context, on more PRs, and I cannot unless the per-token cost drops or the rate limits go up. Right now a busy day maxes out my limits before I am done, which is its own kind of bottleneck. Combine that with Anthropic’s recent change to weigh tokens used between 14:00 and 22:00 (in my timezone) more heavily, and running this after 2 in the afternoon becomes even more expensive. On top of that, connections fail a few times per week, so there are plenty of problems there.</p>

<p><strong>Smarter models.</strong> Specifically, smarter models that I can run locally. Not because I want to self-host for the sake of it, but because local inference is how the per-token cost drops. The current local options are getting genuinely good. <a href="https://huggingface.co/Qwen/Qwen3-Coder-Next">Qwen3 Coder Next</a> scores 44.3 on SWE-Bench Pro, above DeepSeek-V3.2 and GLM-4.7. <a href="https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks">Qwen 3.5 hits 76.4 on SWE-bench Verified</a>, level with Gemini 3 Pro. <a href="https://ollama.com/library/gemma4">Gemma 4</a> is a real step up from Gemma 3.</p>

<p>But Opus 4.6 still <a href="https://akitaonrails.com/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">wins on the hard stuff</a>: multi-file reasoning, long-horizon planning, the kind of review where you need to hold a whole architecture in your head. And the critical review stage is exactly that kind of work. For me to move the review loop onto local hardware, the local models need to at least match what Opus 4.6 does today, not approximately match it. We need something closer to Opus 4.6+++, running on a box in my office, before this flips.</p>

<p>That will happen. Local models are improving fast. But it is not this year.</p>

<h2 id="and-then-there-is-electricity">And Then There Is Electricity</h2>

<p><a href="https://www.cnbc.com/2026/02/12/electricity-price-data-center-ai-inflation-goldman.html">AI data center demand is pushing electricity prices up across Europe</a>. Goldman expects a 10 to 15% boost to European power demand over the coming 10 to 15 years, mostly from data centers. The <a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA puts global data center consumption on track to approach 1,050 TWh by 2026</a>, which would rank data centers between Japan and Russia if they were a country.</p>

<p>On top of that, the situation in the Middle East has pushed <a href="https://www.iea.org/reports/oil-market-report-april-2026">physical crude oil prices near $150 a barrel</a>, with <a href="https://www.aljazeera.com/news/2026/4/14/global-oil-demand-to-plunge-amid-middle-east-war-disruptions">shipping through the Strait of Hormuz still severely restricted</a>. Energy, in other words, is getting more expensive while AI is driving demand higher, and both lines are bending up. Thankfully we have wind and sun.</p>

<p>So even the local-hardware dream has a cost floor (not forgetting the rising costs of RAM and GPUs). Running Opus-class models on your own machine still takes a lot of watts and decent hardware.</p>

<h2 id="where-this-leaves-me">Where This Leaves Me</h2>

<p>The review-triage-fix loop catches more than manual review alone. It produces better PRs than pure automation. It lets me spend my attention on the decisions that need attention.</p>

<p>But it is not the end state. The end state, where I can let the builds run themselves, is gated by two things I cannot fix on my laptop: cheaper strong-model inference, and a local model that reaches Opus 4.6 plus. Until those arrive I am more or less at a standstill when it comes to further improvement.</p>

<p>Which is fine if you look back at what is now possible that was not possible only three years ago when all this started for me. I have run worse loops. And in the meantime I have learned a lot about where the human adds value in this stack, which is probably the most useful thing you can learn right now.</p>

<p>If you are running a similar loop and have found something that works better, I would really like to hear about it.</p>

<hr />

<p><em>Automating builds, tuning review loops, or stuck on the same ceiling? <a href="#" onclick="task1(); return false;">Get in touch</a> to compare notes.</em></p>

<h2 id="resources">Resources</h2>

<h3 id="models-and-pricing">Models and pricing</h3>

<ul>
  <li><a href="https://platform.claude.com/docs/en/about-claude/pricing">Claude API pricing</a> - Official Anthropic pricing for Opus 4.6 and other models</li>
  <li><a href="https://www.finout.io/blog/anthropic-api-pricing">Anthropic API pricing deep dive</a> - Caching and batching cost breakdown</li>
  <li><a href="https://huggingface.co/Qwen/Qwen3-Coder-Next">Qwen3-Coder-Next on Hugging Face</a> - Open-weight coding model</li>
  <li><a href="https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks">Qwen 3.5 benchmarks</a> - SWE-bench and real-world results</li>
  <li><a href="https://ollama.com/library/gemma4">Gemma 4 on Ollama</a> - Local deployment</li>
  <li><a href="https://akitaonrails.com/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">Can anyone beat Claude Opus?</a> - Open vs commercial comparison</li>
</ul>

<h3 id="energy-and-infrastructure">Energy and infrastructure</h3>

<ul>
  <li><a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA: Energy demand from AI</a> - Global data center projections</li>
  <li><a href="https://www.cnbc.com/2026/02/12/electricity-price-data-center-ai-inflation-goldman.html">Goldman Sachs on AI electricity demand</a> - European power demand forecast</li>
  <li><a href="https://www.iea.org/reports/oil-market-report-april-2026">IEA Oil Market Report April 2026</a> - Current supply disruptions</li>
  <li><a href="https://www.aljazeera.com/news/2026/4/14/global-oil-demand-to-plunge-amid-middle-east-war-disruptions">Middle East crisis and global oil demand</a></li>
</ul>

<h3 id="related-posts">Related posts</h3>

<ul>
  <li><a href="/ai/llm/development/best-practices/2025/11/14/human-in-the-loop-ai-code-review.html">Human in the Loop</a> - Why human review still matters</li>
  <li><a href="/ai/development/operations/2026/04/16/when-llms-actually-deliver.html">When LLMs Actually Deliver</a> - The tooling that makes LLMs useful</li>
  <li><a href="/ai/development/tools/2026/04/09/gtk-cutting-llm-token-costs-cli-output.html">gtk: Filtering CLI Noise to Save Tokens</a> - Related token-cost work</li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="automation" /><summary type="html"><![CDATA[Automated LLM-driven builds mostly work. What stops you at scale: costs, reviewer/developer fatigue, and models that are not intelligent enough.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/automated-builds-cost-fatigue-ceiling-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/automated-builds-cost-fatigue-ceiling-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">When LLMs Actually Deliver</title><link href="https://www.albertsikkema.com/ai/development/operations/2026/04/16/when-llms-actually-deliver.html" rel="alternate" type="text/html" title="When LLMs Actually Deliver" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/operations/2026/04/16/when-llms-actually-deliver</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/operations/2026/04/16/when-llms-actually-deliver.html"><![CDATA[<figure>
  <img src="/assets/images/when-llms-actually-deliver-blog.jpg" alt="Vintage industrial control room with rows of analog gauges, dials, and switches on a marble panel" width="1920" height="1280" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>Many signals, one panel. Photo by <a href="https://unsplash.com/@modry_dinosaurus">Frantisek Duris</a> on <a href="https://unsplash.com/photos/C3DfIgig1j8">Unsplash</a>.</figcaption>
</figure>

<p>I have a love/hate relationship with LLMs (mostly love, but every now and then I get frustrated).</p>

<p>One moment they nail a complex refactor across six files. The next they confidently introduce a bug that any starting developer would (probably) catch, or hallucinate an API that has never existed. The swing between brilliant and stupid can happen within two turns of the same conversation. Every time you start to trust, even a little, that trust is crushed to bits.</p>

<p>But every now and then something happens that makes you stop and think: <em>this changes things</em> (especially if you suddenly realise how complex and time-consuming some tasks were before).</p>

<h2 id="one-sentence-thirty-seconds">One Sentence, Thirty Seconds</h2>

<p>This morning I opened Claude Code and typed one sentence:</p>

<blockquote>
  <p>Please check the logs for production of &lt;redacted repo name&gt; over the last 24 hours. Then check the PRs made to see if the uploaded changes in the last 3 days have made a difference.</p>
</blockquote>

<p>Thirty seconds later I had:</p>

<ul>
  <li>A full breakdown of all log events with error rates per hour</li>
  <li>Error classification by type, with counts and affected tenants</li>
  <li>A correlation table showing how each merged PR impacted specific error categories</li>
  <li>A before/after comparison: ‘clarifyinput’ errors dropped from 29 to 2, OpenAI errors from 43 to zero</li>
  <li>An explanation of why icon errors <em>went up</em> (improved logging granularity, not a regression)</li>
  <li>Three concrete action items with root causes identified</li>
</ul>

<p>Really helpful: the LLM didn’t just count errors – it ‘understood’ that PR #458 changed the logging format from a generic “fetching” message to per-icon “loading” messages, and correctly concluded that the apparent spike in icon errors was an artifact of better observability, not a regression. That kind of contextual reasoning across log data and code changes used to take me an hour of cross-referencing. On a bad day, longer.</p>

<p>Before this, checking whether a deploy improved things looked something like:</p>

<ol>
  <li>Open the logging dashboard, set the time range, filter by severity</li>
  <li>Stare at graphs, try to spot patterns</li>
  <li>Open GitHub, find PRs merged in the relevant window</li>
  <li>For each PR, read the diff and figure out which error messages it should have affected</li>
  <li>Go back to the logging dashboard, write queries for those specific error messages</li>
  <li>Compare before/after time windows</li>
  <li>Write it all up in your head or in a document</li>
  <li>Repeat for each PR</li>
</ol>

<p>If you had proper dashboards and saved queries this was maybe an hour. Without them, half a day. And you had to know what to look for in advance – if a PR had an unexpected side effect, you might miss it entirely.</p>

<p>Apart from the time it took, there is also the mental drain of checking the logs: it takes time and energy but does not create a solution. It is not even about getting to know the scope or the cause of the problem. The energy goes into operating the tools, just to see if there IS a problem. Now that energy can go towards understanding the problem and moving on to what we get paid for: thinking about how to solve it.</p>

<p>All I have to do is decide whether I agree with its conclusions (about 60% of the time I do), and decide how to tackle the issues.</p>

<h2 id="the-playbook-is-everything">The Playbook Is Everything</h2>

<p>The LLM did not do this on its own. Left to its own devices with this instruction, it would not have had access to the logs or the repo, so it could not have done anything at all. Even with API access to the logs and a connection to the repo, the results would not have been as good as this. The model would have fumbled with authentication, guessed at query syntax, missed fields, and produced a surface-level summary that looks impressive but tells you nothing you didn’t already know. And it would not remember next time how to query to get to a certain result.</p>

<p>What made this work is that I built the tools and the playbook first. (Logbench is one of several tools I built for this; another is <a href="/ai/development/tools/2026/04/09/gtk-cutting-llm-token-costs-cli-output.html">gtk</a>, which filters CLI output to save tokens.)</p>

<p><strong>Logbench</strong> is a small MCP server I wrote in Go. It connects to our Axiom log platform and exposes a handful of tools: <code class="language-plaintext highlighter-rouge">query_apl</code> for running raw queries, <code class="language-plaintext highlighter-rouge">explore_dataset</code> and <code class="language-plaintext highlighter-rouge">error_breakdown</code> for guided analysis, <code class="language-plaintext highlighter-rouge">get_dataset_schema</code> so the LLM knows what fields exist. Nothing fancy. But each tool has a clear contract: here is what you pass in, here is what you get back. (Why not the <a href="https://github.com/axiomhq/mcp">official Axiom MCP server</a>? We are on a separate version (eu) and the official server does not work with that. I did take inspiration from their code though.) Besides, I added some extra steps to find certain information that is unique to this implementation.</p>

<p>The key is the <strong>prompted workflows</strong>. Logbench doesn’t just expose raw query access. It includes structured playbooks: “when investigating errors, first get the schema, then run an error breakdown, then drill into the top categories.” The LLM follows the playbook instead of improvising.</p>

<p>So the LLM is not being creative here. It is following a recipe for how to query the logs for certain frequently needed results. The starting prompt then turns into a logical sequence:</p>

<ol>
  <li>Get the dataset schema (know your fields)</li>
  <li>Run aggregate queries (get the big picture)</li>
  <li>Break down errors by message (find the categories)</li>
  <li>Get recent PRs from GitHub (know what changed)</li>
  <li>Run time-windowed queries per error category (measure impact)</li>
  <li>Cross-reference and synthesize (connect the dots)</li>
</ol>

<p>Each step is a tool call with predictable input and output. The LLM’s job is to orchestrate the steps, handle errors (like when a field name is wrong – it recovered and tried bracket notation), and synthesize the results into something a human can act on. (This is the same orchestration principle I described in <a href="/ai/llm/development/productivity/2025/11/21/orchestrator-automating-claude-code-workflows.html">automating Claude Code workflows</a>, but applied to log analysis instead of code.)</p>

<h2 id="the-ifs">The IFs</h2>

<p>This only works:</p>

<p><strong>IF</strong> you give them the right tools. Not raw access to everything, but curated tools with clear interfaces.</p>

<p><strong>IF</strong> you give them a playbook. Not “figure it out” but “here are the steps, follow them.”</p>

<p><strong>IF</strong> the tools handle the hard parts. Authentication, query syntax, field validation, error formatting – all of that lives in the MCP server, not in the prompt.</p>
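
<p>As an illustration of a tool handling one of those hard parts, here is a sketch of field validation in Go. It assumes the tool already holds the dataset schema as a set of field names; <code class="language-plaintext highlighter-rouge">validateFields</code> is a hypothetical helper, not Logbench code.</p>

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// validateFields checks requested field names against the dataset schema.
// On failure the error lists the valid names, so the model can correct
// itself on the next call instead of guessing again.
func validateFields(schema map[string]bool, requested []string) error {
	var unknown []string
	for _, f := range requested {
		if !schema[f] {
			unknown = append(unknown, f)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	valid := make([]string, 0, len(schema))
	for f := range schema {
		valid = append(valid, f)
	}
	sort.Strings(valid) // stable output regardless of map order
	return fmt.Errorf("unknown field(s) %s; valid fields: %s",
		strings.Join(unknown, ", "), strings.Join(valid, ", "))
}

func main() {
	schema := map[string]bool{"level": true, "message": true, "tenant": true}
	fmt.Println(validateFields(schema, []string{"level", "severity"}))
}
```

<p>Rejecting the query before it runs, with the valid options in the error text, turns a silent wrong answer into a recoverable tool error.</p>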

<p>Without those guardrails the same model will confidently query a field that doesn’t exist, and present wrong conclusions with the same authoritative tone. (I wrote about <a href="/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails.html">building these kinds of guardrails</a> in more detail previously.) That is what I observed early in development, before the playbooks and extensive query examples were in place.</p>

<h2 id="the-pattern">The Pattern</h2>

<p>Sometimes LLMs are truly brilliant and come up with solutions that are elegant and useful. Most of the time they do not. Apart from those moments of brilliance, every time I have seen an LLM deliver useful results this was because of:</p>

<ol>
  <li><strong>Structured tools</strong> with clear inputs and outputs</li>
  <li><strong>Guided workflows</strong> that tell the model what steps to follow</li>
  <li><strong>Domain knowledge baked into the tools</strong></li>
  <li><strong>The LLM as orchestrator and synthesizer</strong>, not as domain expert</li>
</ol>

<p>That is a fundamentally different use case than “write me a function” or “refactor this class.” It is closer to having a junior analyst who can follow a checklist very fast and write a surprisingly good summary. You still need to build the checklist and the tools.</p>

<h2 id="lovehate-but-mostly-love-today">Love/Hate, But Mostly Love Today</h2>

<p>I will go back to being mad and amazed at hallucinated facts and stupid ideas and authoritative tone tomorrow. The love/hate cycle continues. But moments like today are a reminder that the frustrating parts are worth pushing through, because when it clicks – when the tools are right and the playbook is clear – the result is something that was not possible two years ago.</p>

<p>Not because the AI is intelligent in itself, but because the system around it is (which, with some philosophical reflection, is not far from how we humans function). Have a nice day!</p>

<hr />

<p><em>Logbench is not public – it is custom built for internal use. Interested in building something similar? <a href="#" onclick="task1(); return false;">Get in touch</a>.</em></p>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="operations" /><summary type="html"><![CDATA[LLMs can be brilliant and stupid within two turns. But give them the right tools and a playbook, and the results are something that was not possible two years ago.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/when-llms-actually-deliver-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/when-llms-actually-deliver-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">gtk: Filtering CLI Noise to Save LLM Tokens</title><link href="https://www.albertsikkema.com/ai/development/tools/2026/04/09/gtk-cutting-llm-token-costs-cli-output.html" rel="alternate" type="text/html" title="gtk: Filtering CLI Noise to Save LLM Tokens" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/tools/2026/04/09/gtk-cutting-llm-token-costs-cli-output</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/tools/2026/04/09/gtk-cutting-llm-token-costs-cli-output.html"><![CDATA[<figure>
  <img src="/assets/images/gtk-coffee-filter.jpg" alt="Pour-over coffee filter with dark coffee dripping through, filtering grounds from liquid" width="1920" height="1280" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>Filtering out what you don't need. Photo by <a href="https://unsplash.com/@eilivaceron">Eiliv Aceron</a> on <a href="https://unsplash.com/photos/XP5zW2ngk9w">Unsplash</a>.</figcaption>
</figure>

<p>A few weeks ago I ran into <a href="https://github.com/rtk-ai/rtk">rtk</a> (Rust Token Killer), a CLI proxy that filters command output before it reaches your AI coding agent. The idea is brilliant: most of what <code class="language-plaintext highlighter-rouge">git log</code> or <code class="language-plaintext highlighter-rouge">cargo test</code> spits out is noise the model doesn’t need. Strip it, and you save tokens (and limit the total context you build up over several rounds of back and forth).</p>

<p>I liked the concept but not the scope. rtk is written in Rust and supports a lot of features I’ll never use. And for a lot of my workflow it is important to get the full response for git diff and PRs: rtk does not do that and I found no easy way to change it. So I did what any reasonable developer would do: I took the principle and rewrote it in Go with only the parts I care about.</p>

<p>The result is gtk (Go Token Kit): a quick-and-dirty single static binary with no dependencies and 87 filters across the tools I actually use. It is not public (it is not polished enough for that, and rtk already exists), but the principles behind it are worth sharing.</p>

<p>Does it save as much as rtk suggests? I do not know, because I do not measure the total tokens saved. Every now and then you also have to rerun a command without gtk to get the full output, and that extra call cancels out much of what was saved before. So overall I think I save less than rtk claims. My guesstimate is 10-15%, which is still more than enough to justify using it.</p>

<h2 id="why-this-is-worth-caring-about">Why This Is Worth Caring About</h2>

<p>In a typical Claude Code session, CLI output is the largest source of (wasted) tokens. Run <code class="language-plaintext highlighter-rouge">go test ./...</code> with 100 passing tests: hundreds of lines of noise. Multiply that by dozens of commands per session. And all you actually need for that step is: did the tests pass? Nothing more. One analysis found that <a href="https://medium.com/@jakenesler/context-compression-to-reduce-llm-costs-and-frequency-of-hitting-limits-e11d43a26589">AI coding agents spend 60-80% of their token budget</a> on orientation (finding things, reading output), not on actual problem-solving.</p>

<p>Tokens aren’t free: every token is consumed by the LLM and costs money, and since the context is passed along on every interaction, this adds up. Tokens also eat into your context window, which means shorter useful conversations before the model starts forgetting earlier context. They slow down responses (more input to process), and <a href="https://redis.io/blog/context-window-overflow/">irrelevant context actively degrades LLM performance</a>.</p>

<p>gtk (and rtk) cuts that down. A <code class="language-plaintext highlighter-rouge">git log -20</code> that would produce 2,000 tokens comes out as ~120 tokens (hash, subject, date, author). A test run with 100 passing tests becomes a single summary line. The savings add up fast.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[gtk: 2054 -&gt; 118 tokens, 94% saved]
</code></pre></div></div>

<p>That hint prints to stderr after every filtered command. It’s a nice reminder that the thing is actually working, and it gives an idea of what is passed to the LLM instead of the full output of the previous step.</p>
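
<p>A sketch of how such a hint line could be produced. The token counts are taken as inputs here, since how they are estimated is not covered in this post, and rounding the percentage down is an assumption.</p>

```go
package main

import (
	"fmt"
	"os"
)

// savingsHint formats the one-line summary shown after a filtered command.
// The percentage is an integer, rounded down.
func savingsHint(before, after int) string {
	saved := 0
	if before > 0 && after < before {
		saved = 100 * (before - after) / before
	}
	return fmt.Sprintf("[gtk: %d -> %d tokens, %d%% saved]", before, after, saved)
}

func main() {
	// Writing to stderr keeps the hint out of the filtered stdout.
	fmt.Fprintln(os.Stderr, savingsHint(2054, 118))
}
```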

<h2 id="how-it-works">How It Works</h2>

<p>gtk operates in three stages:</p>

<p><strong>Argument injection.</strong> Before running the command, gtk can modify arguments to get more parseable output. It injects <code class="language-plaintext highlighter-rouge">-json</code> into <code class="language-plaintext highlighter-rouge">go test</code>, <code class="language-plaintext highlighter-rouge">--pretty=format:...</code> into <code class="language-plaintext highlighter-rouge">git log</code>, <code class="language-plaintext highlighter-rouge">--reporter=json</code> into <code class="language-plaintext highlighter-rouge">vitest</code>. It checks whether you already specified these flags, so it never overrides your intent.</p>

<p><strong>Execution.</strong> Runs the real command, captures stdout and stderr, preserves the original exit code. If <code class="language-plaintext highlighter-rouge">go test</code> fails with exit code 1, gtk returns exit code 1.</p>

<p><strong>Filtering.</strong> Applies one of 87 registered filters to compress the output. If no filter matches, output passes through unchanged. If a filter errors, you get the raw output as fallback. It never blocks you.</p>
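<p>The first two stages can be sketched in a few lines of Go. This is a minimal illustration and not gtk’s actual source: <code class="language-plaintext highlighter-rouge">injectFlag</code> and <code class="language-plaintext highlighter-rouge">run</code> are hypothetical names, stdout and stderr are merged for brevity, and the fallback exit code 127 for a command that cannot be started is an arbitrary choice.</p>

```go
package main

import (
    "errors"
    "fmt"
    "os/exec"
    "strings"
)

// injectFlag inserts flag right after the subcommand (args[0]) unless the
// caller already passed that flag, with or without a "=value" part.
func injectFlag(args []string, flag string) []string {
    name := strings.SplitN(flag, "=", 2)[0]
    for _, a := range args {
        if a == name || strings.HasPrefix(a, name+"=") {
            return args // the user's own choice wins
        }
    }
    if len(args) == 0 {
        return []string{flag}
    }
    out := make([]string, 0, len(args)+1)
    out = append(out, args[0], flag)
    return append(out, args[1:]...)
}

// run executes the command and returns its combined stdout/stderr together
// with the original exit code, so the caller can pass that code on unchanged.
func run(name string, args ...string) (string, int) {
    out, err := exec.Command(name, args...).CombinedOutput()
    var exitErr *exec.ExitError
    switch {
    case err == nil:
        return string(out), 0
    case errors.As(err, &exitErr):
        return string(out), exitErr.ExitCode()
    default:
        return err.Error(), 127 // assumption: the command could not be started
    }
}

func main() {
    fmt.Println(injectFlag([]string{"test", "./..."}, "-json"))          // [test -json ./...]
    fmt.Println(injectFlag([]string{"test", "-json", "./..."}, "-json")) // [test -json ./...]

    _, code := run("go", "version")
    fmt.Println("exit code:", code)
}
```

<p>The second call leaves the arguments untouched because the flag is already there, which is the “never overrides your intent” rule in miniature.</p>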

<p>The filter strategies vary by what makes sense for each command:</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>What it does</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Elimination</td>
      <td>Strips passing tests, compilation progress, hint lines</td>
      <td><code class="language-plaintext highlighter-rouge">cargo test</code> only shows failures</td>
    </tr>
    <tr>
      <td>Compression</td>
      <td>One line per item instead of multi-line blocks</td>
      <td><code class="language-plaintext highlighter-rouge">git log</code> becomes hash + subject + date</td>
    </tr>
    <tr>
      <td>Deduplication</td>
      <td>Replaces timestamps and UUIDs with placeholders, counts occurrences</td>
      <td>Repeated log lines collapse</td>
    </tr>
    <tr>
      <td>Structured parsing</td>
      <td>Parses JSON output into summaries</td>
      <td><code class="language-plaintext highlighter-rouge">go test -json</code> becomes “100 passed in 2 packages”</td>
    </tr>
    <tr>
      <td>Truncation</td>
      <td>Caps long lines and large result sets</td>
      <td>Lines cut to 80-120 chars</td>
    </tr>
    <tr>
      <td>Masking</td>
      <td>Replaces sensitive values with <code class="language-plaintext highlighter-rouge">****</code></td>
      <td>Env vars containing “secret”, “token”, “password”</td>
    </tr>
  </tbody>
</table>
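<p>To make one row of the table concrete, here is a rough sketch of the deduplication strategy. It is an assumption about how such a filter could look, not gtk’s real code: the function name <code class="language-plaintext highlighter-rouge">dedupe</code>, the two regular expressions, and the <code class="language-plaintext highlighter-rouge">[TS]</code> and <code class="language-plaintext highlighter-rouge">[UUID]</code> placeholders are all made up for illustration.</p>

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

var (
    // Canonical 8-4-4-4-12 hexadecimal UUIDs.
    uuidRe = regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)
    // ISO-8601 style timestamps such as 2026-04-23T10:00:01Z.
    timestampRe = regexp.MustCompile(`\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?`)
)

// dedupe replaces volatile values with placeholders, then collapses lines
// that have become identical into one line with an occurrence count.
func dedupe(output string) string {
    counts := map[string]int{}
    var order []string // remembers first-seen order of the normalized lines
    for _, line := range strings.Split(strings.TrimRight(output, "\n"), "\n") {
        key := uuidRe.ReplaceAllString(line, "[UUID]")
        key = timestampRe.ReplaceAllString(key, "[TS]")
        if counts[key] == 0 {
            order = append(order, key)
        }
        counts[key]++
    }
    var b strings.Builder
    for _, key := range order {
        if counts[key] > 1 {
            fmt.Fprintf(&b, "%s (x%d)\n", key, counts[key])
        } else {
            b.WriteString(key + "\n")
        }
    }
    return b.String()
}

func main() {
    logs := "2026-04-23T10:00:01Z request 123e4567-e89b-12d3-a456-426614174000 done\n" +
        "2026-04-23T10:00:02Z request 9b2d6f1c-0a4e-4c1b-8f3a-5d7e9c2b1a00 done\n" +
        "2026-04-23T10:00:03Z cache miss\n"
    fmt.Print(dedupe(logs))
    // [TS] request [UUID] done (x2)
    // [TS] cache miss
}
```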

<h2 id="the-part-that-makes-it-actually-work">The Part That Makes It Actually Work</h2>

<p>If the AI has to remember to prefix every command with <code class="language-plaintext highlighter-rouge">gtk</code>, it probably will (sometimes). Inconsistency is the norm with LLMs.</p>

<p>So Claude should not have to remember it at all. Claude just runs <code class="language-plaintext highlighter-rouge">git log -10</code> normally, and a <a href="https://code.claude.com/docs/en/hooks">PreToolUse hook</a> intercepts the command and transparently rewrites it to <code class="language-plaintext highlighter-rouge">gtk git log -10</code>. You can compare it to a proxy (or middleware): it intercepts the command, modifies it, runs the altered command, and returns the cleaned output.</p>

<p>The hook checks if the command starts with a known prefix and prepends the gtk binary path. Commands with pipes or chains are left alone, since those already have their own filtering (and, more pragmatically, they are too difficult to reliably recreate with gtk).</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">var</span> <span class="n">gtkPrefixes</span> <span class="o">=</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span>
    <span class="s">"git "</span><span class="p">,</span> <span class="s">"cargo "</span><span class="p">,</span> <span class="s">"go test"</span><span class="p">,</span> <span class="s">"go build"</span><span class="p">,</span> <span class="s">"gh "</span><span class="p">,</span>
    <span class="s">"docker "</span><span class="p">,</span> <span class="s">"kubectl "</span><span class="p">,</span> <span class="s">"npm "</span><span class="p">,</span> <span class="s">"pytest "</span><span class="p">,</span> <span class="s">"curl "</span><span class="p">,</span>
    <span class="c">// ... 24 prefixes total</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">tryGtkRewrite</span><span class="p">(</span><span class="n">command</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">string</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">strings</span><span class="o">.</span><span class="n">ContainsAny</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="s">"|"</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="s">""</span> <span class="p">}</span>
    <span class="k">if</span> <span class="n">strings</span><span class="o">.</span><span class="n">Contains</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="s">"&amp;&amp;"</span><span class="p">)</span>   <span class="p">{</span> <span class="k">return</span> <span class="s">""</span> <span class="p">}</span>

    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">prefix</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">gtkPrefixes</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">strings</span><span class="o">.</span><span class="n">HasPrefix</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">prefix</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">gtkBin</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">command</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="s">""</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is the same hook I wrote about in <a href="/ai/security/development/tools/2026/02/01/securing-claude-code-hooks-best-practices.html">Securing YOLO Mode</a>, except now it does three jobs instead of one:</p>

<ol>
  <li><strong>Security patterns</strong> – blocks dangerous <code class="language-plaintext highlighter-rouge">rm</code>, fork bombs, force pushes to main, reverse shells, credential exfiltration</li>
  <li><strong>Deny list</strong> – project-specific glob patterns from settings.json converted to regex at runtime</li>
  <li><strong>GTK rewrite</strong> – the token optimization layer</li>
</ol>

<p>One binary, three layers. In container mode (<code class="language-plaintext highlighter-rouge">CLAUDE_CONTAINER_MODE=1</code>), local-only threats like <code class="language-plaintext highlighter-rouge">rm -rf</code> are skipped since the container is disposable, but network and escape checks stay active.</p>
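<p>The deny-list layer can be sketched roughly like this. It is a minimal illustration under stated assumptions: patterns use only the <code class="language-plaintext highlighter-rouge">*</code> and <code class="language-plaintext highlighter-rouge">?</code> wildcards, and <code class="language-plaintext highlighter-rouge">globToRegexp</code> and <code class="language-plaintext highlighter-rouge">denied</code> are hypothetical names. The real pattern syntax in settings.json and the hook’s actual conversion are not shown in this post.</p>

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// globToRegexp turns a glob with "*" and "?" wildcards into an anchored
// regular expression. All other characters are matched literally.
func globToRegexp(glob string) (*regexp.Regexp, error) {
    var b strings.Builder
    b.WriteString("(?s)^") // (?s) lets wildcards span newlines in multi-line commands
    for _, r := range glob {
        switch r {
        case '*':
            b.WriteString(".*")
        case '?':
            b.WriteString(".")
        default:
            b.WriteString(regexp.QuoteMeta(string(r)))
        }
    }
    b.WriteString("$")
    return regexp.Compile(b.String())
}

// denied reports whether command matches any deny pattern. A pattern that
// fails to compile counts as a match, so the check fails closed.
func denied(command string, globs []string) bool {
    for _, g := range globs {
        re, err := globToRegexp(g)
        if err != nil || re.MatchString(command) {
            return true
        }
    }
    return false
}

func main() {
    // Example patterns, invented for this sketch.
    globs := []string{"git push --force*", "npm publish*"}
    fmt.Println(denied("git push --force origin main", globs)) // true
    fmt.Println(denied("git push origin main", globs))         // false
}
```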

<h2 id="why-go-instead-of-rust">Why Go Instead of Rust</h2>

<p>The original I took inspiration from is written in Rust. It works well. But I like Go, had little time and thought this should work just as well: Go compiles to a static binary just like Rust and is more than fast enough. Cross-compilation is trivial (<code class="language-plaintext highlighter-rouge">GOOS=linux GOARCH=amd64 go build</code>). And adding a new filter is just writing a function and registering it in a map. Easy does it.</p>
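<p>“Writing a function and registering it in a map” could look roughly like the sketch below. The type and function names are hypothetical, the <code class="language-plaintext highlighter-rouge">git status</code> filter is a toy, and matching on a command prefix is an assumption. The fallback behaviour follows the description above: no matching filter or a failing filter both return the raw output.</p>

```go
package main

import (
    "errors"
    "fmt"
    "strings"
)

// Filter compresses the raw output of one specific command.
type Filter func(raw string) (string, error)

// filters maps a command prefix to its filter. Adding a filter means
// writing a function and adding one entry here.
var filters = map[string]Filter{
    "git status": dropGitHints,
    "broken":     func(string) (string, error) { return "", errors.New("filter failed") },
}

// dropGitHints removes the `(use "git ..." to ...)` hint lines.
func dropGitHints(raw string) (string, error) {
    var kept []string
    for _, line := range strings.Split(raw, "\n") {
        if strings.HasPrefix(strings.TrimSpace(line), `(use "git `) {
            continue
        }
        kept = append(kept, line)
    }
    return strings.Join(kept, "\n"), nil
}

// applyFilter returns the filtered output, or the raw output when no filter
// matches or the matching filter returns an error.
func applyFilter(command, raw string) string {
    for prefix, f := range filters {
        if strings.HasPrefix(command, prefix) {
            if out, err := f(raw); err == nil {
                return out
            }
            return raw
        }
    }
    return raw
}

func main() {
    raw := "On branch main\n  (use \"git push\" to publish your local commits)\nnothing to commit"
    fmt.Println(applyFilter("git status", raw))
    fmt.Println(applyFilter("ls -la", raw) == raw) // true: no filter registered
}
```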

<p>I also dropped everything I don’t use. rtk supports Gradle, Maven, Swift, .NET, Terraform, and others, and integrates with Cursor, Gemini CLI, Aider, and other agents. I don’t work with any of those tools, and I only use Claude Code. My version covers git, go, cargo, docker, kubectl, npm, pnpm, pytest, eslint, tsc, prettier, vitest, curl, and a few more. That’s it.</p>

<h2 id="bypass-when-needed">Bypass When Needed</h2>

<p>Sometimes filtered output hides what you need. The escape hatch is <code class="language-plaintext highlighter-rouge">gtk proxy</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gtk proxy git log <span class="nt">-10</span>   <span class="c"># full unfiltered output</span>
</code></pre></div></div>

<p>You need this when:</p>
<ul>
  <li>Filtered output doesn’t explain a failure</li>
  <li>You want to see passing tests, not just failures</li>
  <li>You need full diff content, not just stats</li>
  <li>A warning or log line might be relevant to the issue</li>
</ul>

<p>Claude is aware of the option to bypass and is instructed to use <code class="language-plaintext highlighter-rouge">gtk proxy</code> whenever the output is not clear (for instance a failing test is only labeled as failed, without the details). When confronted with that, Claude can rerun the command with <code class="language-plaintext highlighter-rouge">gtk proxy</code> to get the full original output.</p>

<p>In practice I see Claude using this a few times a day. The filters are conservative enough that failures and errors always come through, but since Claude falls back to the unfiltered output regularly, the savings are not as great as one would expect at first.</p>

<h2 id="what-it-doesnt-do">What It Doesn’t Do</h2>

<p>gtk is not a general-purpose output compressor. It doesn’t try to summarize arbitrary text or use any AI to decide what’s relevant. Each filter is hand-written for a specific command and knows exactly what matters for that command. There’s no magic, no heuristics beyond “does this line match a known noise pattern.”</p>

<p>It also doesn’t help with non-CLI token costs. If you’re burning tokens on large file reads or massive prompts, gtk won’t touch those. It only filters the output of bash commands.</p>

<h2 id="the-end-result">The End Result?</h2>

<p>Inconclusive. I think I save tokens, but not as much as I hoped: it is not a night-and-day difference. It is hard to properly measure. Does the LLM result improve? Hard to say. Do I get more meaningful conversations? Hard to say. Does Claude’s regular use of <code class="language-plaintext highlighter-rouge">gtk proxy</code> negate the savings? Not completely, but it happens often enough to have an impact.</p>

<p>Saying for sure would require a lot more data and proper testing, which is not something I am going to do. For now I will keep it running in my setup: it does no harm and I think I see a benefit in token usage, but I cannot prove it.</p>

<p>If the concept interests you, just use <a href="https://github.com/rtk-ai/rtk">rtk</a>. It is well-maintained, supports more tools than my version, and integrates with most AI coding agents out of the box. I built my own because I wanted to customize the filters for my specific workflow, but for most people rtk will do everything you need.</p>

<hr />

<p><em>Building your own Claude Code tooling? <a href="#" onclick="task1(); return false;">Get in touch</a> to compare notes.</em></p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://github.com/rtk-ai/rtk">rtk (Rust Token Killer)</a> – the original Rust project that inspired gtk</li>
  <li><a href="https://www.rtk-ai.app/">rtk website</a> – documentation and install instructions for rtk</li>
  <li><a href="https://code.claude.com/docs/en/hooks">Claude Code Hooks Reference</a> – how PreToolUse hooks work</li>
  <li><a href="/ai/security/development/tools/2026/02/01/securing-claude-code-hooks-best-practices.html">Securing YOLO Mode</a> – my earlier post on using hooks for security</li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="tools" /><summary type="html"><![CDATA[How a CLI proxy filters shell command output to reduce token usage in AI coding sessions with Claude Code. The concept, the implementation, and what it actually saves.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/gtk-cutting-llm-token-costs-cli-output-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/gtk-cutting-llm-token-costs-cli-output-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Security by Design: Using Project CodeGuard as AI Guardrails</title><link href="https://www.albertsikkema.com/ai/development/security/2026/04/07/security-by-design-with-project-codeguard.html" rel="alternate" type="text/html" title="Security by Design: Using Project CodeGuard as AI Guardrails" /><published>2026-04-07T00:00:00+00:00</published><updated>2026-04-07T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/security/2026/04/07/security-by-design-with-project-codeguard</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/security/2026/04/07/security-by-design-with-project-codeguard.html"><![CDATA[<figure>
  <img src="/assets/images/security-by-design-codeguard.jpg" alt="Hedgehog walking through green grass, nature's own security by design" width="1920" height="1440" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>Security by design, the natural way. Photo by <a href="https://unsplash.com/@smoliak">Viktor Smoliak</a> on <a href="https://unsplash.com/photos/7B9Ia1dQL_U">Unsplash</a>.</figcaption>
</figure>

<p>In the <a href="/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails.html">previous two posts</a> I shared 18 best practice files that keep AI-generated code production-ready, covering architecture, error handling, security, privacy, accessibility, and more. Those are my own distillations, written from experience and grounded in standards and literature. This post adds another layer: security by design (or better: ‘early implementation’).</p>

<p>The rules come from <a href="https://project-codeguard.org/">Project CodeGuard</a>, an open-source, model-agnostic security framework maintained by <a href="https://github.com/cosai-oasis/project-codeguard">CoSAI (Coalition for Secure AI)</a> and originally developed by Cisco. The framework provides over a hundred security rules derived from OWASP cheat sheets and CWE guidance, formatted specifically so AI coding agents can use them during code generation and review.</p>

<p>I’m not going to walk through every rule: there are 109 of them and that would be a very boring read. Instead, I’ll show how I’ve wired them into my Claude Code setup so they activate at the right moments, and how you could do something similar regardless of your tooling.</p>

<h2 id="what-is-project-codeguard">What is Project CodeGuard?</h2>

<p>Project CodeGuard ships two sets of rules:</p>

<p><strong>Core rules</strong> (22 files) cover broad security domains: input validation, authentication, authorization, session management, cryptography, file handling, logging, container security, supply chain, privacy, and more. These are language-tagged: each file lists which programming languages it applies to in its YAML frontmatter.</p>

<p><strong>OWASP rules</strong> (86 files) are more granular. They map closely to individual <a href="https://cheatsheetseries.owasp.org/">OWASP Cheat Sheets</a>: SQL injection prevention, CSRF, XSS, content security policy, JWT handling, OAuth2, Docker security, Kubernetes security, GraphQL, REST assessment, and dozens more.</p>

<p>The format is straightforward. Each rule is a markdown file with YAML frontmatter listing the applicable languages and a description. The body contains the actual guidance: principles, do/don’t patterns, code examples in multiple languages, and implementation checklists. Here’s a taste of the authorization rule:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Authorization and access control (RBAC/ABAC/ReBAC, IDOR, mass assignment, transaction auth)</span>
<span class="na">languages</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">c</span><span class="pi">,</span> <span class="nv">go</span><span class="pi">,</span> <span class="nv">java</span><span class="pi">,</span> <span class="nv">javascript</span><span class="pi">,</span> <span class="nv">php</span><span class="pi">,</span> <span class="nv">python</span><span class="pi">,</span> <span class="nv">ruby</span><span class="pi">,</span> <span class="nv">typescript</span><span class="pi">,</span> <span class="nv">yaml</span><span class="pi">]</span>
<span class="na">alwaysApply</span><span class="pi">:</span> <span class="no">false</span>
<span class="nn">---</span>

<span class="gu">## Authorization &amp; Access Control</span>

<span class="gu">### Core Principles</span>
<span class="p">1.</span> Deny by Default
<span class="p">2.</span> Principle of Least Privilege
<span class="p">3.</span> Validate Permissions on Every Request
<span class="p">4.</span> Prefer ABAC/ReBAC over RBAC
</code></pre></div></div>

<p>The rules don’t guarantee secure code. They steer the model toward safer patterns and away from common mistakes. Think of them as a knowledgeable colleague looking over the model’s shoulder, one who has memorized every OWASP cheat sheet.</p>

<h2 id="how-i-use-them">How I Use Them</h2>

<p>I’ve built a fairly extensive Claude Code setup with custom commands, agents, skills, and helper scripts, more than what most people will have. The examples below show how I’ve wired the rules into that setup. Your setup will look different, and that’s fine. The underlying pattern is what matters: load the right rules at the right moment, not all of them all the time.</p>

<p>The rules sit in <code class="language-plaintext highlighter-rouge">memories/security_rules/</code> with two subdirectories: <code class="language-plaintext highlighter-rouge">core/</code> and <code class="language-plaintext highlighter-rouge">owasp/</code>. They’re not loaded into every conversation: that would burn tokens on rules about Kubernetes security when you’re writing a Python CLI tool. Instead, they’re pulled in selectively by the parts of my setup that need them.</p>

<h3 id="planning-phase-risk-analysis">Planning Phase Risk Analysis</h3>

<p>Before writing a plan, my setup spawns a quality risk analyzer agent. It takes the feature description, figures out which security areas apply (does this feature handle user input? sessions? file uploads?), reads the relevant CodeGuard rules, then surfaces risks and recommendations that get baked into the plan itself.</p>

<p>The result is that security considerations don’t first pop up at review. They’re in the plan from the start, with specific rules referenced. The developer (or the model) knows what to watch for during implementation, which fits the security-by-design paradigm nicely.</p>

<h3 id="security-aware-pr-reviews">Security-Aware PR Reviews</h3>

<p>My PR review workflow spawns multiple agents in parallel: code quality, test coverage, best practices, and security. The security agent is where CodeGuard rules come alive.</p>

<p>The agent’s instructions include a mapping table: if the diff touches user input, load the input validation rules, and so on. The agent reads the 3-5 relevant rule files, then applies them to the actual changed code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>| If code handles...  | Read these rules                                          |
|----------------------|----------------------------------------------------------|
| User input           | input-validation-injection, injection-prevention          |
| Authentication       | authentication, password-storage, credential-stuffing     |
| Authorization        | authorization-access-control, insecure-direct-object-ref  |
| File operations      | file-handling-and-uploads, file-upload                    |
| Docker/K8s           | devops-ci-cd-containers, docker-security, kubernetes      |
</code></pre></div></div>

<p>This is the same table that appears in the agent definition, the code audit command, and the quality risk analyzer, which keeps the mapping consistent across all those steps.</p>
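<p>If you prefer to keep such a mapping in code rather than in prompt text, a lookup like the one below would do. This is a sketch, not how my setup is implemented: the area keys and the <code class="language-plaintext highlighter-rouge">selectRules</code> function are hypothetical, and the rule names are copied as they appear in the table above. Deciding which areas a diff touches is left to the agent.</p>

```go
package main

import (
    "fmt"
    "sort"
)

// rulesFor maps a code area to the rule files worth loading for it.
var rulesFor = map[string][]string{
    "user-input":      {"input-validation-injection", "injection-prevention"},
    "authentication":  {"authentication", "password-storage", "credential-stuffing"},
    "authorization":   {"authorization-access-control", "insecure-direct-object-ref"},
    "file-operations": {"file-handling-and-uploads", "file-upload"},
    "containers":      {"devops-ci-cd-containers", "docker-security", "kubernetes"},
}

// selectRules returns the sorted, de-duplicated rule names for the given
// areas. Unknown areas contribute nothing.
func selectRules(areas ...string) []string {
    seen := map[string]bool{}
    var out []string
    for _, area := range areas {
        for _, rule := range rulesFor[area] {
            if !seen[rule] {
                seen[rule] = true
                out = append(out, rule)
            }
        }
    }
    sort.Strings(out)
    return out
}

func main() {
    for _, rule := range selectRules("authorization", "file-operations") {
        fmt.Println(rule)
    }
}
```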

<h3 id="full-security-audits">Full Security Audits</h3>

<p>My code audit command runs a full security analysis across 18 areas, split into three phases: critical controls (data isolation, injection, authentication, XSS, file uploads, secrets), security configuration (rate limiting, CSRF, RBAC, database, logging, third-party integrations), and implementation patterns (secure coding, error handling, API security, frontend, dependencies, performance).</p>

<p>Phase 0 of this audit is “framework discovery”: it detects the tech stack, then loads the relevant CodeGuard rules filtered by language. The audit then cross-references findings against both my own best practice files and the CodeGuard rules, giving two independent perspectives on the same code.</p>
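<p>Filtering rules by language only needs the <code class="language-plaintext highlighter-rouge">languages</code> line from the frontmatter. The sketch below assumes the flow-style list shown in the authorization rule example earlier in this post and uses plain string handling. The function names are hypothetical, and a real YAML parser would be the safer choice for anything beyond this one field.</p>

```go
package main

import (
    "fmt"
    "strings"
)

// ruleLanguages extracts the entries of a "languages: [a, b, c]" line from
// the frontmatter block at the top of a rule file. It returns nil when the
// file has no frontmatter or no languages line.
func ruleLanguages(markdown string) []string {
    lines := strings.Split(markdown, "\n")
    if strings.TrimSpace(lines[0]) != "---" {
        return nil
    }
    for _, line := range lines[1:] {
        line = strings.TrimSpace(line)
        if line == "---" {
            break // end of frontmatter
        }
        if !strings.HasPrefix(line, "languages:") {
            continue
        }
        list := strings.Trim(strings.TrimSpace(strings.TrimPrefix(line, "languages:")), "[]")
        var langs []string
        for _, l := range strings.Split(list, ",") {
            if l = strings.TrimSpace(l); l != "" {
                langs = append(langs, l)
            }
        }
        return langs
    }
    return nil
}

// appliesTo reports whether the rule file lists the given language.
func appliesTo(markdown, lang string) bool {
    for _, l := range ruleLanguages(markdown) {
        if l == lang {
            return true
        }
    }
    return false
}

func main() {
    rule := "---\ndescription: Authorization and access control\nlanguages: [c, go, java]\nalwaysApply: false\n---\n\n## Authorization"
    fmt.Println(appliesTo(rule, "go"))   // true
    fmt.Println(appliesTo(rule, "rust")) // false
}
```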

<h2 id="how-you-could-use-them">How You Could Use Them</h2>

<p>You don’t need my specific setup to benefit from these rules. Here’s the general pattern:</p>

<p><strong>Step 1: Get the rules.</strong> Clone the <a href="https://github.com/project-codeguard/rules">Project CodeGuard rules repository</a>. The rules are in <code class="language-plaintext highlighter-rouge">core/</code> and <code class="language-plaintext highlighter-rouge">owasp/</code> directories.</p>

<p><strong>Step 2: Put them where your agent can find them.</strong> For Claude Code, that means somewhere in your project directory or a path your configuration references. I use <code class="language-plaintext highlighter-rouge">memories/security_rules/</code> but any path works.</p>

<p><strong>Step 3: Don’t load everything.</strong> The rules total a lot of tokens, so loading them selectively, based on what the code actually does, keeps costs down.</p>

<p><strong>Step 4: Build a mapping.</strong> Create a simple lookup: “if the code handles X, load rules Y and Z.” This is the most important part. Without it, you either load nothing (useless) or everything (expensive and noisy).</p>

<p><strong>Step 5: Wire it into your review/audit workflow.</strong> Whether that’s a custom command, a hook, a skill, or just a prompt, the rules need to be loaded at the point where security matters. That’s in planning and review.</p>

<p>The rules also come with a <a href="https://project-codeguard.org/getting-started/">SKILLS.md template</a> that you can drop directly into coding agents that support skills (Claude Code, Cursor, Copilot). The template defines when to activate the skill and how to apply the rules based on what the code does. I do not use it as a skill. Why? Because that would imply that I need to think about it, and the whole point is that I do not need to remember to use it: it is part of everything I do. It should be there in the background, just out of sight but always steering the plans and reviews.</p>

<h2 id="why-external-rules-and-not-just-be-secure">Why External Rules and Not Just “Be Secure”</h2>

<p>Simply prompting “Write secure code” is not gonna work for you. It is too simple, too broad and does not comply with the idea of security by design.</p>

<p>The model has extensive security knowledge in its training data. The art of getting this to work is activating the right knowledge at the right time. When you load a CodeGuard rule about SQL injection prevention, you’re not teaching the model something new (it has seen plenty of material about injection prevention); you’re simply telling it “this is relevant right now, apply it.” The rule contains specific patterns (use parameterized queries, never concatenate user input into SQL, use least-privilege database users) that the model knows but might not prioritize without the prompt.</p>

<p>This is the same principle behind the best practice files from the previous posts: bring specific knowledge to the front of the model’s attention when it matters. The good thing is that CodeGuard rules are maintained, OWASP-backed, and cover a broader surface than I could write.</p>

<p>Keep in mind that this does not make your code secure: it is still up to you to decide if it is secure in your context. But in my experience this does help a lot! (My automated code reviews have never been as sharp and to the point as they are now that I have started applying these principles.)</p>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The CodeGuard rules are available from the <a href="https://github.com/project-codeguard/rules">Project CodeGuard rules repository</a>. The best practice files from the previous posts are in <a href="https://github.com/albertsikkema/claude-code-best-practices">my public repository</a>. Grab what’s relevant to your stack and wire them into your workflow.</p>

<hr />

<p><em>Have questions or want to share your approach? <a href="#" onclick="task1(); return false;">Get in touch</a>.</em></p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://project-codeguard.org/">Project CodeGuard (old Cisco repo)</a>, the framework</li>
  <li><a href="https://github.com/cosai-oasis/project-codeguard">CoSAI / Project CodeGuard (OASIS)</a>, the CoSAI-maintained repository</li>
  <li><a href="https://cheatsheetseries.owasp.org/">OWASP Cheat Sheet Series</a>, the source material for many CodeGuard rules</li>
  <li><a href="https://owasp.org/www-project-top-ten/">OWASP Top 10</a></li>
  <li><a href="https://github.com/cosai-oasis">CoSAI (Coalition for Secure AI)</a>, the broader initiative behind CodeGuard, with more AI security projects worth exploring</li>
  <li><a href="https://blogs.cisco.com/ai/announcing-new-framework-securing-ai-generated-code">Announcing a New Framework for Securing AI-Generated Code (Cisco Blog)</a></li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="security" /><summary type="html"><![CDATA[How I use 109 OWASP-based security rules from Project CodeGuard to embed security by design into AI coding workflows.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/security-by-design-with-project-codeguard-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/security-by-design-with-project-codeguard-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Evidence-Based Best Practices as AI Guardrails (Part 2)</title><link href="https://www.albertsikkema.com/ai/development/best-practices/security/2026/04/01/security-privacy-production-hardening-ai-guardrails.html" rel="alternate" type="text/html" title="Evidence-Based Best Practices as AI Guardrails (Part 2)" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/best-practices/security/2026/04/01/security-privacy-production-hardening-ai-guardrails</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/best-practices/security/2026/04/01/security-privacy-production-hardening-ai-guardrails.html"><![CDATA[<figure>
  <img src="/assets/images/security-privacy-hardening-antelope-canyon.jpg" alt="Sunlight beam piercing through layered sandstone walls of Antelope Canyon, Arizona" width="5472" height="3648" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>Layers of protection, carved deep. Photo by <a href="https://www.pexels.com/@madhu-shesharam-108388377">Madhu Shesharam</a> on <a href="https://www.pexels.com/photo/a-cave-of-red-rock-formation-with-sunlight-reflection-9579434/">Pexels</a>.</figcaption>
</figure>

<p>This is part two of a two-part series (originally planned as three parts) on using evidence-based best practice files to keep AI-generated code production-ready and improve its general quality. <a href="/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails.html">Part 1</a> covered the foundations: architecture, error handling, testing, API design, data integrity, and structured logging – with detailed examples of how each file works. I do not like dragging things out, and I get bored easily, so better to get it done: here is the rest!</p>

<p>This post covers the remaining twelve files. Rather than repeating the deep-dive format, I’ll give you the core idea behind each: the problem it solves, the principle it encodes, and why Claude gets it wrong without it. The files themselves contain the full rules, code examples, and trade-offs – grab them from the <a href="https://github.com/albertsikkema/claude-code-best-practices">public repository</a> and read what’s relevant to your stack.</p>

<h2 id="what-this-post-covers">What This Post Covers</h2>

<p><strong>Security and Privacy</strong></p>

<ol>
  <li><a href="#1-authorization">Authorization</a> – the difference between “logged in” and “allowed”</li>
  <li><a href="#2-defense-in-depth-validation">Defense-in-Depth Validation</a> – why one validation layer is never enough</li>
  <li><a href="#3-container-security">Container Security</a> – a secret “deleted” in layer 5 still exists in layer 3</li>
  <li><a href="#4-privacy-by-design">Privacy by Design</a> – the safest data is data you never collected</li>
</ol>

<p><strong>Operations and Reliability</strong></p>

<ol start="5">
  <li><a href="#5-resilience-patterns">Resilience Patterns</a> – designing for the certainty that dependencies will fail</li>
  <li><a href="#6-zero-downtime-deployment">Zero-Downtime Deployment</a> – old and new code run simultaneously, plan for it</li>
  <li><a href="#7-observability">Observability</a> – turning 2-hour investigations into 5-minute ones</li>
  <li><a href="#8-background-job-patterns">Background Job Patterns</a> – not all work belongs in the request cycle</li>
</ol>

<p><strong>User-Facing Quality</strong></p>

<ol start="9">
  <li><a href="#9-accessibility">Accessibility</a> – build interfaces that work for everyone</li>
  <li><a href="#10-seo">SEO</a> – make your structure machine-readable</li>
</ol>

<p><strong>External Boundaries</strong></p>

<ol start="11">
  <li><a href="#11-robots-and-scraping-protection">Robots and Scraping Protection</a> – control what automated agents can access</li>
  <li><a href="#12-llm-integration-patterns">LLM Integration Patterns</a> – route cheap before expensive</li>
</ol>

<hr />

<h2 id="security-and-privacy">Security and Privacy</h2>

<h3 id="1-authorization">1. Authorization</h3>

<p>Authentication and authorization are two different things. Claude is quite loose with the terminology: it builds login flows, JWT validation, and session management, then considers security “done.” But knowing <em>who</em> a user is tells you nothing about <em>what they’re allowed to do</em>. (Important to realise: this is not necessarily Claude’s ‘fault’; it probably has more to do with the training data and the ambiguity in it.)</p>

<p>The file encodes default-deny authorization: every request is unauthorized unless explicitly permitted. Object-level checks (not just “can users access orders” but “can <em>this</em> user access <em>this specific</em> order”), centralized RBAC, and re-authentication for destructive operations. Without it, Claude writes <code class="language-plaintext highlighter-rouge">GET /api/orders/:id</code> that returns any order to any authenticated user. IDOR (Insecure Direct Object Reference – where a user accesses resources by manipulating an identifier like an ID in the URL, without the server checking whether they’re allowed to) sits at #1 in the <a href="https://owasp.org/www-project-top-ten/">OWASP Top 10</a> as a part of A01:2021 Broken Access Control. And Claude in my experience is notoriously bad and inconsistent when it comes to IDOR.</p>
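<p>An object-level, default-deny check is small enough to show in full. The sketch below is an illustration of the principle, not an excerpt from the best practice file: the <code class="language-plaintext highlighter-rouge">User</code> and <code class="language-plaintext highlighter-rouge">Order</code> types, the <code class="language-plaintext highlighter-rouge">support</code> role, and <code class="language-plaintext highlighter-rouge">canReadOrder</code> are all invented for the example.</p>

```go
package main

import "fmt"

// User is the authenticated caller.
type User struct {
    ID    string
    Roles []string
}

// Order is the object being requested.
type Order struct {
    ID      string
    OwnerID string
}

// canReadOrder is default-deny: it returns true only when an explicit rule
// allows this user to read this specific order.
func canReadOrder(u *User, o *Order) bool {
    if u == nil || o == nil {
        return false // unauthenticated caller or unknown object
    }
    if o.OwnerID == u.ID {
        return true // object-level rule: owners may read their own orders
    }
    for _, role := range u.Roles {
        if role == "support" {
            return true // role-level rule, kept in one central place
        }
    }
    return false // everything else stays denied
}

func main() {
    order := &Order{ID: "o42", OwnerID: "u1"}
    alice := &User{ID: "u1"}
    bob := &User{ID: "u2"}
    agent := &User{ID: "u3", Roles: []string{"support"}}

    fmt.Println(canReadOrder(alice, order)) // true
    fmt.Println(canReadOrder(bob, order))   // false
    fmt.Println(canReadOrder(agent, order)) // true
    fmt.Println(canReadOrder(nil, order))   // false
}
```

<p>A handler for <code class="language-plaintext highlighter-rouge">GET /api/orders/:id</code> would call a check like this after loading the order and before returning it. Being authenticated (Bob) is not enough.</p>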

<h3 id="2-defense-in-depth-validation">2. Defense-in-Depth Validation</h3>

<p>Most of the time Claude adds one layer of input validation (if at all) – usually a schema check at the handler – and moves on. That catches missing fields and wrong types. It does nothing against path traversal, injection patterns, or context-specific attacks. That is not a problem in itself: I do not expect to create perfect software in one go. But I am supposed to spot the gap and deliver a secure product. Defense in depth is a step in getting there.</p>

<p>The file defines four independent validation layers: structural constraints, format and character restrictions, explicit security pattern checks (path traversal, null bytes, injection), and downstream sanitization (parameterized queries, shell escaping, HTML encoding). The key principle: each layer assumes the others might fail. Remove any single layer and the system is still protected (perhaps not as well as with that layer in place, but you get the point: multiple layers protect you from your own or Claude’s stupidity).</p>
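<p>To make the layering concrete, here is a rough sketch for one input: an uploaded file name. Everything in it (the limits, the allow-list pattern, the <code class="language-plaintext highlighter-rouge">/srv/uploads</code> root) is an assumption for illustration. The fourth layer here is a containment check at the point of use, standing in for the downstream sanitization the file describes.</p>

```python
import re
from pathlib import PurePosixPath

MAX_NAME_LENGTH = 64
ALLOWED_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")
UPLOAD_ROOT = PurePosixPath("/srv/uploads")  # hypothetical storage root


class ValidationError(ValueError):
    """Raised when any validation layer rejects the input."""


def resolve_upload_path(name: object) -> PurePosixPath:
    # Layer 1: structural constraints (type, presence, length).
    if not isinstance(name, str) or not name:
        raise ValidationError("name must be a non-empty string")
    if len(name) > MAX_NAME_LENGTH:
        raise ValidationError("name too long")

    # Layer 2: format and character restrictions (allow-list).
    if not ALLOWED_NAME.fullmatch(name):
        raise ValidationError("name contains disallowed characters")

    # Layer 3: explicit security patterns, checked even though layer 2
    # should already have rejected most of them.
    if "\x00" in name or ".." in name or "/" in name or "\\" in name:
        raise ValidationError("name matches a blocked pattern")

    # Layer 4: downstream check at the point of use: the final path
    # must still sit directly inside the upload root.
    candidate = UPLOAD_ROOT / name
    if candidate.parent != UPLOAD_ROOT:
        raise ValidationError("resolved path escapes the upload root")
    return candidate
```

<p>A name like <code class="language-plaintext highlighter-rouge">a..b</code> passes the allow-list in layer 2 and is only stopped by layer 3, which is the whole argument for not relying on a single check.</p>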

<h3 id="3-container-security">3. Container Security</h3>

<p>Claude is actually quite good at writing Docker container definitions. It does not reach for full Ubuntu base images, run as root, or pass secrets as build arguments that stay visible in <code class="language-plaintext highlighter-rouge">docker history</code> forever. It does not build a functional container with the attack surface of a full server. But it is smart to keep it aligned with your (or in this case my) vision on security and building layers.</p>

<p>The file mandates minimal base images (distroless or slim), multi-stage builds, non-root execution, read-only filesystems, dropped capabilities, and never baking secrets into layers. A Go app in distroless is small, with no shell to exploit. The same app on a full Ubuntu image can grow to 500MB, with a shell, a package manager, and plenty of other tools for a bad actor to work with.</p>

<h3 id="4-privacy-by-design">4. Privacy by Design</h3>

<p>Sometimes I think Claude has no concept of privacy, or was never trained on its basic principles. Ask it to build a user profile page and it’ll collect date of birth, phone number, and address “in case we need them later.” That “in case” creates legal liability under <a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679">GDPR</a> that compounds over time. There are plenty of examples (also here in the Netherlands) of very private data ending up on the dark web or being held for ransom. Not something you want to run into. So the starting point: do not collect it unless absolutely necessary (and most of it is absolutely NOT necessary).</p>

<p>The file encodes data minimization (collect only what’s functionally necessary), consent management (explicit, informed, granular, revocable), right to erasure (covering all copies: database, backups, caches, third-party systems), data portability, and cookie compliance under the <a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A02009L0136-20201221">ePrivacy Directive</a>. Privacy becomes a first-class architectural concern, not a compliance checkbox. Build in privacy awareness from the start phase of your product; the right to erasure and data portability can wait for a later phase.</p>

<hr />

<h2 id="operations-and-reliability">Operations and Reliability</h2>

<h3 id="5-resilience-patterns">5. Resilience Patterns</h3>

<p>Sometimes you get lucky and Claude remembers to build in some resilience. Most of the time it does not, and you find out when testing or, even better, in production: the first time a dependency slows down, the entire system cascades into failure. Better to build this in from the start and explicitly design for it.</p>

<p>The file defines timeouts on every outbound call (with specific defaults per type), deadline propagation, retries with exponential backoff and jitter (only for idempotent operations, only for transient errors), circuit breakers (closed/open/half-open), graceful degradation (hard vs soft dependencies), and backpressure. The composition order matters: backpressure -&gt; circuit breaker -&gt; timeout -&gt; retry -&gt; degradation. As I said in the previous post: Claude knows much more about this than I ever will, but you need to bring it to the center of its attention (not unlike how you focus a human’s attention, I find).</p>
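<p>As an illustration of just the retry step, here is a small sketch of exponential backoff with full jitter. The function name, the defaults, and the <code class="language-plaintext highlighter-rouge">TransientError</code> type are my own assumptions, not the file’s; timeouts, circuit breaking, and degradation would wrap around this.</p>

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """An error worth retrying, e.g. a timeout or a 503."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an idempotent operation on transient errors only.

    Uses exponential backoff with full jitter: each wait is a random
    value between 0 and min(max_delay, base_delay * 2**attempt).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: let the caller degrade
            ceiling = min(max_delay, base_delay * 2**attempt)
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")
```

<p>Any other exception type propagates on the first attempt, which keeps terminal errors (a bad request, a failed auth check) from being retried. The <code class="language-plaintext highlighter-rouge">sleep</code> parameter is injectable so the behaviour can be tested without actually waiting.</p>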

<h3 id="6-zero-downtime-deployment">6. Zero-Downtime Deployment</h3>

<p>Deployments and data migrations with Claude often go great, and sometimes you are in deep trouble. So it is good, for your own sanity and job security, to protect the data at all costs. How far you take this is up to you. You decide what is acceptable (no data loss is priority one; zero or near-zero downtime is a good second). If need be, make sure to run both versions of your database at once. Or just go for it. It is a thrill (if you like that kind of thrill). Rule number one: make sure you can always go back! Rule number two: back up. Rule number three: make sure you can roll back. Rule number four: back up. (And rule five and following: make sure your backups can actually be restored. Best to test that before deploying.)</p>

<p>The file mandates expand-contract migrations in four phases (expand, backfill, deploy new code, contract), health-check-gated rollouts with separate <code class="language-plaintext highlighter-rouge">/health</code> and <code class="language-plaintext highlighter-rouge">/ready</code> endpoints, graceful shutdown on SIGTERM, and tested rollback scripts. Every change must be safe for both old and new versions to read and write at the same time. Make this as complex as you want or the situation warrants.</p>

<h3 id="7-observability">7. Observability</h3>

<p>Simply following a data flow through your code using correlation IDs is really handy. It helps with debugging, monitoring, and logging. Claude sometimes adds <code class="language-plaintext highlighter-rouge">logger.info("request processed")</code>, considers observability done, and is very sure of that. No correlation IDs, no metrics, no structured context. When something breaks, you’re searching unstructured logs with no way to connect a failed request to its downstream calls. And the good thing about doing this properly (or at least adding some improvements that help observability): Claude will help you debug better, because it can easily work with this.</p>

<p>The file covers the three pillars: structured logs (what happened), metrics (how much), and distributed traces (the journey of a request). Every request gets a correlation ID propagated through all downstream calls. The Four Golden Signals (latency, traffic, errors, saturation), RED method for services, USE method for resources. Alerts with severity-appropriate thresholds and enough context to start investigating immediately.</p>
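<p>A small sketch of the correlation ID part, using only the standard library. The logger setup, field names, and the <code class="language-plaintext highlighter-rouge">handle_request</code> and <code class="language-plaintext highlighter-rouge">charge_payment</code> functions are illustrative assumptions; a real service would typically take the incoming ID from a request header and forward it on outbound calls.</p>

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the current request or task.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "correlation_id": correlation_id.get(),
            }
        )


def build_logger(name: str, stream: io.TextIOBase) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def charge_payment(logger: logging.Logger) -> None:
    # A downstream step: it never receives the ID explicitly.
    logger.info("charging payment")


def handle_request(logger: logging.Logger, incoming_id: Optional[str]) -> str:
    """Reuse the caller's ID if present, otherwise mint a new one."""
    request_id = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(request_id)
    try:
        logger.info("request received")
        charge_payment(logger)
        return request_id
    finally:
        correlation_id.reset(token)
```

<p>Every log line written while the request is being handled carries the same <code class="language-plaintext highlighter-rouge">correlation_id</code>, so a single search connects the failed request to its downstream calls.</p>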

<h3 id="8-background-job-patterns">8. Background Job Patterns</h3>

<p>I suspect that it has to do with the training data, but Claude will almost never propose moving work into a background process unless you explicitly ask it to. Mentioning it in these files does increase the likelihood that it will be used in the places that matter. (It is funny: when you have worked with Claude for a while and start adding these files, it seems to become a lot better at writing code that makes sense.)</p>

<p>The file defines three patterns with clear selection criteria: fire-and-forget (return 202, spawn background work, no status tracking), tracked jobs (return a job ID, persist status to a database, client polls for completion), and queue-based processing (decouple producer and consumer with a message queue, bounded retries, dead letter queues). Cross-cutting concerns are covered: isolation between request and worker threads, correlation IDs for background logging, timeouts on every job, and graceful shutdown behavior. Start simple, scale up when you need it. Background processes are not the answer to all problems, but they have their uses.</p>
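<p>The middle pattern, tracked jobs, could be sketched like this. The in-memory <code class="language-plaintext highlighter-rouge">JOBS</code> dict is a stand-in for the database the file asks for, and the function names and the <code class="language-plaintext highlighter-rouge">/jobs/...</code> URL shape are assumptions of mine.</p>

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical in-memory status store; a real system would persist this.
JOBS: dict[str, dict] = {}
executor = ThreadPoolExecutor(max_workers=2)


def _run(job_id: str, work: Callable[[], object]) -> None:
    JOBS[job_id]["status"] = "running"
    try:
        JOBS[job_id]["result"] = work()
        JOBS[job_id]["status"] = "succeeded"
    except Exception as exc:  # record the failure instead of losing it
        JOBS[job_id]["error"] = str(exc)
        JOBS[job_id]["status"] = "failed"


def submit_job(work: Callable[[], object]):
    """Return (202, body) immediately; the work runs in the background."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "queued"}
    executor.submit(_run, job_id, work)
    return 202, {"job_id": job_id, "status_url": f"/jobs/{job_id}"}


def get_job_status(job_id: str):
    """What a polling endpoint would return for this job."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"detail": "unknown job"}
    return 200, dict(job)
```

<p>The client gets a job ID back right away and polls <code class="language-plaintext highlighter-rouge">get_job_status</code> until the status is <code class="language-plaintext highlighter-rouge">succeeded</code> or <code class="language-plaintext highlighter-rouge">failed</code>. Timeouts, correlation IDs, and graceful shutdown are left out to keep the sketch short.</p>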

<hr />

<h2 id="user-facing-quality">User-Facing Quality</h2>

<h3 id="9-accessibility">9. Accessibility</h3>

<p>Claude generates <code class="language-plaintext highlighter-rouge">&lt;div onClick={...}&gt;</code> instead of <code class="language-plaintext highlighter-rouge">&lt;button&gt;</code>, skips <code class="language-plaintext highlighter-rouge">alt</code> attributes, ignores keyboard navigation, and uses color alone to convey meaning. The result looks fine. A screen reader can’t parse it, a keyboard user can’t navigate it, and you’re non-compliant with the <a href="https://ec.europa.eu/social/main.jsp?catId=1202">European Accessibility Act</a>. This is one thing Claude is actually really bad at: it has never given me ideas or steps to improve accessibility. Again the training data, I think. Anyway, it is vital to include this. It used to be quite a hassle, but nowadays you cannot build a webpage without paying attention to WCAG. Implementing the basics is easy with Claude; you just have to tell it what you want. This file helps with that. But in this case the file alone is not enough. In a later post I will share how I (attempt to) solve this.</p>

<p>The file targets <a href="https://www.w3.org/TR/WCAG22/">WCAG 2.2</a> Level AA: semantic HTML elements for their intended purpose, keyboard accessibility for all interactive elements, text alternatives for all non-text content, color contrast ratios (4.5:1 for normal text, 3:1 for large text and UI components), labeled form inputs, respect for <code class="language-plaintext highlighter-rouge">prefers-reduced-motion</code>, proper ARIA live regions for dynamic content, text resizing support, and automated axe-core checks in CI. Automated tools catch only an estimated 30-40% of issues, so the file also specifies what requires manual testing. Pay particular attention to the structure of a page.</p>
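<p>The contrast thresholds are one of the few accessibility rules you can compute directly. Below is a short sketch of the WCAG contrast ratio calculation (relative luminance of both colors, then the ratio of the lighter to the darker); the function names are mine, and colors are assumed to be 8-bit sRGB tuples.</p>

```python
def _channel(value: int) -> float:
    """Linearize one 8-bit sRGB channel."""
    c = value / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_channel(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(fg, bg, large_text: bool = False) -> bool:
    """4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

<p>Black on white gives the maximum ratio of 21:1. The gray <code class="language-plaintext highlighter-rouge">(118, 118, 118)</code> on white just passes for normal text, while <code class="language-plaintext highlighter-rouge">(119, 119, 119)</code> just fails, which is a handy pair for a unit test.</p>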

<h3 id="10-seo">10. SEO</h3>

<p>SEO is not automatically added to your page if you let Claude build it. For most projects I work on that is not an issue: internal tools and applications do not need that. But for a lot of pages visibility for search engines is crucial. BTW good performance is useful for every tool, so parts of this are useful for those internal tools as well.</p>

<p>The file covers crawlability (SSR for public pages, no orphan pages), canonical URLs (one URL per piece of content), meaningful title and meta tags, structured data via JSON-LD (Organization, Product, Article, FAQ, Breadcrumb), auto-generated sitemaps, hreflang for multi-language sites, Core Web Vitals optimization (LCP &lt; 2.5s, CLS &lt; 0.1, INP &lt; 200ms), proper redirect handling, and useful 404 pages. For multi-region sites: URL strategies (ccTLD vs subdomain vs subdirectory) with trade-offs for each.</p>

<hr />

<h2 id="external-boundaries">External Boundaries</h2>

<h3 id="11-robots-and-scraping-protection">11. Robots and Scraping Protection</h3>

<p>Even though I love using LLMs, the amount of bot traffic has risen tremendously because of them. The problem is that Claude doesn’t think about bot traffic (a cynic could argue that this is in its own, or its owner’s, interest). It builds a public API without rate limiting, exposes admin paths in <code class="language-plaintext highlighter-rouge">robots.txt</code> (telling attackers exactly where to look), and ignores the fact that AI training crawlers will scrape everything they can reach.</p>

<p>The file separates search engine crawlers (usually welcome) from AI training crawlers (block by default: GPTBot, CCBot, Google-Extended, Bytespider, ClaudeBot, and others). Per-page control via meta tags for indexing decisions. Server-side rate limiting on all public endpoints – <code class="language-plaintext highlighter-rouge">robots.txt</code> is advisory only, not enforcement. API scraping protection through authentication, pagination limits, and monitoring. And a <code class="language-plaintext highlighter-rouge">security.txt</code> so researchers know where to report vulnerabilities. But keep in mind: a lot of bots ignore <code class="language-plaintext highlighter-rouge">robots.txt</code>. So do not be surprised if your cloud bills go through the roof because of a sudden spike in bot interest. (This is actually a really good reason to run your applications on your own (rented) hardware.)</p>
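<p>Since enforcement has to happen server-side, here is a minimal token bucket sketch for rate limiting. The class, the per-client dictionary, and the default numbers are assumptions for illustration; in production this usually lives in a reverse proxy, an API gateway, or a shared store rather than in process memory.</p>

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        refill = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client key (API key, IP address, ...), created on first use.
_buckets: dict[str, TokenBucket] = {}


def check_rate_limit(client_key: str, capacity: float = 5, rate: float = 1.0) -> int:
    """Return the HTTP status a request from this client should get."""
    bucket = _buckets.setdefault(client_key, TokenBucket(capacity, rate))
    return 200 if bucket.allow() else 429
```

<p>A client that exceeds its burst gets <code class="language-plaintext highlighter-rouge">429</code> until the bucket refills, whether or not it ever read <code class="language-plaintext highlighter-rouge">robots.txt</code>.</p>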

<h3 id="12-llm-integration-patterns">12. LLM Integration Patterns</h3>

<p>This is a work in progress and far from complete. The basic idea is that you use plain code for the deterministic parts of your workflow and an LLM for the non- or semi-deterministic parts. You do not want to use LLMs for every step; most of the time it is far better to use a bit of code: it is predictable, reliable, and testable. All things that an LLM is not. I am still improving this part, but found that this already helps in getting Claude to ‘think’ in the direction I want it to.</p>

<p>The file encodes a principle: route cheap before expensive. Keep deterministic work (filtering, sorting, validation) outside the LLM. Use a cheap classifier to filter out off-topic requests before they hit the expensive agent. Format context deliberately with truncation limits – a pure function that takes structured data and returns a token-efficient string. Manage prompts as versioned files, not hardcoded strings. Cache responses for repeated queries. Handle non-determinism explicitly: validate structured output against schemas, track fallback rates, retry once on malformed responses. Never expose raw LLM errors to users. And observe everything: prompt length, response length, token usage, latency, cost per request, tool call sequences. Your wallet (or your boss) will thank you.</p>
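<p>A rough sketch of the first three ideas together. The keyword gate is a deliberately crude stand-in for a cheap classifier, <code class="language-plaintext highlighter-rouge">call_llm</code> is a placeholder for whatever client you use, and all names and limits are hypothetical.</p>

```python
from typing import Callable

ON_TOPIC_KEYWORDS = {"order", "invoice", "refund", "delivery"}  # hypothetical


def is_on_topic(question: str) -> bool:
    """Cheap deterministic gate; stands in for a small classifier."""
    words = {w.strip(".,!?").lower() for w in question.split()}
    return bool(words.intersection(ON_TOPIC_KEYWORDS))


def format_context(orders: list[dict], max_items: int = 3, max_chars: int = 200) -> str:
    """Pure function: structured data in, compact bounded string out."""
    lines = [f"{o['id']}|{o['status']}|{o['total']}" for o in orders[:max_items]]
    if len(orders) > max_items:
        lines.append(f"(+{len(orders) - max_items} more)")
    return "\n".join(lines)[:max_chars]


def answer(question: str, orders: list[dict], call_llm: Callable[[str], str]) -> str:
    # Route cheap before expensive: off-topic requests never reach the LLM.
    if not is_on_topic(question):
        return "Sorry, I can only help with questions about your orders."
    # Deterministic work (filtering, sorting) stays in code.
    open_orders = sorted(
        (o for o in orders if o["status"] != "delivered"), key=lambda o: o["id"]
    )
    prompt = f"Context:\n{format_context(open_orders)}\n\nQuestion: {question}"
    return call_llm(prompt)
```

<p>Because <code class="language-plaintext highlighter-rouge">call_llm</code> is passed in, the routing and formatting can be unit tested with a stub, and the expensive call only happens for requests that survive the gate.</p>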

<hr />

<h2 id="the-full-picture">The Full Picture</h2>

<p>Across both posts, these 18 files form a connected system. Part 1 laid the structural foundations: how code is organized, how errors flow, how tests verify, how APIs behave, how data stays consistent, and how logs tell you what happened. This post covered what makes that foundation trustworthy in production: who can do what, how input is validated at multiple layers, how containers are hardened, how privacy is protected, how the system survives failure, how deploys avoid downtime, how you observe it all, how background work is managed, how interfaces work for everyone, how search engines find you, how bots are controlled, and how LLM integrations stay efficient.</p>

<p>No single file solves the problem. Authorization without observability means you won’t know when it fails. Resilience without observability means watching failures you can’t diagnose. Container security without privacy means a hardened runtime leaking PII through the application. The value is in the combination – and in the traceability back to requirements, specifications, and standards.</p>

<p>And as mentioned before, and here again: the key is not to try to tell Claude what it needs to do. It is to get Claude to ‘remember’ what it already knows. Bring to the front of its attention the things you think are necessary for that step in the development process of your product. It has way more data than you will ever comprehend, but it needs a hooman to keep it focused.</p>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>All 18 best practice files are in the <a href="https://github.com/albertsikkema/claude-code-best-practices">public repository</a>. Drop them into your setup, adapt them to your stack, or use them as a starting point for your own.</p>

<p><strong>In this series:</strong></p>

<ul>
  <li><a href="/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails.html">Part 1: Architecture, Error Handling, Testing, API Design, Data Integrity, Structured Logging</a></li>
  <li><strong>Part 2: Security, Privacy, Operations, Accessibility, SEO, and Integration Patterns</strong> (this post)</li>
</ul>

<hr />

<p><em>Have questions or want to share your approach to keeping AI-generated code in line? <a href="#" onclick="task1(); return false;">Get in touch</a>.</em></p>

<h2 id="standards-and-references">Standards and References</h2>

<ul>
  <li><a href="https://owasp.org/www-project-top-ten/">OWASP Top 10</a></li>
  <li><a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679">GDPR (Regulation 2016/679)</a></li>
  <li><a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A02009L0136-20201221">ePrivacy Directive (2009/136/EC)</a></li>
  <li><a href="https://www.w3.org/TR/WCAG22/">WCAG 2.2 - Web Content Accessibility Guidelines</a></li>
  <li><a href="https://ec.europa.eu/social/main.jsp?catId=1202">European Accessibility Act</a></li>
  <li><a href="https://github.com/opencontainers/image-spec">OCI Image Specification</a></li>
  <li><a href="https://www.rfc-editor.org/rfc/rfc9457">RFC 9457 - Problem Details for HTTP APIs</a></li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="best-practices" /><category term="security" /><summary type="html"><![CDATA[The remaining 12 best practice files that keep AI-generated code production-ready. Part 2 of 2: security, privacy, operations, accessibility, SEO, and integration patterns.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/security-privacy-production-hardening-ai-guardrails-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/security-privacy-production-hardening-ai-guardrails-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Evidence-Based Best Practices as AI Guardrails (Part 1)</title><link href="https://www.albertsikkema.com/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails.html" rel="alternate" type="text/html" title="Evidence-Based Best Practices as AI Guardrails (Part 1)" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/best-practices/2026/03/31/evidence-based-best-practices-ai-guardrails.html"><![CDATA[<figure>
  <img src="/assets/images/guardrails-desert-road-valley-of-fire.jpg" alt="Desert road with guardrails leading toward red mountains in Valley of Fire, Nevada" width="1920" height="2832" fetchpriority="high" style="width:100%;height:auto" />
  <figcaption>Guardrails keep you on the road, even when the terrain gets rough. Photo by <a href="https://unsplash.com/@bricecooper">Brice Cooper</a> on <a href="https://unsplash.com/photos/a-road-with-a-mountain-in-the-background-rZybLYQ7xTg">Unsplash</a>.</figcaption>
</figure>

<p>Claude Code writes working code. It passes tests, it runs, it does what you asked. Then it logs passwords in plaintext, skips input validation, serves pages that screen readers can’t parse, and deploys in a way that takes your site down for thirty seconds.</p>

<p>Tell it once, it listens. Next conversation, same mistakes.</p>

<p>I’ve spent over 2,000 hours iterating on my Claude Code setup. The single most impactful thing I did wasn’t clever prompting or complex hooks, it was writing down what I know about building production software in a way the model can actually use.</p>

<p>Not opinions. Evidence-based practices, grounded in standards: <a href="https://www.rfc-editor.org/rfc/rfc9457">RFC 9457</a> for error responses, <a href="https://www.w3.org/TR/WCAG22/">WCAG 2.2</a> for accessibility, <a href="https://owasp.org/www-project-top-ten/">OWASP Top 10</a> for security, <a href="https://www.dama.org/cpages/body-of-knowledge">DAMA-DMBOK</a> for data quality. Each practice traces to a concrete requirement, each requirement traces to a specification, and each specification traces to a library choice. When the model writes code, it doesn’t just know <em>what</em> to do, it knows <em>why</em>, and <em>with what</em>.</p>

<h2 id="the-problem-with-instructions">The Problem with Instructions</h2>

<p>You’ve probably tried the obvious approach: tell the LLM what to do.</p>

<p>“Always use parameterized queries.” “Handle errors properly.” “Make it accessible.”</p>

<p>This works once, maybe twice, but inevitably it drifts. The model is smart enough, but the instructions are vague, context-dependent, and easily outweighed by other things in the prompt. “Handle errors properly” means nothing without defining what “properly” looks like in your stack, with your patterns, for your use case. And you need to keep repeating this in every prompt, or it will revert to its old ways.</p>

<p>I went through the same cycle most people go through. First, I wrote instructions in the prompt, then in CLAUDE.md. Then I moved them to rules. Then I <a href="/ai/security/development/tools/2026/02/01/securing-claude-code-hooks-best-practices.html">added hooks to enforce them</a>. Each step helped, but the model kept finding new ways to cut corners I hadn’t explicitly forbidden. Then skills came along.</p>

<p>I also tried existing frameworks (when I started they did not really exist yet). <a href="https://github.com/bmadcode/BMAD-METHOD">BMAD</a>, spec-driven development, <a href="https://github.com/humanlayer/humanlayer">HumanLayer</a> (which I genuinely liked for its “thoughts” directory approach to project memory). But in practice, I found most of them too dogmatic. They impose a rigid process that doesn’t bend to the messy reality of actual projects, where sometimes you need to spike something quickly, sometimes you need deep planning, and the model needs to know the difference. What works is pragmatism: take the good ideas from each, discard the ceremony, and <a href="/ai/tools/productivity/2025/10/14/supercharge-claude-code-with-custom-configuration.html">build something that adapts to how you actually work</a>.</p>

<p>I am not explaining my full system in this series: that would take far more space, so maybe I will do that later. (I am now building a semi-automated system that takes these best practices and so far is actually able to write better code than I can, while maintaining a consistent level of quality and coherence.) But it remains a work in progress.</p>

<h2 id="the-system-requirements-specifications-best-practices">The System: Requirements, Specifications, Best Practices</h2>

<p>What I ended up building is a connected system of three layers:</p>

<p><strong>Requirements</strong> define <em>what</em> must be true. Each has an ID, a description, and a project phase (start, mvp, production). For example:</p>

<blockquote>
  <p><strong>REQ-API-001</strong>: Error responses must follow RFC 9457 (Problem Details for HTTP APIs) with a consistent structure: <code class="language-plaintext highlighter-rouge">type</code>, <code class="language-plaintext highlighter-rouge">title</code>, <code class="language-plaintext highlighter-rouge">status</code>, <code class="language-plaintext highlighter-rouge">detail</code>, and optional <code class="language-plaintext highlighter-rouge">instance</code> and extension fields. <em>(Phase: mvp)</em></p>
</blockquote>

<p><strong>Specifications</strong> define <em>how</em> to implement each requirement. They trace back to requirement IDs:</p>

<blockquote>
  <p><strong>Error format</strong>: RFC 9457 Problem Details. Content-Type: <code class="language-plaintext highlighter-rouge">application/problem+json</code>. Structure: <code class="language-plaintext highlighter-rouge">{ "type", "title", "status", "detail", "instance" }</code>. Use <code class="language-plaintext highlighter-rouge">type</code> as a stable URI for each error category. Add extension fields as needed. Never expose stack traces in production. <em>(Traces to: REQ-API-001)</em></p>
</blockquote>

<p><strong>Best practices</strong> provide the <em>deep knowledge</em>, the why, the trade-offs, the common mistakes, the patterns. They are what this series shares.</p>

<p>The traceability matters. When the model writes an error handler, it doesn’t just know “use RFC 9457”, it knows the requirement demands it, the specification defines the exact format, and the best practice file explains why generic errors are useless at 2 AM and how to add context that actually helps diagnose problems. And not to forget: the model’s extensive training data contains more about this standard than I will ever know; you just have to ‘trigger’ it to come forward from that vast amount of data.</p>

<p>This is part one of a two-part series. The files discussed here (and the rest) are available in a <a href="https://github.com/albertsikkema/claude-code-best-practices">public repository</a> that I’ll keep updating as I add more.</p>

<h2 id="doesnt-this-cost-a-lot-of-tokens">“Doesn’t This Cost a Lot of Tokens?”</h2>

<p>Yes. It does.</p>

<p>Loading best practice files, requirements, and specifications into context costs tokens. There’s no way around that. But the cost is manageable if you’re smart about <em>when</em> you load <em>what</em>.</p>

<p>I don’t dump all 18 files into every conversation. The full files are loaded during the steps that actually use them: planning and review. When the model is designing an approach or reviewing code against standards, it needs the deep knowledge. When it’s implementing a well-defined task from an approved plan, the plan itself already encodes the relevant practices, so the model doesn’t need to re-read the source material. And every now and then you find a little gem, where Claude Code starts correcting you based on your own best practices. One of mine says to leave no dead code in the repo. It pointed out that my commented-out code was not in accordance with that rule.</p>

<p>The alternative, not spending the tokens, is worse. Without this context, the model drifts. It makes its own architectural decisions, picks its own error format, skips validation it doesn’t know you care about. Then you spend tokens correcting it. And correcting the corrections. And explaining why the correction matters. And next conversation, you start over.</p>

<p>In the end it is simple math: pay upfront to load the model with your standards at the right moments, or pay afterwards, repeatedly, to fix the output when it inevitably diverges from what you need. The upfront cost is predictable and targeted. The correction cost is unpredictable and compounds, both in time and in money.</p>

<p>So yes, be selective: nothing more and nothing less than what is needed at that point. A frontend task doesn’t need the container security file. A database migration doesn’t need the accessibility rules. Load what’s relevant to the phase of work you’re in.</p>

<h2 id="what-this-post-covers">What This Post Covers</h2>

<p>In this first post: the foundational practices every project needs regardless of what you’re building.</p>

<ol>
  <li><a href="#1-layered-architecture">Layered Architecture</a>, how to structure code so Claude doesn’t write spaghetti</li>
  <li><a href="#2-error-handling">Error Handling</a>, turning “something went wrong” into actionable diagnostics</li>
  <li><a href="#3-testing-strategy">Testing Strategy</a>, tests that catch real bugs, not just verify mock wiring</li>
  <li><a href="#4-api-design">API Design</a>, consistent, predictable interfaces that follow standards</li>
  <li><a href="#5-data-integrity">Data Integrity</a>, because corrupt data is worse than downtime</li>
  <li><a href="#6-structured-logging">Structured Logging</a>, logs that are actually useful at 3 AM</li>
</ol>

<p>For each practice, I’ll show the problem it solves, a taste of the key rules, and how it connects back to the requirements and standards it’s built on. Most of this is not rocket science, and you might not agree with some choices, which is fine. Define your own.</p>

<hr />

<h2 id="1-layered-architecture">1. Layered Architecture</h2>

<p><strong>The problem</strong>: Without explicit guidance, Claude tends to put everything in one place. Business logic in the API handler. Database queries mixed with validation. HTTP status codes decided deep inside a service function. It works, until you need to test it, replace a dependency, or understand what the code does.</p>

<p><strong>The principle</strong>: Separate code into distinct layers with strict downward dependency: handler, service, repository, model. Each layer has one job and never reaches past its neighbour.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Handler / API Layer      -- owns the transport protocol
    |
Service / Business Logic -- owns the rules
    |
Repository / Data Access -- owns the queries
    |
Domain Model             -- owns the data shape
</code></pre></div></div>

<p><strong>Key rules from the file</strong>:</p>

<ul>
  <li>The handler validates input shape and formats responses. It does not contain business rules.</li>
  <li>The service layer orchestrates business logic and defines transaction boundaries. It does not know about HTTP status codes or request objects.</li>
  <li>The repository layer executes queries and maps results. It does not decide what data to return based on business rules.</li>
  <li>Dependencies go down only. A service never imports from a handler. A repository never calls a service.</li>
</ul>
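<p>A compact sketch of the four layers for a single “register user” operation. All class and function names are hypothetical, the repository is in-memory, and the handler returns a plain <code class="language-plaintext highlighter-rouge">(status, body)</code> tuple instead of using a web framework.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:  # domain model: owns the data shape
    email: str


class DuplicateEmail(Exception):
    """Domain error; knows nothing about HTTP."""


class InMemoryUserRepository:  # repository: owns the queries
    def __init__(self) -> None:
        self._users: dict[str, User] = {}

    def find_by_email(self, email: str):
        return self._users.get(email)

    def save(self, user: User) -> None:
        self._users[user.email] = user


class UserService:  # service: owns the rules
    def __init__(self, repo: InMemoryUserRepository) -> None:
        self._repo = repo

    def register(self, email: str) -> User:
        normalized = email.strip().lower()
        if self._repo.find_by_email(normalized) is not None:
            raise DuplicateEmail(normalized)
        user = User(email=normalized)
        self._repo.save(user)
        return user


def create_user_handler(service: UserService, body: dict):
    """Handler: validates input shape and maps results to HTTP."""
    email = body.get("email")
    if not isinstance(email, str) or "@" not in email:
        return 400, {"detail": "email is required"}
    try:
        user = service.register(email)
    except DuplicateEmail:
        return 409, {"detail": "email already registered"}
    return 201, {"email": user.email}
```

<p>The service can be tested by calling <code class="language-plaintext highlighter-rouge">register</code> directly, with no HTTP involved, and the only place a status code appears is the handler.</p>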

<p><strong>Why it matters for AI-generated code</strong>: When the model understands this separation, it stops making the most common architectural mistake: putting everything in the handler. It writes services you can test without spinning up an HTTP server. It writes repositories you can swap without rewriting business logic.</p>

<p><strong>Traces to</strong>: REQ-QUAL-005 (testable code), REQ-QUAL-006 (maintainable tests). Built on the principle that each module should be describable in one sentence.</p>

<h2 id="2-error-handling">2. Error Handling</h2>

<p><strong>The problem</strong>: Claude’s default error handling is either too aggressive (catch everything, return a generic message) or too lazy (let exceptions propagate without context). Both are bad. The first hides bugs. The second makes debugging impossible.</p>

<p><strong>The principle</strong>: Errors are not exceptional, they’re a normal part of program execution. Handle them explicitly at every layer, propagate them with context, translate them at boundaries, and never swallow them silently.</p>

<p><strong>Key rules from the file</strong>:</p>

<ul>
  <li><strong>Never swallow errors.</strong> <code class="language-plaintext highlighter-rouge">catch (e) { log(e) }</code> is not handling, it’s ignoring with a paper trail. The system continues in a corrupt state.</li>
  <li><strong>Add context when propagating.</strong> Each layer adds what it was doing. The final message reads like a stack of explanations: <code class="language-plaintext highlighter-rouge">"create order: charge payment: POST /payments: connection refused"</code>.</li>
  <li><strong>Translate at boundaries.</strong> A repository throws a database error. The service translates it to a domain error. The handler translates it to an HTTP response. Each layer speaks its own language.</li>
  <li><strong>Distinguish error types.</strong> Retriable (5xx, timeout) vs terminal (4xx, auth) vs corruption. Different types require different responses: retry, report to user, or alert on-call.</li>
</ul>
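<p>The rules above can be sketched in a few lines using exception chaining. The gateway call is a hypothetical stub that always fails, and the error classes and status mapping are my own assumptions; what matters is that each layer adds context, translates at its boundary, and keeps the original cause.</p>

```python
class RepositoryError(Exception):
    """Infrastructure-level failure."""


class PaymentFailed(Exception):
    """Domain-level failure the service exposes."""

    retriable = True


def post_payment(amount_cents: int) -> None:
    # Hypothetical gateway call that always fails, for illustration.
    raise ConnectionRefusedError("connection refused")


def charge_payment(amount_cents: int) -> None:
    try:
        post_payment(amount_cents)
    except OSError as exc:
        # Add context and keep the cause; do not swallow.
        raise RepositoryError(f"POST /payments: {exc}") from exc


def create_order(amount_cents: int) -> None:
    try:
        charge_payment(amount_cents)
    except RepositoryError as exc:
        # Translate to the domain's language at the layer boundary.
        raise PaymentFailed(f"create order: charge payment: {exc}") from exc


def handle_create_order(amount_cents: int, log=print):
    try:
        create_order(amount_cents)
    except PaymentFailed as exc:
        log(str(exc))  # full chain goes to the logs, not to the client
        status = 503 if exc.retriable else 422
        return status, {"title": "Payment failed", "status": status}
    return 201, {}
```

<p>The logged message reads as the stack of explanations described above, and <code class="language-plaintext highlighter-rouge">__cause__</code> still points at the original exception for anyone who needs the full traceback.</p>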

<p><strong>The standard</strong>: Error responses follow <a href="https://www.rfc-editor.org/rfc/rfc9457">RFC 9457</a> (Problem Details for HTTP APIs), a machine-parseable format with <code class="language-plaintext highlighter-rouge">type</code>, <code class="language-plaintext highlighter-rouge">title</code>, <code class="language-plaintext highlighter-rouge">status</code>, <code class="language-plaintext highlighter-rouge">detail</code>, and <code class="language-plaintext highlighter-rouge">instance</code> fields. This replaces the ad-hoc <code class="language-plaintext highlighter-rouge">{ "error": "something went wrong" }</code> that Claude defaults to.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://api.example.com/errors/insufficient-funds"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Insufficient Funds"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"status"</span><span class="p">:</span><span class="w"> </span><span class="mi">422</span><span class="p">,</span><span class="w">
  </span><span class="nl">"detail"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Account abc-123 has EUR 10.00, but the transaction requires EUR 25.00."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"instance"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/orders/order-456"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
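<p>A response like the one above can be produced by a small helper. The sketch below is framework-neutral and uses hypothetical function names; only the member names and the <code class="language-plaintext highlighter-rouge">application/problem+json</code> media type come from RFC 9457.</p>

```python
import json

PROBLEM_JSON = "application/problem+json"  # media type defined by RFC 9457


def problem_details(type_uri, title, status, detail=None, instance=None):
    """Build a problem object, leaving out optional members that are unset."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    return body


def problem_response(problem):
    """Return (status, headers, body) for whatever web framework is in use."""
    headers = {"Content-Type": PROBLEM_JSON}
    return problem["status"], headers, json.dumps(problem)
```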

<p><strong>Traces to</strong>: REQ-API-001 (RFC 9457 error format), REQ-API-002 (appropriate status codes), REQ-OBS-002 (errors logged with sufficient context).</p>

<h2 id="3-testing-strategy">3. Testing Strategy</h2>

<p><strong>The problem</strong>: Left to its own devices, Claude writes tests that test nothing. It mocks everything, asserts that mocked functions were called with the right arguments, and calls it a day. The tests pass. The code is broken. Nobody notices until production.</p>

<p><strong>The principle</strong>: Test behaviour, not implementation. The test pyramid (unit -&gt; integration -&gt; E2E) defines how many tests of each type to write, but the core rule is simpler: if everything is mocked, the test proves nothing.</p>

<p><strong>Key rules from the file</strong>:</p>

<ul>
  <li><strong>Every feature tests five things</strong>: happy path, validation errors, auth errors, downstream failures, and edge cases.</li>
  <li><strong>Mock at system boundaries</strong>, not internally. Mock the payment gateway, not the service that calls it. Your test should exercise the actual code path.</li>
  <li><strong>Name tests as specifications</strong>: <code class="language-plaintext highlighter-rouge">test_create_user_with_duplicate_email_returns_409</code> tells you exactly what broke without reading the test body.</li>
  <li><strong>Tests must be independent and parallelisable.</strong> No shared state, no ordering dependencies, no “run test A before test B.”</li>
  <li><strong>Coverage target</strong>: 80% overall, 70% minimum per module. Not as a vanity metric, but as a signal that error paths are tested.</li>
</ul>
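<p>As a rough illustration of “mock at system boundaries”, the sketch below fakes only the storage boundary and lets the real code path run. The store, the <code class="language-plaintext highlighter-rouge">create_user</code> function, and the status codes it returns are assumptions made for this example.</p>

```python
import unittest


class DuplicateEmail(Exception):
    pass


class InMemoryUserStore:
    """Fake for the system boundary (the database), not for the service."""

    def __init__(self):
        self.emails = set()

    def insert(self, email):
        if email in self.emails:
            raise DuplicateEmail(email)
        self.emails.add(email)


def create_user(store, email):
    """Code under test: normalises the email and returns an HTTP-style status."""
    try:
        store.insert(email.strip().lower())
    except DuplicateEmail:
        return 409
    return 201


class CreateUserTests(unittest.TestCase):
    def setUp(self):
        # Fresh store per test: no shared state, no ordering dependency.
        self.store = InMemoryUserStore()

    def test_create_user_with_new_email_returns_201(self):
        self.assertEqual(create_user(self.store, "a@example.com"), 201)

    def test_create_user_with_duplicate_email_returns_409(self):
        create_user(self.store, "a@example.com")
        self.assertEqual(create_user(self.store, "A@example.com "), 409)
```

<p>The duplicate test asserts on behaviour (the returned status), so it would still fail if the normalisation step were removed, which a test that only checks mock calls would not catch.</p>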

<p><strong>Why it matters for AI-generated code</strong>: When the model has this file, it stops writing tests that just verify mock wiring. It writes tests with real assertions against real behaviour. And when you ask it to add error handling, it also adds the test that verifies the error handling works. This ties into the broader <a href="/ai/llm/development/best-practices/2025/11/14/human-in-the-loop-ai-code-review.html">human-in-the-loop review</a> approach: the AI writes, you verify.</p>

<p><strong>Traces to</strong>: REQ-QUAL-005 (test happy and error paths), REQ-QUAL-006 (useful, maintainable tests), REQ-QUAL-003 (coverage thresholds), REQ-QUAL-007 (test framework bootstrap from day one).</p>

<h2 id="4-api-design">4. API Design</h2>

<p><strong>The problem</strong>: Claude builds APIs that work for the happy path but fall apart at the edges. No pagination. Inconsistent error formats. Stack traces in production error responses. Rate limiting that returns no headers so clients can’t self-throttle.</p>

<p><strong>The principle</strong>: An API is a contract. It should be consistent, predictable, and follow established standards so that clients (and future developers) can rely on its behaviour without reading the implementation.</p>

<p><strong>Key rules from the file</strong>:</p>

<ul>
  <li><strong>Standard HTTP status codes.</strong> Use the full vocabulary, not just 200 and 500: 201 (created), 204 (no content), 400 (bad request), 401 (unauthorized), 403 (forbidden), 404 (not found), 409 (conflict), 422 (unprocessable), 429 (rate limited).</li>
  <li><strong>RFC 9457 for all errors.</strong> Same format, every time, machine-parseable. The <code class="language-plaintext highlighter-rouge">type</code> field is a stable URI that clients can switch on.</li>
  <li><strong>Cursor-based pagination</strong> for large datasets. Offset pagination skips or repeats rows when records are inserted or deleted between page requests. Include <code class="language-plaintext highlighter-rouge">items</code>, <code class="language-plaintext highlighter-rouge">hasMore</code>, and <code class="language-plaintext highlighter-rouge">nextCursor</code> in every list response.</li>
  <li><strong>Rate limiting with standard headers.</strong> <code class="language-plaintext highlighter-rouge">RateLimit-Limit</code>, <code class="language-plaintext highlighter-rouge">RateLimit-Remaining</code>, <code class="language-plaintext highlighter-rouge">RateLimit-Reset</code> on every response (field names from the IETF rate-limit headers draft, widely used but not yet a finished RFC). 429 with <code class="language-plaintext highlighter-rouge">Retry-After</code> when exceeded. Clients shouldn’t have to guess.</li>
  <li><strong>Never expose internals.</strong> No stack traces, no SQL errors, no file paths in production responses. Log them server-side, return a clean error to the client.</li>
</ul>
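<p>To make the cursor-based pagination rule concrete, here is a minimal in-memory sketch. It assumes rows carry an increasing positive integer <code class="language-plaintext highlighter-rouge">id</code>; the cursor encoding and function names are illustrative choices, and a real implementation would push the filter into the query (<code class="language-plaintext highlighter-rouge">WHERE id &gt; :after ORDER BY id LIMIT :n</code>).</p>

```python
import base64
import json


def encode_cursor(last_id: int) -> str:
    # Opaque to clients: they pass it back without interpreting it.
    raw = json.dumps({"after": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def list_page(rows, cursor=None, limit=2):
    """rows: dicts sorted by an increasing positive integer 'id'."""
    after = decode_cursor(cursor) if cursor else 0
    # Fetch one extra row to learn whether another page exists.
    window = [r for r in rows if r["id"] > after][: limit + 1]
    items = window[:limit]
    has_more = len(window) > limit
    next_cursor = encode_cursor(items[-1]["id"]) if has_more else None
    return {"items": items, "hasMore": has_more, "nextCursor": next_cursor}
```

<p>Because the cursor records the last item seen rather than a position, deleting or inserting earlier rows between requests does not shift the next page.</p>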

<p><strong>Traces to</strong>: REQ-API-001 through REQ-API-004, REQ-SEC-009 (rate limiting), REQ-DOC-001 (OpenAPI documentation). The specification further mandates generating OpenAPI docs from code annotations and validating them in CI.</p>

<h2 id="5-data-integrity">5. Data Integrity</h2>

<p><strong>The problem</strong>: Claude writes code that works perfectly, until two requests arrive at the same time, or a payment fails after inventory was already deducted, or a migration drops a column while the old code is still running. Concurrency and partial failure are invisible in code reviews. They only surface in production.</p>

<p><strong>The principle</strong>: Data corruption is worse than downtime. A crashed server restarts in minutes. Corrupt data requires investigation, manual fixes, and sometimes can’t be recovered at all.</p>

<p><strong>Key rules from the file</strong>:</p>

<ul>
  <li><strong>Transactions for multi-step mutations.</strong> Create order, deduct inventory, charge payment, all in one transaction. If payment fails, everything rolls back.</li>
  <li><strong>Database constraints as the last line of defence.</strong> <code class="language-plaintext highlighter-rouge">NOT NULL</code>, <code class="language-plaintext highlighter-rouge">UNIQUE</code>, <code class="language-plaintext highlighter-rouge">FOREIGN KEY</code>, <code class="language-plaintext highlighter-rouge">CHECK</code> constraints. Application validation can have bugs. The database doesn’t lie.</li>
  <li><strong>Idempotency by design.</strong> Every operation that might be retried (webhooks, queue messages, API calls) must produce the same result when executed twice.</li>
  <li><strong>Race condition prevention.</strong> Optimistic locking (version column) for low-contention updates. Pessimistic locking (<code class="language-plaintext highlighter-rouge">SELECT ... FOR UPDATE</code>) for critical sections. <code class="language-plaintext highlighter-rouge">INSERT ... ON CONFLICT</code> instead of check-then-insert.</li>
  <li><strong>Expand-contract migrations.</strong> Never drop or rename a column in the same migration that adds its replacement. Add the new column, backfill, deploy code that uses it, then remove the old one.</li>
</ul>
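<p>Three of these rules (constraints, idempotency, optimistic locking) fit in one small sketch using Python’s built-in <code class="language-plaintext highlighter-rouge">sqlite3</code>. The table and function names are hypothetical, and SQLite stands in for whatever database the project actually uses.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock (
        sku TEXT PRIMARY KEY,
        qty INTEGER NOT NULL CHECK (qty >= 0),
        version INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE webhook_events (event_id TEXT PRIMARY KEY);
    INSERT INTO stock (sku, qty) VALUES ('widget', 1);
""")


def record_event_once(event_id: str) -> bool:
    """Idempotent insert: True only the first time an event is seen."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO webhook_events (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
    return cur.rowcount == 1


def deduct_stock(sku: str, expected_version: int) -> bool:
    """Optimistic lock: the update applies only if the version still matches."""
    with conn:
        cur = conn.execute(
            "UPDATE stock SET qty = qty - 1, version = version + 1 "
            "WHERE sku = ? AND version = ?",
            (sku, expected_version),
        )
    return cur.rowcount == 1
```

<p>A caller holding a stale version gets <code class="language-plaintext highlighter-rouge">False</code> and can re-read and retry, while an attempt to take stock below zero is rejected by the <code class="language-plaintext highlighter-rouge">CHECK</code> constraint even if application validation missed it.</p>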

<p><strong>The framework</strong>: Data quality is evaluated across eight dimensions from <a href="https://www.dama.org/cpages/body-of-knowledge">DAMA-DMBOK</a>: accuracy, completeness, consistency, integrity, reasonability, timeliness, uniqueness, and validity. These give you a vocabulary for discussing data issues.</p>

<p><strong>Traces to</strong>: REQ-DATA-001 (versioned migrations), REQ-DEPLOY-002 (expand-contract pattern), REQ-DEPLOY-003 (tested rollback scripts).</p>

<h2 id="6-structured-logging">6. Structured Logging</h2>

<p><strong>The problem</strong>: Claude’s default logging is <code class="language-plaintext highlighter-rouge">console.log("user created")</code> or <code class="language-plaintext highlighter-rouge">logger.info(f"Processing order {order_id}")</code>. String interpolation, no structure, no context. Useless in production where you need to filter, aggregate, and correlate across services.</p>

<p><strong>The principle</strong>: Logs are structured data, not formatted strings. Every log entry should be a set of key-value pairs that machines can parse and humans can read.</p>

<p><strong>Key rules from the file</strong>:</p>

<ul>
  <li><strong>Structured fields, not string interpolation.</strong> <code class="language-plaintext highlighter-rouge">logger.info("order_created", user_id=user.id, order_id=order.id, total=order.total)</code>, not <code class="language-plaintext highlighter-rouge">logger.info(f"Created order {order.id} for user {user.id}")</code>.</li>
  <li><strong>Consistent field names across the codebase.</strong> <code class="language-plaintext highlighter-rouge">user_id</code>, <code class="language-plaintext highlighter-rouge">request_id</code>, <code class="language-plaintext highlighter-rouge">session_id</code>, <code class="language-plaintext highlighter-rouge">error_type</code>, <code class="language-plaintext highlighter-rouge">duration_ms</code>, <code class="language-plaintext highlighter-rouge">operation</code>. Pick names once and stick with them.</li>
  <li><strong>Never log secrets.</strong> Not passwords, not tokens, not API keys. Log their presence: <code class="language-plaintext highlighter-rouge">api_key_present=true</code>, not the value.</li>
  <li><strong>Log at system boundaries.</strong> Request received, request completed, outbound call started, outbound call finished, job started, job completed. Not inside tight loops.</li>
  <li><strong>Severity levels mean something.</strong> DEBUG for developer-only detail, INFO for expected events, WARN for unexpected-but-handled situations, ERROR for failures requiring investigation. Don’t log everything as INFO.</li>
  <li><strong>Correlation IDs.</strong> Generate a request ID at the entry point, propagate it through all downstream calls. Every log line includes it. When something breaks, you can trace the entire request path.</li>
</ul>
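<p>The examples in the list use a keyword-argument style such as the one offered by structlog. The same ideas can be sketched with only the standard library: a JSON formatter plus a context variable for the correlation ID. The logger name, the <code class="language-plaintext highlighter-rouge">fields</code> key, and the helper functions are assumptions for this example.</p>

```python
import contextvars
import io
import json
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Structured fields arrive via logging's `extra` mechanism.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream):
    logger = logging.getLogger("app.structured_demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, user_id, order_id):
    # Generate the correlation ID once, at the entry point.
    request_id_var.set(str(uuid.uuid4()))
    logger.info("request_received", extra={"fields": {"operation": "create_order"}})
    logger.info("order_created", extra={"fields": {"user_id": user_id, "order_id": order_id}})
```

<p>Every line emitted while handling the request carries the same <code class="language-plaintext highlighter-rouge">request_id</code>, so a log query on that one field reconstructs the request path.</p>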

<p><strong>Traces to</strong>: REQ-OBS-001 (structured JSON logs with request context), REQ-OBS-002 (errors logged with diagnostic context), REQ-OBS-003 (correlation IDs propagated through downstream calls).</p>

<hr />

<h2 id="the-connection-between-layers">The Connection Between Layers</h2>

<p>These six practices don’t exist in isolation. They reinforce each other:</p>

<ul>
  <li><strong>Layered architecture</strong> creates the boundaries where <strong>error handling</strong> translates errors between layers.</li>
  <li><strong>Error handling</strong> defines the error format that <strong>API design</strong> exposes to clients.</li>
  <li><strong>Testing strategy</strong> verifies all of the above, and is made possible by the clean separation that <strong>layered architecture</strong> provides.</li>
  <li><strong>Data integrity</strong> protects the database layer that sits at the bottom of the architecture.</li>
  <li><strong>Structured logging</strong> observes what happens across all layers, with the correlation IDs that <strong>API design</strong> generates at the entry point.</li>
</ul>

<p>And all of them trace back to requirements with IDs, specifications with implementation details, and a tech stack where every library choice is justified. The model doesn’t just follow rules, it understands a system.</p>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The full files for all six practices discussed here, plus twelve more covering security, resilience, accessibility, deployment, and more, are available in the <a href="https://github.com/albertsikkema/claude-code-best-practices">public repository</a>.</p>

<p>You can use them as-is by dropping them into a <code class="language-plaintext highlighter-rouge">best_practices/</code> directory that your Claude Code setup references, or adapt them to your own stack and standards. The format is simple: a principle, a “why” section, core rules with code examples, and common mistakes. The model picks them up without any special configuration; they just need to be part of the context.</p>

<p><strong>Coming up next:</strong></p>

<ul>
  <li><strong>Part 2: Security, Privacy, Operations, Accessibility, SEO, and Integration Patterns</strong> – the remaining 12 files covering authorization, validation, containers, privacy, resilience, deployment, observability, background jobs, accessibility, SEO, robots/scraping, and LLM integration.</li>
</ul>

<hr />

<p><em>Have questions or want to share how you keep AI-generated code in line? <a href="#" onclick="task1(); return false;">Get in touch</a>.</em></p>

<h2 id="standards-and-references">Standards and References</h2>

<ul>
  <li><a href="https://www.rfc-editor.org/rfc/rfc9457">RFC 9457 - Problem Details for HTTP APIs</a></li>
  <li><a href="https://www.w3.org/TR/WCAG22/">WCAG 2.2 - Web Content Accessibility Guidelines</a></li>
  <li><a href="https://owasp.org/www-project-top-ten/">OWASP Top 10</a></li>
  <li><a href="https://www.dama.org/cpages/body-of-knowledge">DAMA-DMBOK - Data Management Body of Knowledge</a></li>
  <li><a href="https://semver.org/">Semantic Versioning 2.0.0</a></li>
  <li><a href="https://keepachangelog.com/">Keep a Changelog 1.1.0</a></li>
  <li><a href="https://github.com/opencontainers/image-spec">OCI Image Specification</a></li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><category term="best-practices" /><summary type="html"><![CDATA[How structured, evidence-based best practice files keep Claude Code from cutting corners. Part 1 of 2: architecture, error handling, testing, API design, data integrity, and structured logging.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/evidence-based-best-practices-ai-guardrails-blog.png" /><media:content medium="image" url="https://www.albertsikkema.com/evidence-based-best-practices-ai-guardrails-blog.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Nearly half of all bus stops in the Netherlands are inaccessible, so I built a tool to do something about it</title><link href="https://www.albertsikkema.com/accessibility/open-source/civic-tech/2026/02/24/bushalte-toegankelijkheid-nederland.html" rel="alternate" type="text/html" title="Nearly half of all bus stops in the Netherlands are inaccessible, so I built a tool to do something about it" /><published>2026-02-24T00:00:00+00:00</published><updated>2026-02-24T00:00:00+00:00</updated><id>https://www.albertsikkema.com/accessibility/open-source/civic-tech/2026/02/24/bushalte-toegankelijkheid-nederland</id><content type="html" xml:base="https://www.albertsikkema.com/accessibility/open-source/civic-tech/2026/02/24/bushalte-toegankelijkheid-nederland.html"><![CDATA[<figure>
  <img src="/assets/images/busstop-fail.jpg" alt="An inaccessible bus stop in the Netherlands: no raised platform, no tactile guide lines" />
  <figcaption>A bus stop a few kilometres from my house. No raised platform and no tactile guide lines, which makes it inaccessible for many travellers.</figcaption>
</figure>

<p><em>This post was originally written in Dutch, a first for this blog. It covers a topic specific to the Netherlands: I built an open-source tool that maps all 20,277 inaccessible bus stops in the country and lets citizens email the responsible authority with a legally grounded request for improvements.</em></p>

<hr />

<p>Today the <a href="https://nos.nl/artikel/2603791-veel-bushaltes-niet-toegankelijk-voor-mensen-met-een-beperking">NOS</a> published something that came as a surprise to many people: roughly half of Dutch bus stops are barely accessible, or not accessible at all, to people with a disability. Six in ten stops lack adequate facilities for blind and partially sighted people. Nearly half are poorly equipped for wheelchair users. In some municipalities, more than 90% of stops are inaccessible.</p>

<p>In recent years I have worked intensively on accessibility, among other things as a developer of the <a href="https://bba.nl/">Beter Bereikbaar Applicatie - BBA</a>, and this confirms all the stories I had heard. The data is publicly available, but the share of bus stops in the Netherlands that are hard to reach, roughly half, is higher than I had expected. So I looked at the nice maps, and then I thought: now what? The average person will see the number and either shrug or simply take note of it. What are you supposed to do with it, after all? So then I thought: what can I do with it? Wouldn’t it be great to have a map that shows this data and lets you bring it to the attention of the responsible authority (often a municipality, sometimes a province, and very occasionally a water board)? So I built something for it.</p>

<h2 id="de-onzichtbare-data">The invisible data</h2>

<p>The <a href="https://dova.nu">Centraal Haltebestand</a> (the central stop register), maintained by DOVA, the partnership of public transport authorities, contains detailed information about every bus stop in the Netherlands. For each stop it records how high the kerb is, how wide the platform is, whether there are tactile guide lines, and whether the stop has obstacles. All of that data is public.</p>

<p>But “public” is not the same as “visible”. The data sits in the <a href="https://halteviewer.ov-data.nl">Halteviewer</a>, a tool for professionals. You have to know it exists, and you have to know how to search it. For a council member or citizen who wants to know how many stops in their municipality are not up to standard, that is a dead end.</p>

<h2 id="20277-haltes-die-niet-voldoen">20,277 stops that fall short</h2>

<p>I wrote a data pipeline that fetches all stop data and checks it against the <a href="https://www.crow.nl">CROW standards</a> for accessibility. A stop fails if the kerb is lower than 18 centimetres, the platform is narrower than 1.50 metres, there are no tactile guide lines, or there is no obstacle-free walking route.</p>
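<p>The four criteria translate into a simple check. The sketch below is in Python for illustration only (the actual pipeline is written in Node.js), and the field names are hypothetical rather than the real Centraal Haltebestand schema.</p>

```python
from dataclasses import dataclass

MIN_KERB_HEIGHT_CM = 18.0
MIN_PLATFORM_WIDTH_M = 1.50


@dataclass
class BusStop:
    # Field names are hypothetical, not the Centraal Haltebestand schema.
    kerb_height_cm: float
    platform_width_m: float
    has_guide_lines: bool
    has_obstacle_free_route: bool


def failed_criteria(stop: BusStop) -> list[str]:
    """Return the criteria a stop fails; an empty list means it complies."""
    failures = []
    if MIN_KERB_HEIGHT_CM > stop.kerb_height_cm:
        failures.append("kerb_height")
    if MIN_PLATFORM_WIDTH_M > stop.platform_width_m:
        failures.append("platform_width")
    if not stop.has_guide_lines:
        failures.append("guide_lines")
    if not stop.has_obstacle_free_route:
        failures.append("obstacle_free_route")
    return failures
```

<p>Returning the list of failed criteria, rather than a single boolean, keeps enough detail to show per stop what is wrong.</p>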

<p>The figures:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Active bus stops in the Netherlands</td>
      <td>42,068</td>
    </tr>
    <tr>
      <td>Does not meet CROW standards</td>
      <td>20,277 (48%)</td>
    </tr>
    <tr>
      <td>Responsible road authorities</td>
      <td>384</td>
    </tr>
  </tbody>
</table>

<p>Those 384 road authorities comprise 345 municipalities, 12 provinces, 5 water boards, 7 Rijkswaterstaat offices, and a handful of private parties. Each is separately responsible for its own stops.</p>
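<p>As a quick sanity check on these figures, the arithmetic can be reproduced directly from the numbers in the table and the paragraph above (the variable names are mine):</p>

```python
active_stops = 42_068
non_compliant = 20_277
share = non_compliant / active_stops

named_authorities = {
    "municipalities": 345,
    "provinces": 12,
    "water boards": 5,
    "Rijkswaterstaat offices": 7,
}
# Whatever is left of the 384 road authorities must be the private parties.
private_parties = 384 - sum(named_authorities.values())

print(f"{share:.1%} of active stops fall short")  # 48.2%
print(f"{private_parties} private parties")       # 15
```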

<p>The road authorities with the most inaccessible stops:</p>

<ul>
  <li><strong>Rotterdam</strong>: 416</li>
  <li><strong>Province of Overijssel</strong>: 397</li>
  <li><strong>Amsterdam</strong>: 393</li>
  <li><strong>Province of Gelderland</strong>: 367</li>
  <li><strong>Province of Drenthe</strong>: 268</li>
</ul>

<h2 id="de-tool-toegankelijke-bushaltes">The tool: Accessible Bus Stops</h2>

<p><a href="https://nietmetdebus.nl/"><strong>Niet met de bus?</strong></a> (“Not taking the bus?”) is an interactive map that shows all 20,277 inaccessible stops, grouped by road authority. In the sidebar you click your municipality (or province, water board, and so on) and immediately see which stops fall short. You can zoom in, click on stops, and see which ones do not meet the requirements. And most importantly: with one click you can generate an email to the responsible road authority.</p>

<figure>
  <a href="https://nietmetdebus.nl/"><img src="/assets/images/nietmetdebus-screenshot.jpg" alt="Screenshot of nietmetdebus.nl: an interactive map of inaccessible bus stops in the Netherlands" /></a>
  <figcaption>The interactive map at nietmetdebus.nl shows all inaccessible bus stops per road authority.</figcaption>
</figure>

<h2 id="de-e-mail-goed-onderbouwd-klaar-om-te-versturen">The email: well substantiated, ready to send</h2>

<p>The generated email is no vague request. It contains:</p>

<ul>
  <li>The exact number of inaccessible stops under that road authority</li>
  <li>A reference to the <strong>UN Convention on the Rights of Persons with Disabilities</strong> (articles 9 and 20), which the Netherlands ratified in 2016</li>
  <li>A reference to the <strong>Wet gelijke behandeling op grond van handicap of chronische ziekte</strong> (the Dutch Act on Equal Treatment on the Grounds of Disability or Chronic Illness, Wgbh/cz, articles 2 and 3)</li>
  <li>A reference to the <strong>Bestuursakkoord Toegankelijkheid OV 2022-2032</strong> (the administrative agreement on public transport accessibility), in which governments themselves agreed to make all stops accessible</li>
</ul>
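<p>One plausible way to generate such an email with a single click is a <code class="language-plaintext highlighter-rouge">mailto:</code> link with a pre-filled subject and body. The sketch below is an assumption about the mechanism, not the tool’s actual code: the function name, recipient address, and English wording are all illustrative.</p>

```python
from urllib.parse import quote


def build_mailto(recipient: str, authority: str, stop_count: int) -> str:
    """Build a mailto: URL with a pre-filled, percent-encoded subject and body."""
    subject = f"Request to make bus stops accessible in {authority}"
    body = "\r\n".join([  # mailto bodies use CRLF line breaks
        f"{authority} is responsible for {stop_count} inaccessible bus stops.",
        "I ask you to bring them up to standard, with reference to:",
        "- UN Convention on the Rights of Persons with Disabilities, articles 9 and 20",
        "- Wgbh/cz, articles 2 and 3",
        "- Bestuursakkoord Toegankelijkheid OV 2022-2032",
    ])
    # safe="" encodes every reserved character, including "/" and "&".
    return f"mailto:{recipient}?subject={quote(subject, safe='')}&body={quote(body, safe='')}"
```

<p>Because the link only opens the user’s own mail client, the sender can still edit the text before sending, which matches the workflow described next.</p>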

<p>You do not need to be a lawyer. You do not need to be an expert in public transport legislation. You click, adjust the email to say it your way, and send it. That is it.</p>

<h2 id="waarom-dit-ertoe-doet">Why this matters</h2>

<p>Peter Waalboer, an advocate for people with disabilities, put it aptly in the NOS article: <em>“Public transport is a public service. It must be accessible to everyone, and that is not up for discussion.”</em></p>

<p>Absolutely true, but reality is stubborn: municipalities have limited budgets and adapting bus stops costs money; a single stop can easily cost thousands of euros. Subsidies from public transport authorities exist, and the Bestuursakkoord sets out ambitions, but compliance is voluntary. Without pressure from residents, “accessibility” easily slides to the bottom of the priority list.</p>

<p>What is missing is not legislation or good intentions. It is visibility. If a council member does not know that 60% of the stops in her municipality fall short, she will not ask about it. If a villager does not know that his stop is unsuitable for his neighbour in a wheelchair, he misses the signal. Data that is invisible does not lead to action.</p>

<p>This tool makes that data visible and actionable. In a few clicks you can see what is going on and address the responsible party with a legally grounded request.</p>

<h2 id="open-source-voor-iedereen">Open source, for everyone</h2>

<p>The tool is fully open source. The source code is on <a href="https://github.com/albertsikkema/niet-toegankelijke-bushaltes">GitHub</a>. Technically it is deliberately simple: a data pipeline in Node.js that fetches the DOVA and Allmanak data, and a static frontend in vanilla HTML, CSS and JavaScript, with a Leaflet map and marker clustering. No frameworks and no build steps. The data can be refreshed by running the pipeline again.</p>

<h2 id="de-timing-gemeenteraadsverkiezingen-op-18-maart">The timing: municipal elections on 18 March</h2>

<p>The municipal elections take place on 18 March 2026. That makes this the moment to act. Council candidates and sitting politicians are especially receptive to signals from residents right now. Send that email now, before the elections. Ask your local parties what they will do about the inaccessible stops in your municipality. Accessibility belongs in every election manifesto, not as a footnote but as a priority.</p>

<h2 id="wat-kun-jij-doen">What can you do?</h2>

<ol>
  <li><strong>Go to <a href="https://nietmetdebus.nl/">nietmetdebus.nl</a></strong> and look up your own municipality</li>
  <li><strong>See which stops fall short</strong>; it may be the one around the corner from you</li>
  <li><strong>Generate an email</strong> and send it to your road authority, preferably before 18 March</li>
  <li><strong>Share the tool</strong> with your municipal council, your local advocacy organisation, your neighbours</li>
  <li><strong>Raise the issue</strong> at election debates and public consultation evenings in your municipality</li>
  <li><strong>Have suggestions or want to contribute?</strong> Open an issue on <a href="https://github.com/albertsikkema/niet-toegankelijke-bushaltes">GitHub</a></li>
</ol>

<p>Accessibility is not a favour. It is a right. And sometimes change starts with a simple email.</p>]]></content><author><name>Albert Sikkema</name></author><category term="accessibility" /><category term="open-source" /><category term="civic-tech" /><summary type="html"><![CDATA[An open-source tool that makes all 20,277 inaccessible bus stops in the Netherlands visible and helps citizens take action.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/busstop-fail.jpg" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/busstop-fail.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Let the AI Pick React</title><link href="https://www.albertsikkema.com/ai/development/2026/02/13/let-the-ai-pick-react.html" rel="alternate" type="text/html" title="Let the AI Pick React" /><published>2026-02-13T00:00:00+00:00</published><updated>2026-02-13T00:00:00+00:00</updated><id>https://www.albertsikkema.com/ai/development/2026/02/13/let-the-ai-pick-react</id><content type="html" xml:base="https://www.albertsikkema.com/ai/development/2026/02/13/let-the-ai-pick-react.html"><![CDATA[<figure>
  <img src="/assets/images/ai-react-convergence.jpg" alt="Aerial view of highway lanes converging into a single interchange, representing framework standardisation" />
  <figcaption>Photo by <a href="https://unsplash.com/@dnevozhai">Denys Nevozhai</a> on <a href="https://unsplash.com">Unsplash</a></figcaption>
</figure>

<p>There is a self-reinforcing cycle forming in frontend development, and it is making people nervous.</p>

<p>React has the most code in LLM training data. LLMs therefore <a href="https://www.200oksolutions.com/blog/github-copilot-vs-chatgpt-vs-claude-frontend/">generate better React code</a> than anything else. More React code gets written — by both humans and AI — feeding future training data. Repeat. The flywheel spins, and React’s dominance compounds with every prompt.</p>

<p>The evidence is hard to miss: give an LLM a vague prompt like “build me a web app” and you will almost invariably get React + Tailwind + shadcn/ui. Tools like <a href="https://v0.dev">v0</a>, <a href="https://lovable.dev">Lovable</a>, and <a href="https://bolt.new">Bolt.new</a> all default to this stack. v0 technically supports Vue and Svelte, but by Vercel’s own admission it “really works best using React, Tailwind and shadcn/ui.”</p>

<p>The usual reaction to this is concern. It has been called <a href="https://maximilian-schwarzmueller.com/articles/the-problem-with-the-default-ai-stack/">the problem with the default AI stack</a>. Others warn about stifled innovation, outdated patterns being perpetuated at scale, and a knowledge barrier for newcomers who never discover alternatives because the AI never suggests them.</p>

<p>I see it differently.</p>

<h2 id="the-fragmentation-problem-nobody-talks-about">The Fragmentation Problem Nobody Talks About</h2>

<p>Frontend development has been drowning in choice for a long time. The list runs through React, Next, Vue, Svelte, Solid, Angular, Qwik, Astro, Lit, Preact, Marko, Alpine, and htmx, with many more besides, and that is just the frameworks. Each comes with its own ecosystem of state management libraries, routing solutions, meta-frameworks, and component libraries. Every combination produces a slightly different mental model, a different set of conventions, a different way to do the same thing.</p>

<p>This fragmentation has real costs. Teams spend weeks evaluating frameworks. Developers switching jobs need ramp-up time to learn the local flavour. Hiring becomes framework-specific. Knowledge sharing across projects is harder than it should be. The industry has been paying a quiet tax on all this optionality, and for what? For 95% of use cases, any of these frameworks would do the job just fine.</p>

<p>What AI is doing — accidentally, through the cold logic of training data statistics — is pushing the community toward standardisation. And standardisation, when the standard is good enough, is not a loss. It is a relief.</p>

<h2 id="good-enough-wins">Good Enough Wins</h2>

<p>React is not the best framework. I will say that plainly. If I had to write code by hand — really sit down and build components line by line — I would pick Svelte. It is cleaner, less verbose, and gives me a better overview of what is happening. The developer experience is genuinely superior when you are the one typing.</p>

<p>But I am not the one typing.</p>

<p>I wrote about this shift in my post on <a href="/2026/02/05/vibe-coding-quality-democratisation.html">vibe coding, product quality and democratisation</a>: the value equation has changed. When AI generates 80-90% of the code, my personal preference for a framework’s syntax becomes almost irrelevant. What matters is whether the AI can produce correct, functional code — and right now, it produces better React code than anything else. That is not ideology. It is a measurable quality gap rooted in training data volume.</p>

<p>React is not the best. But it is good enough for the vast majority of what gets built. And “good enough + excellent AI support” beats “technically superior + mediocre AI support” in every practical scenario I can think of.</p>

<p>I learned this first-hand some time ago, when I used a new Svelte version with GitHub Copilot before the model’s training data included that version. Having to re-instruct the LLM every time was not a fun experience.</p>

<h2 id="the-time-argument">The Time Argument</h2>

<p>Every hour I do not spend fighting an AI tool’s weaker output in a less-supported framework is an hour I can spend on what actually matters: the product, the user experience, the business logic, the security model.</p>

<p>The cost savings are real. When Lovable or Claude Code can scaffold a working application in half an hour using React, the overhead of choosing a different framework — debugging AI-generated code that is slightly off, filling in gaps where training data is thin, manually correcting patterns the model has not seen enough of — becomes a luxury most projects cannot justify.</p>

<p>This is the argument that makes the monoculture concerns less relevant for most teams: time saved is money saved. And time is the <a href="https://www.youtube.com/watch?v=AR9hMvlOZCo">final currency</a>.</p>

<h2 id="when-i-would-not-do-this">When I Would Not Do This</h2>

<p>I am not saying React is the answer to everything. There are clear cases where another approach is justified:</p>

<p><strong>Security-critical applications.</strong> When a project demands the highest level of security assurance, I want to understand every line of code. AI-generated code — in any framework — adds a layer of uncertainty that might be unacceptable. In those cases, the framework choice should serve the security model, not the AI tooling.</p>

<p><strong>Performance as a hard requirement.</strong> If a client needs the absolute smallest bundle size or the fastest possible rendering, Svelte or Solid or plain JavaScript will outperform React. When performance is a specification, not a nice-to-have, the technical choice should win over the AI convenience.</p>

<p><strong>Simplicity as a constraint.</strong> Some projects need to be small, understandable, and maintainable by non-specialists. A simple static site does not need React’s complexity. The right tool here might be vanilla JavaScript, Alpine, or something deliberately minimal.</p>

<p>These are the 5% cases. They exist, they matter, and they require deliberate technical choices. But they are the exception, not the rule.</p>

<h2 id="what-about-innovation">What About Innovation?</h2>

<p>The strongest counterargument is that a React monoculture stifles innovation. If future LLMs are trained mostly on React, the reasoning goes, new frameworks will never gain enough traction to compete.</p>

<p>Here is how I see it: we have not actually seen real innovation in frontend frameworks for a long time. Frontend is complicated — genuinely, deeply complicated. And so far, none of the alternatives have found a definitive answer. They are variations. <a href="https://svelte.dev/">Svelte’s</a> compile step, <a href="https://www.solidjs.com/">Solid’s</a> fine-grained reactivity, <a href="https://astro.build/">Astro’s</a> island architecture — these are smart ideas, well-built tools, and genuine improvements in specific areas. But they are also steps back in others. They are not the next paradigm shift. They are refinements.</p>

<p>Meanwhile, the industry runs on a multi-year cycle that keeps repackaging older ideas under new names. Server-side rendering <a href="https://daily.dev/blog/server-side-rendering-renaissance">is back</a>. Signals — <a href="https://www.builder.io/blog/history-of-reactivity">called observables in Knockout.js back in 2010</a> — are back. The pendulum swings, and we call it progress.</p>

<p>A lot of developers see their framework of choice as real innovation. I understand that attachment; I feel it with Svelte. But in the grand scheme of things, these are variations on the same fundamental approach to building UIs. If something truly new comes along, something that genuinely changes how we think about frontend development, it will break through regardless of what LLMs default to. That kind of innovation does not need training data momentum. It needs to be undeniably better.</p>

<p>Until that happens, we are better off accepting what we have and building with it.</p>

<h2 id="the-accidental-standard">The Accidental Standard</h2>

<p>React did not plan this advantage. No committee decided it should be the AI default. It happened because React was the most popular framework when the training data was collected — a decade of documentation, tutorials, Stack Overflow answers, and open-source projects.</p>

<p>But planned or not, it gives developers a common language. It gives teams a safe default. It gives non-developers building their first app through vibe coding a foundation that actually works. And it gives the rest of us more time to spend on what we are actually building instead of debating what to build it with.</p>

<p>React, apparently, is not that bad.</p>

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://maximilian-schwarzmueller.com/articles/the-problem-with-the-default-ai-stack/">The Problem with the Default AI Stack</a> — Maximilian Schwarzmüller</li>
  <li><a href="https://www.200oksolutions.com/blog/github-copilot-vs-chatgpt-vs-claude-frontend/">GitHub Copilot vs ChatGPT vs Claude for Frontend Development</a> — 200ok Solutions</li>
  <li><a href="https://www.techradar.com/pro/best-vibe-coding-tools">Best Vibe Coding Tools</a> — TechRadar</li>
  <li><a href="https://thealphaspot.com/articles/is-react-still-the-best-choice-in-2025/">Is React Still the Best Choice in 2025?</a> — The Alpha Spot</li>
  <li><a href="https://www.smashingmagazine.com/2025/01/svelte-5-future-frameworks-chat-rich-harris/">Svelte 5 and the Future of Frameworks: A Chat with Rich Harris</a> — Smashing Magazine</li>
  <li><a href="https://thenewstack.io/dhh-on-ai-vibe-coding-and-the-future-of-programming/">DHH on AI, Vibe Coding, and the Future of Programming</a> — The New Stack</li>
</ul>

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li><a href="/2026/02/05/vibe-coding-quality-democratisation.html">Vibe Coding: Product Quality and Democratisation</a> — my earlier post on vibe coding and when personal tools become products</li>
  <li><a href="/2026/02/01/securing-claude-code-hooks.html">Securing YOLO Mode: How I Stop Claude Code from Nuking My System</a> — on guardrails for AI-assisted development</li>
</ul>]]></content><author><name>Albert Sikkema</name></author><category term="ai" /><category term="development" /><summary type="html"><![CDATA[The AI-React reinforcement loop is creating a monoculture. That might be exactly what frontend development needs.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.albertsikkema.com/assets/images/ai-react-convergence.jpg" /><media:content medium="image" url="https://www.albertsikkema.com/assets/images/ai-react-convergence.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>