The GSD Framework: How to Make AI Agents Actually Ship

The Morphllm study shows the same model scores 17 benchmark problems apart based on scaffolding alone. The GSD Framework makes AI agents reliable through spec-driven development, verification gates, and persistent state.

The agent that actually ships is not the one running the best model. According to a March 2026 study by Morphllm testing 15 AI coding agents, the same underlying model scores 17 benchmark problems apart depending on which agent framework wraps it. Scaffolding is the variable. The model is almost incidental.

The GSD (Get Shit Done) framework for AI agents is the clearest articulation of this idea we have seen. Developed by Lex Christopherson and popularized by the AI Labs video "GSD Is the Missing Piece For Claude Code," GSD is not another opinionated AI coding tool. It is a scaffolding philosophy built on a simple premise: agents fail at the workflow level, not the intelligence level.

This post breaks down the GSD framework, why it works, what production agent pipelines actually need, and what we have learned running more than 25 autonomous cron agents at Context Studios every single day.


Why Most Agent Pipelines Fail

The Morphllm data is worth sitting with. Claude Code scores 80.9% on SWE-bench Verified when paired with the right scaffolding and orchestration. The same base model in a poorly structured agent pipeline can score 17 points lower on the same benchmark set. That gap is not a model gap. It is an architecture gap.

Most agent pipelines die from the same cluster of problems:

Context window chaos. Without explicit state management, the agent accumulates noise across turns. By the time it reaches the critical step — the file edit, the API call, the database migration — it has lost the original spec in a sea of intermediate reasoning.

Ambiguous success criteria. An agent with vague instructions will satisfy the letter of the prompt while missing the spirit entirely. "Update the authentication flow" leads to unpredictable behavior. "Add a check for expired tokens in auth/middleware.ts, return 401 if expired, run the auth test suite, confirm all 47 tests pass" does not.

No fail-fast loops. Human developers check their work constantly. Agents, left without explicit verification steps, will hallucinate a success and move on. The bug ships.

Missing recovery paths. When an agent hits a wall — timeout, tool error, unexpected output — there is no playbook. The pipeline either hangs silently or crashes without saving partial work.

GSD addresses all four.


What the GSD Framework Is

GSD is a spec-driven development system built entirely on top of Claude Code's native capabilities. It has accumulated more than 23,000 GitHub stars. The entire system runs on approximately 50 Markdown files, a Node.js CLI helper, and two hooks. No proprietary runtime. No framework lock-in.

The system is organized around six slash commands that map to a linear workflow:

  • /gsd:new-project — captures the idea and generates specs
  • /gsd:discuss-phase — clarifies requirements before any code is written
  • /gsd:plan-phase — breaks work into atomic, executable tasks
  • /gsd:execute-phase — runs those tasks, with parallelism where possible
  • /gsd:verify-work — validates output against explicit success criteria
  • /gsd:complete-milestone — archives state and releases

Behind these commands sit 29 skills, 12 custom agents, and 2 hooks. Each command is a Markdown file with YAML frontmatter and XML-tagged sections. The XML tagging is not cosmetic: Claude's training treats XML boundaries as structural signals, which makes the model measurably more reliable at following multi-step instructions. This is one of the insights Greg Isenberg's "Building AI Agents That Actually Work" course covers — structured prompts are not about being polite to the model, they are about giving the model a parsing surface it can reason about reliably.
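To make that concrete, here is a hypothetical sketch of what such a command file could look like. The frontmatter fields, XML tag names, and file paths below are assumptions for illustration only, not GSD's actual source:

```markdown
---
name: gsd:plan-phase
description: Break the current phase into atomic, executable tasks
allowed-tools: Read, Write, Bash
---

<objective>
Read .planning/requirements.md and produce a numbered task list where
every task names a file target and a verifiable success criterion.
</objective>

<constraints>
- Do not write application code during this phase.
- Every task must be completable within a single context window.
</constraints>

<success_criteria>
.planning/plan.md exists and each task lists the files it touches.
</success_criteria>
```

The point is the shape, not the specific tags: YAML frontmatter carries machine-readable metadata, and each XML section gives the model an unambiguous boundary to reason against.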

GSD targets Claude Code, OpenCode, and Gemini CLI equally. The scaffolding philosophy transfers across agents.


The Four GSD Principles

The framework's production value comes from four architectural decisions that any agent system can adopt.

1. Spec before execution. Every GSD workflow starts with a machine-readable specification. The spec is not a natural-language description of a goal. It is a structured document with constraints, success criteria, and tool permissions. The /gsd:discuss-phase command exists specifically to surface ambiguity before a single line of code runs. If the spec is unclear, the plan phase will fail fast rather than producing unpredictable output.

2. Parallel research, sequential synthesis. During the research phase, four agents run simultaneously — each investigating a different dimension of the problem: tech stack, features, architecture, edge cases. Each writes results to a separate file. A synthesizer agent then processes those results sequentially, and a roadmapper produces the final plan. This fan-out/fan-in pattern caps the context window load on any single agent while extracting the benefits of parallelism. The same pattern appears in Claude Code's multi-agent PR analysis, which evaluates multiple pull request dimensions in parallel.
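The fan-out/fan-in shape is easy to sketch outside any particular framework. A minimal Python version follows — the research function is a stub standing in for a model call, and the file layout is illustrative, not GSD's:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DIMENSIONS = ["stack", "features", "architecture", "edge_cases"]

def research(dimension: str, outdir: Path) -> Path:
    """Stub for one research agent; writes findings to its own file.
    In a real pipeline, the model call would happen here."""
    path = outdir / f"{dimension}.md"
    path.write_text(f"# Findings: {dimension}\n")
    return path

def synthesize(paths: list[Path]) -> str:
    """Sequential fan-in: one synthesizer reads every report in order,
    so no single agent ever holds all four investigations in context."""
    return "\n".join(p.read_text() for p in sorted(paths))

def run_research(outdir: Path) -> str:
    """Fan out four researchers in parallel, then fan in sequentially."""
    outdir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        paths = list(pool.map(lambda d: research(d, outdir), DIMENSIONS))
    return synthesize(paths)
```

Writing each agent's output to its own file is the load-bearing decision: it is what lets the fan-in step consume results one at a time instead of holding everything in one window.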

3. Persistent state across context resets. Any Claude Code user knows the pain of losing context mid-project. GSD solves this by writing everything to a .planning/ directory: project definition, requirements, roadmap, and current state. Git commits happen after each step. A /gsd:resume-work command reads that state back in after a reset. The agent never starts from zero.
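The persistence pattern itself takes only a few lines. A hedged sketch — the file name and state fields are illustrative, not GSD's actual layout:

```python
import json
from pathlib import Path

STATE_FILE = Path(".planning") / "state.json"  # illustrative path

def save_state(state: dict, state_file: Path = STATE_FILE) -> None:
    """Checkpoint after each verified step, so work survives a reset."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state, indent=2))

def resume_state(state_file: Path = STATE_FILE) -> dict:
    """What a resume command would read back in; fresh project if empty."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"phase": 0, "completed_tasks": []}
```

Pair each `save_state` call with a git commit and every checkpoint becomes both recoverable and auditable.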

4. Deterministic steps handled by scripts. Bash scripts handle tasks where exact reliability matters: checking file existence, calculating phase numbers, validating test counts. As the GSD architecture makes explicit, these are tasks where a shell script is simply more reliable than asking an LLM to reason about it. This boundary — LLM for judgment, script for determinism — is one of the most important design decisions in a production agent system.
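GSD uses shell scripts for this boundary; the same idea in Python form, with illustrative file names, looks like this:

```python
import re
from pathlib import Path

def check_phase_files(planning_dir: Path, phase: int) -> list[str]:
    """Deterministic gate: verify required artifacts exist before the
    LLM is even asked to proceed. Returns the list of missing files.
    (File names here are assumptions for illustration.)"""
    required = [f"phase-{phase}-plan.md", f"phase-{phase}-tasks.md"]
    return [name for name in required if not (planning_dir / name).exists()]

def count_passing_tests(test_output: str) -> int:
    """Parse a test-runner summary like '47 passed' exactly, instead of
    asking the model to eyeball the log and hallucinate a pass count."""
    match = re.search(r"(\d+) passed", test_output)
    return int(match.group(1)) if match else 0
```

Nothing here requires judgment, so nothing here should cost tokens or risk a hallucinated answer.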


Production Reality: What We Have Learned From 25+ Daily Agents

Context Studios runs more than 25 autonomous agents on cron schedules every day. Intel gathering, SEO audits, content pipelines, social engagement rounds. We have been doing this long enough to accumulate failure patterns that the GSD documentation does not cover yet, because they only surface at production scale.

Timeout management is a discipline. We raised the timeouts on five separate cron jobs this week because agents were consistently hitting execution limits. A well-designed agent pipeline needs explicit timeout budgets at each phase: research phase gets N seconds, execution phase gets M seconds, verification phase gets P seconds. Without phase-level budgets, the agent silently burns the entire window on the first step and errors at the worst possible moment. Think of timeout management like memory allocation — you have to budget it explicitly.
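A minimal sketch of phase-level budgeting, assuming the orchestrator can poll a budget object between tool calls (the class and its interface are our own illustration, not part of GSD):

```python
import time

class PhaseBudget:
    """Split a total execution window into explicit per-phase budgets,
    so one slow phase cannot silently consume the whole run."""

    def __init__(self, budgets: dict[str, float]):
        self.budgets = budgets          # phase name -> seconds allowed
        self.phase: str | None = None
        self.deadline: float = 0.0

    def start(self, phase: str) -> None:
        """Begin a phase; its clock starts now."""
        self.phase = phase
        self.deadline = time.monotonic() + self.budgets[phase]

    def remaining(self) -> float:
        """Seconds left in the current phase's budget."""
        return self.deadline - time.monotonic()

    def check(self) -> None:
        """Call between steps; fail loudly the moment a budget is blown."""
        if self.remaining() <= 0:
            raise TimeoutError(f"phase '{self.phase}' exceeded its budget")
```

The orchestrator calls `check()` between tool invocations, which converts a silent end-of-window failure into an early, attributable error naming the phase that overran.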

The Edit-on-large-files anti-pattern. We had two consecutive pipeline failures traced to the same root cause: an agent using file-edit tools on large files without reading them first. The agent assumed the file structure from context, applied a targeted edit, and corrupted the file. Now our pipelines include an explicit read-before-edit constraint in the scaffolding instructions. This is a case where the scaffolding should protect the agent from its own overconfidence.
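The read-before-edit constraint can be enforced mechanically rather than relying on the prompt. A sketch of a guard layer the scaffolding could wrap around the edit tool (our own illustration, not GSD's implementation):

```python
from pathlib import Path

class EditGuard:
    """Scaffolding-level constraint: refuse to apply an edit to any file
    the agent has not actually read during this session."""

    def __init__(self):
        self.read_files: set[Path] = set()

    def record_read(self, path: Path) -> str:
        """Route all file reads through here so reads are tracked."""
        self.read_files.add(path.resolve())
        return path.read_text()

    def apply_edit(self, path: Path, old: str, new: str) -> None:
        """Targeted edit, but only on files that have been read first,
        and only if the target text actually matches the file."""
        if path.resolve() not in self.read_files:
            raise PermissionError(f"read {path} before editing it")
        text = path.read_text()
        if old not in text:
            raise ValueError("edit target not found; file differs from assumption")
        path.write_text(text.replace(old, new, 1))
```

Rejecting an edit whose target string is absent catches the exact failure mode we hit: an agent editing a structure it assumed rather than one it observed.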

Browser pre-flight checks are not optional. Several of our automation agents depend on a browser session. A dead browser is a silent failure — the agent runs for 20 minutes, reports success, and has actually done nothing. Adding a browser health check as the first step of any browser-dependent pipeline eliminates this entire class of failure. The same principle applies to any external dependency: check first, execute second.
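Pre-flight checks generalize to any dependency. A minimal sketch — the check names are hypothetical, and a real browser check would ping the actual session:

```python
from typing import Callable

def preflight(checks: dict[str, Callable[[], bool]]) -> None:
    """Run every dependency check before the pipeline starts; fail
    loudly with the full list of dead dependencies, instead of letting
    the agent run for 20 minutes against a dead session."""
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        raise RuntimeError(f"pre-flight failed: {', '.join(failures)}")
```

Running every check before raising (rather than stopping at the first failure) means one pipeline run reports everything that is down, not just the first casualty.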

Self-healing over manual recovery. We build recovery paths directly into our agent scaffolding. If the primary path fails, there is a fallback. If the fallback fails, the partial result is saved and flagged for review rather than discarded. This is not complex to implement — it is mostly a matter of making the scaffolding explicit about what success at each step requires, and what to do when it does not occur. Claude Code /loop takes a similar approach at the session level: persistent state across resets so autonomous work survives context windows.
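The recovery ladder is simple to make explicit in code. A hedged sketch of the shape (the flag-file name and the error-only partial record are our illustration — a real pipeline would save whatever intermediate artifacts exist):

```python
import json
from pathlib import Path
from typing import Callable, Optional

def run_with_recovery(primary: Callable, fallback: Callable,
                      review_dir: Path) -> Optional[object]:
    """Try the primary path, then the fallback; if both fail, persist
    what we know and flag the run for human review instead of
    discarding it."""
    errors = []
    for attempt in (primary, fallback):
        try:
            return attempt()
        except Exception as exc:
            errors.append(str(exc))
    review_dir.mkdir(parents=True, exist_ok=True)
    (review_dir / "needs_review.json").write_text(json.dumps({"errors": errors}))
    return None
```

The key property is that the worst case is a flagged artifact on disk, never a silent hang or a lost run.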

The compounding effect of these patterns is measurable. Our "3 tools that changed how we code with AI agents" post covers some of the tooling side; this post is about the discipline side. Both matter.


Building Your Own GSD-Style Scaffolding

You do not need to adopt GSD wholesale. The principles are separable.

Start by writing explicit success criteria for every agent task. Not "write tests for the auth module" but "write tests for the auth module, achieving ≥ 90% coverage on auth/middleware.ts, with all existing tests still passing." This single change eliminates the largest class of agent failures.

Then add verification gates. After each major step, the agent should check that the expected output exists and matches the spec. This is what GSD's /gsd:verify-work command automates. You can implement it with a simple post-execution prompt: "Read the output of the previous step and confirm it satisfies the following criteria: [criteria list]. If any criteria fail, stop and report what is missing."
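If you would rather not spend a model call on the gate, the file-level half of it can be a plain function. A sketch, with illustrative criteria:

```python
from pathlib import Path

def verify_output(path: Path, must_contain: list[str]) -> list[str]:
    """Gate after a step: confirm the expected artifact exists and
    satisfies each criterion; return human-readable failures so the
    pipeline can stop and report what is missing."""
    if not path.exists():
        return [f"missing output file: {path}"]
    text = path.read_text()
    return [f"criterion not met: {s!r}" for s in must_contain if s not in text]
```

Deterministic checks like existence and required content go through this function; only judgment calls ("does this satisfy the spirit of the spec?") go back to the model.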

Then add state persistence. After each verified step, write the current state to disk. Use git commits as checkpoints. If the session resets, the agent can read back its own state and resume without losing work.

Finally, draw the LLM/script boundary explicitly. Anything that needs to be exactly right goes in a script. Anything that requires judgment goes to the model. The boundary should be visible in your scaffolding files.

The Morphllm data confirms what this architecture predicts: the agent system that structures these decisions correctly will outperform the one running a nominally stronger model but with weak scaffolding. As GPT-5.4's computer use capabilities show, even the most capable models need the right execution environment to deliver on their potential.


FAQ

What is the GSD Framework for AI agents? GSD (Get Shit Done) is a spec-driven development system built on top of Claude Code's native capabilities. It uses 50 Markdown files, 6 slash commands, and 2 hooks to orchestrate a complete development workflow from idea to shipped code. No custom runtime required.

Why does scaffolding matter more than the AI model? The Morphllm study found the same model scores 17 benchmark problems apart depending on how the agent is scaffolded. Scaffolding determines how the model receives tasks, manages context, verifies output, and recovers from failures. The model provides intelligence; scaffolding provides the structure that makes that intelligence reliable and directed.

What are the most common reasons AI agent pipelines fail in production? The four main causes are: (1) context window exhaustion from poor state management, (2) vague success criteria that lead to plausible-but-wrong outputs, (3) missing verification steps that let failures propagate silently, and (4) no recovery paths when tools error or timeouts occur. GSD's architecture addresses all four.

How does GSD handle context window resets? GSD writes all project state to a .planning/ directory and commits after each step. A /gsd:resume-work command reads that state back in after a context reset. The agent recovers its own context without human intervention.

Does the GSD Framework work with models other than Claude? Yes. The GSD architecture targets Claude Code, OpenCode, and Gemini CLI equally. The scaffolding principles — spec-first, parallel research, persistent state, LLM/script boundary — transfer to any agent system regardless of the underlying model.

How many concurrent agents can a GSD-style pipeline run? GSD's research phase uses four parallel agents by default, with a synthesizer consuming their outputs sequentially. The practical ceiling depends on your tool configuration and budget, but the fan-out/fan-in pattern itself is scalable and can be extended. Our own cron-based pipelines run 25+ agents daily with independent schedules and isolated contexts.


The Production Standard Has Shifted

The Morphllm benchmark spread — 17 problems, same model, different scaffolding — is the clearest evidence available that the quality bar for agent work is no longer about which model you use. It is about how well you define the work, constrain the execution, verify the output, and recover from failure.

GSD is one implementation of that insight. The principles behind it are available to any builder willing to write explicit specs, add verification gates, and persist state across resets.

The agents that ship reliably are the ones with clear success criteria. That is true of human engineers, too.


Sources: Morphllm AI Coding Agent Study 2026 · GSD Framework Workflow Analysis · GSD Framework for Claude Code (Medium) · Greg Isenberg: Building AI Agents That Actually Work · AI Labs: GSD Is the Missing Piece For Claude Code
