Building an Agentic Team for an Open Source Project with Claude Code
- Steve Scargall
- AI
- April 14, 2026
A core engineer on MemMachine — the one who owned the Semantic Memory subsystem — left the project. The codebase didn’t grow any less complex overnight, but the human attention available to maintain it did. That’s a familiar shape of problem in any open source project, and it’s the exact shape where a well-designed Claude Code agent team earns its keep.
This post documents what I built: a 22-agent maintenance team that lives entirely inside MemMachine’s repository, coordinates via Claude Code’s experimental Agent Teams runtime, and operates under a design I can reproduce for any existing repository with real code. The agents don’t push code, don’t sign commits, don’t merge pull requests, and don’t cut releases — humans still gatekeep every consequential action. What the agents do do is the tedious and error-prone middle of software maintenance: triage, spec drafting, implementation, QA, security review, docs, dependency and upstream tracking.
Everything lives under .claude/ and is gitignored. None of it leaks to GitHub. Other contributors working the repo without the agent team see exactly the same project they saw before.
The Situation
MemMachine is a long-term memory layer for AI agents — open source, published at docs.memmachine.ai, with a multi-package UV monorepo:
- packages/server/ — FastAPI + FastMCP, houses the semantic_memory and episodic_memory subsystems
- packages/{client,common,ts-client,meta}/ — Python and TypeScript SDKs
- integrations/ — nine framework bridges (crewai, langchain, langgraph, dify, fastgpt, n8n, …) plus an OpenClaw plugin
- examples/ — runnable demos
- evaluation/ — retrieval and episodic-memory evaluation suites, not currently in CI
- src/memmachine/ — a legacy layout that still partially exists
- 16 GitHub Actions workflows covering pytest (Python 3.12–3.14), integration tests, lint, type check, OpenAPI generation, Docker builds, SBOM, lock-file checks, publish jobs, and docs checks
The semantic-memory subsystem — the orphaned one — had zero tests under its test tree. Keeping it from silently regressing was going to be someone’s full-time job, but nobody had a full-time slot for it.
What I Wanted
Concrete operator goals, written down before I built anything:
- Weekly releases stay on track.
- Every open issue and PR gets reviewed — stale ones closed, actionable ones scoped, outdated ones flagged.
- Semantic memory gets tests before it regresses silently.
- Examples and integrations stay working against the current release.
- OpenClaw tracking keeps pace with its frequent upstream API changes.
- I approve everything. No agent pushes, merges, publishes, or signs.
And a design constraint that turned out to be load-bearing: everything the agent team depends on lives inside .claude/ and is gitignored. Other maintainers and contributors don’t see it, don’t have to run it, and aren’t slowed down by it. If the agent team infrastructure goes away tomorrow, the project is untouched.
Why Agent Teams, Not Subagents
Claude Code offers two ways to parallelize work: subagents (spawned from the main session, report back when done) and Agent Teams (separate Claude Code instances that coordinate via a shared task list and mailbox, currently experimental behind CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1).
For this project I chose Agent Teams for three reasons:
- Teammates can talk to each other. When dev-python and semantic-memory-architect need to negotiate where a fix belongs, they can exchange messages directly instead of routing every question back through me.
- Independent context windows. Each teammate loads only what its role needs. The context for qa running the CI mirror doesn’t pollute the context for architect reviewing a spec.
- The human stays the lead. The session I’m typing in is the fixed team lead for its lifetime. I can’t accidentally promote a teammate into a position that pushes code.
Agent Teams is explicitly experimental and has limits — /resume doesn’t restore teammates, you can’t nest teams, and token cost scales linearly with teammate count. None of those are dealbreakers for a maintenance workflow where teams are ephemeral and teammate counts stay small.
The Design Principles
Five principles drove every subsequent decision.
Local and gitignored. .claude/agents/, .claude/skills/, .claude/commands/, .claude/specs/, .claude/todos/, .claude/memory/, .claude/hooks/, and CLAUDE.md at repo root are all added to .gitignore. CLAUDE.md sits at the repo root (not under .claude/) specifically because Agent Teams auto-loads CLAUDE.md from a teammate’s working directory — moving it would break the auto-load.
Humans hold the pen. Agents stage files and write commit messages, but commits are signed by me because signing requires my password. Agents never run git commit, git push, gh pr create, or gh release create. They hand me ready-to-run commands and PR body text.
Karpathy guardrails, enforced via skill. The four principles — Think Before Coding, Simplicity First, Surgical Changes, Goal-Driven Execution, documented by Andrej Karpathy for LLM-assisted coding — go into a single karpathy-guidelines skill that every code-touching agent loads at spawn. No agent writes code without running through the four gates.
Spec-driven by default. Any task that touches more than two files, changes a public API, or touches the memory subsystems requires a markdown spec at .claude/specs/<issue>.md before code is written. Typos and one-liners skip this. The spec has fixed sections: Problem, Scope, Non-goals, Approach, Acceptance criteria, Test plan.
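For concreteness, a minimal spec template matching those fixed sections might look like this (the filename and placeholder text are my own illustration, not the project's exact template):

```markdown
<!-- .claude/specs/<issue>.md -->
# Spec: <issue title>

## Problem
What is broken or missing, with links to the issue thread.

## Scope
Files and packages this change may touch.

## Non-goals
Explicitly out of scope, to keep the change surgical.

## Approach
The chosen design and the alternatives rejected.

## Acceptance criteria
Observable conditions under which the task counts as done.

## Test plan
New and existing tests that must pass, and how to run them.
```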
Advisor pattern for expensive reasoning. The architect role runs Claude Opus; every other role runs Claude Sonnet. Teammates consult the architect via mailbox for hard calls — tiebreaking disagreements, reviewing cross-cutting specs, weighing design tradeoffs — without every teammate paying Opus rates for routine work. This mirrors the advisor pattern described in Anthropic’s advisor blog post, which showed a 2.7 pp SWE-bench improvement and an 11.9% cost reduction versus Sonnet-only agents.
The Architecture
Human (operator) = team LEAD (fixed)
│ spawns ad-hoc teammates per task from .claude/agents/*.md
│ shared task list at ~/.claude/tasks/<team>/
│ team config auto-written at ~/.claude/teams/<team>/config.json
│ mailbox for teammate↔teammate messages
└─ root CLAUDE.md auto-loaded into every teammate
I never pre-author team configs — Agent Teams writes them automatically at spawn time. What I do pre-author is the role library under .claude/agents/*.md. Each file defines one teammate type (its system prompt, its write scope, its model, its tools). Claude Code spawns whichever roles the current task needs.
The Roster
Twenty-two agents after one round of refinement. Each has a strict write scope — a list of paths it may edit. Agents refuse edits outside that scope, which keeps dev-python from wandering into semantic_memory/ and the memory architects from stepping on each other.
Coordination
- pm — decomposes issues into tasks, proposes milestones, no code access
- architect (Opus) — cross-package tradeoffs, advisor to other teammates, spec review, disagreement tiebreaker
- gh-triage — reads issues and PRs, classifies them, drafts labels/milestones/comments, powers the /mm:team <github-url> auto-assembly feature described below
Code specialists
- semantic-memory-architect — owns the orphaned subsystem; first priority is backfilling missing tests
- episodic-memory-architect — the symmetric owner of episodic memory
- server-architect — FastAPI + FastMCP surface, routing, middleware, OpenAPI, Alembic migrations, settings
- dev-python — general Python glue (client, common, meta)
- dev-typescript — packages/ts-client
Maintenance
- integrations-maintainer — tracks nine integrations against upstream framework changes
- openclaw-tracker — polls OpenClaw releases, diffs the API surface, drafts adapter updates
- examples-maintainer — verifies every example runs against the latest release
- benchmark — runs the orphaned evaluation/ suites and flags regressions
- docs-maintainer — covered in detail below
- upstream-sync — compares fork to upstream, drafts a merge/rebase plan
Quality gates
- qa — runs the local CI mirror (every workflow under .github/workflows/)
- security — diff review, pip-audit, npm audit, SBOM, Dependabot triage
Imported reviewer agents
Six PR-review sub-agents imported from anthropics/claude-plugins-official: code-reviewer, code-simplifier, comment-analyzer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer.
Each agent is 50–120 lines of markdown with YAML frontmatter (name, description, model, tools) followed by its system prompt.
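For illustration, a role file in that shape might start like this (the field values are hypothetical and the exact frontmatter keys Claude Code accepts are defined by its sub-agent documentation, not by this sketch):

```markdown
---
name: dev-typescript
description: Maintains packages/ts-client; refuses edits outside that path.
model: sonnet
tools: Read, Edit, Bash
---

You are the TypeScript specialist for this repository.
Write scope: packages/ts-client/** only. If a task needs edits
elsewhere, message the lead rather than editing.
Load the karpathy-guidelines skill before writing any code.
```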
The Quality Gate
Claude Code’s Agent Teams runtime exposes three hooks I wired directly into settings.local.json:
- TaskCreated — appends one line to .claude/audit.log
- TeammateIdle — logs when a teammate stops without output
- TaskCompleted — blocks completion on code-touching tasks unless qa and security have both written SignOff <qa|security> <task-id> lines to the audit log
That last one is the teeth of the whole system. An agent can’t mark a code-touching task “done” until qa has run the local CI mirror green and security has cleared the diff. The hook greps literally for tab-separated sign-off lines in .claude/audit.log; the format is defined in the audit-log skill and every sign-off agent follows it exactly.
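A minimal sketch of that TaskCompleted check, assuming tab-separated lines of the shape SignOff&lt;TAB&gt;&lt;role&gt;&lt;TAB&gt;&lt;task-id&gt; (the authoritative format lives in the audit-log skill, so treat the parsing here as illustrative):

```python
from pathlib import Path

REQUIRED_ROLES = {"qa", "security"}

def signed_off(audit_log: Path, task_id: str) -> bool:
    """True if every required role wrote a SignOff line for task_id.

    Assumed line shape (tab-separated): SignOff\t<role>\t<task-id>
    """
    if not audit_log.exists():
        return False
    seen = set()
    for line in audit_log.read_text().splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0] == "SignOff" and parts[2] == task_id:
            seen.add(parts[1])
    return REQUIRED_ROLES <= seen
```

The real hook would exit non-zero when this returns False, which is how a Claude Code hook vetoes the action it is gating.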
There’s also a PreToolUse hook — imported from the security-guidance plugin — that fires before every Edit, Write, or MultiEdit and warns about risky edit patterns (command injection in GitHub Actions workflows, unsafe eval, unvalidated input).
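The wiring in settings.local.json follows Claude Code's hooks schema (an event name mapping to matcher/command entries). A hedged sketch: the teams event names come from the list above, the script paths are hypothetical, and the exact shape of the experimental events should be checked against the current Claude Code docs.

```json
{
  "env": { "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1" },
  "hooks": {
    "TaskCompleted": [
      { "hooks": [{ "type": "command", "command": ".claude/hooks/task-completed.py" }] }
    ],
    "PreToolUse": [
      {
        "matcher": "Edit|Write|MultiEdit",
        "hooks": [{ "type": "command", "command": ".claude/hooks/pre-edit-guard.sh" }]
      }
    ]
  }
}
```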
The /mm: Command Surface
Twenty-two slash commands, all namespaced with /mm: so they never collide with general-use Claude Code commands. The namespace comes from storing them at .claude/commands/mm/<name>.md.
Daily drivers:
- /mm:bootstrap — verifies Claude CLI version, uv, Node, Python, Docker, gh, git signing config
- /mm:status — current branch, open TODOs, last QA run, active team, weekly token tally per agent (parsed from the audit log)
- /mm:triage — gh-triage sweeps all open issues and PRs, produces a classification digest at .claude/todos/triage-digest-<date>.md
- /mm:team <github-url> — auto-assembles a team proposal from a GitHub issue URL (described below)
- /mm:assemble <issue> — confirm and spawn the proposed team
- /mm:spawn <agent> "<prompt>" — single teammate, scoped mandate
- /mm:qa [branch|files] — run the local CI mirror
- /mm:security-scan — static review + dependency audit
- /mm:docs-sync — run the docs-maintainer sweep
Plus specialist commands for examples-check, integrations-check, openclaw-sync, upstream-sync, bench, audit-tail, todo-list, cadence, memory, and two imported ones: review-pr and revise-claude-md.
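A command file in that layout is just markdown; a hypothetical sketch of what .claude/commands/mm/qa.md could contain (the frontmatter keys follow Claude Code's custom slash-command format, but the body here is invented, not the project's actual file):

```markdown
---
description: Run the local CI mirror on a branch or file list
argument-hint: "[branch|files]"
---

Run the run-ci-locally skill against $ARGUMENTS.
Report each workflow's pass/fail status and append the result
to .claude/audit.log in the audit-log skill's format.
```

Because the file lives under commands/mm/, it surfaces as /mm:qa, which is all the namespacing amounts to.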
The Two Innovations Worth Stealing
Two design choices here are worth lifting for any similar project.
1. GitHub-URL-Driven Team Assembly
/mm:team https://github.com/MemMachine/MemMachine/issues/1318 does this:
- gh-triage pulls the issue and its thread via gh issue view.
- It applies a role-selection rubric encoded in its system prompt, mapping touched paths and labels to roles:
  - packages/server/src/memmachine_server/semantic_memory/** → semantic-memory-architect
  - packages/ts-client/** → dev-typescript
  - integrations/openclaw/** → openclaw-tracker
  - labels security or dependency → security
  - any code change at all → qa and security added by default
- It drafts a team proposal — which roles, what each will do, proposed branch name, whether a spec is required — and writes it as the opening section of .claude/specs/<n>.md.
- It prints the proposal to me for confirmation.
- When I run /mm:assemble <n>, the lead spawns each listed teammate with a scoped prompt that constrains them to their declared write scope.
The rubric is just prose in the agent’s markdown file. Updating it is one edit, not a code change. The proposal is always a draft I approve — the agent never spawns unilaterally.
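Although the rubric stays prose for the agent, its effect is plain glob-to-role matching. An equivalent sketch in Python, with the patterns copied from the rubric above (a real implementation might prefer path-aware matching over fnmatch, which happily lets * cross directory separators):

```python
from fnmatch import fnmatch

# Path globs and label rules copied from the rubric; qa/security are defaults.
PATH_RULES = [
    ("packages/server/src/memmachine_server/semantic_memory/**",
     "semantic-memory-architect"),
    ("packages/ts-client/**", "dev-typescript"),
    ("integrations/openclaw/**", "openclaw-tracker"),
]
LABEL_RULES = {"security": "security", "dependency": "security"}

def propose_roles(touched_paths: list, labels: list) -> list:
    """Map an issue's touched paths and labels to a sorted role list."""
    roles = set()
    for path in touched_paths:
        for pattern, role in PATH_RULES:
            if fnmatch(path, pattern):
                roles.add(role)
    roles.update(LABEL_RULES[l] for l in labels if l in LABEL_RULES)
    if touched_paths:  # any code change at all pulls in both gates
        roles.update({"qa", "security"})
    return sorted(roles)
```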
2. The todo-docs- Handoff
Every code-touching agent (dev-python, dev-typescript, server-architect, memory architects, integrations-maintainer, openclaw-tracker) emits .claude/todos/todo-docs-<slug>.md whenever its change affects user-facing behavior — public HTTP routes, SDK methods, settings, docker-compose services, OpenAPI shape, anything described in docs/core_concepts/.
docs-maintainer scans for that filename prefix as its work queue. When a coding team finishes a PR, I can either include docs-maintainer in the team or run /mm:docs-sync afterward; either way, the docs update happens in a clean branch, builds cleanly with the Mintlify CLI (mint build + mint broken-links), passes docs-checks.yml + generate-openapi.yml locally, and lands as a draft PR for the human documentation writer to approve and merge.
The pattern generalizes: any role that needs work from another role can queue it via a predictable-prefix TODO file, and the receiving role scans that prefix on its next invocation. It’s a dumb, grep-able queue — no service, no broker, no schema. Works.
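Both ends of that queue fit in a few lines; a sketch with the todo-docs- prefix from above (the slug convention and file body are my assumptions, not the project's exact shapes):

```python
from pathlib import Path

TODO_DIR = Path(".claude/todos")

def queue_docs_todo(slug: str, summary: str) -> Path:
    """Producer side: a coding agent drops a todo-docs-<slug>.md file."""
    TODO_DIR.mkdir(parents=True, exist_ok=True)
    path = TODO_DIR / f"todo-docs-{slug}.md"
    path.write_text(f"# Docs update needed: {slug}\n\n{summary}\n")
    return path

def docs_work_queue() -> list:
    """Consumer side: docs-maintainer scans the prefix on each invocation."""
    return sorted(TODO_DIR.glob("todo-docs-*.md"))
```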
The Phased Build
I executed the build in six phases. I’m listing them both to document what got built and to give anyone reproducing this a sequence that works.
Phase 0 — Substrate. Add .claude/ and CLAUDE.md to .gitignore. Enable CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in settings.local.json. Verify Claude Code version ≥ 2.1.32. Create directory skeleton. Write the three hook scripts and wire them into settings.
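The skeleton-plus-gitignore part of Phase 0 is mechanical; a sketch under my own simplifications (directory names come from the design-principles section, and I ignore .claude/ wholesale rather than per subdirectory):

```python
from pathlib import Path

# Subdirectories named in the design principles; commands/mm gives the namespace.
CLAUDE_DIRS = ["agents", "skills", "commands/mm", "specs", "todos", "memory", "hooks"]

def create_skeleton(repo_root: Path) -> None:
    """Create the local-only .claude/ tree and extend .gitignore, idempotently."""
    for name in CLAUDE_DIRS:
        (repo_root / ".claude" / name).mkdir(parents=True, exist_ok=True)
    gitignore = repo_root / ".gitignore"
    existing = gitignore.read_text() if gitignore.exists() else ""
    for entry in (".claude/", "CLAUDE.md"):
        if entry not in existing:
            existing += f"\n{entry}"
    gitignore.write_text(existing.lstrip("\n") + "\n")
```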
Phase 1 — Foundations. Write the five skills every agent references: karpathy-guidelines, audit-log, spec-driven, branch-and-commit, todo-as-issue. Write CLAUDE.md at repo root and .claude/README.md operator guide.
Phase 2 — Triage + gates. Write pm, architect, gh-triage, qa, security. Write run-ci-locally and dep-audit skills. Write the 12 daily-driver slash commands.
Phase 3 — Core dev roles. Write the two memory architects, server-architect, dev-python, dev-typescript.
Phase 4 — Maintenance roles. Write the six maintenance agents and their skills (openclaw-diff, example-runner, backcompat-check, eval-runner) and eight commands.
Phase 5 — Vetted imports. Walk three third-party repositories — affaan-m/everything-claude-code, anthropics/skills, anthropics/claude-plugins-official — and evaluate each candidate individually. Eleven items imported after per-item review, with a provenance manifest at .claude/imported/README.md documenting what came from where, when, and with what adaptations.
Phase 6 — Validation. Run /mm:triage to produce the first real issue + PR digest. Walk one real issue end-to-end through the team to validate the pipeline.
The final counts: 22 agents, 19 skills, 22 slash commands, 4 hooks, 11 seeded TODOs. Only .gitignore became a tracked change. Everything else stays local.
Can This Be Reproduced for Any Repository?
Yes, with a caveat: it works for repositories that already have code. A greenfield repository doesn’t have paths to route TODOs to, workflows to mirror locally, or a release surface to protect.
What travels: the architecture (lead + teammate roles, write scopes, hook-enforced gates, TODO-driven handoffs, /mm: namespace), the principles (local-only, spec-driven, Karpathy guardrails, advisor pattern), the phased build sequence, and a surprising amount of the foundational skills (karpathy-guidelines, audit-log, spec-driven, branch-and-commit, todo-as-issue are all repo-agnostic).
What has to be re-derived per repository: the roster of specialists (MemMachine needed semantic-memory-architect because that subsystem was orphaned; your project won’t), the write scopes (which paths belong to which role), the CI mirror command list (every repo’s .github/workflows/ is different), and the role-selection rubric inside gh-triage.
The natural next step is a skill — let’s call it agent-team-bootstrap — that analyzes an existing repository and proposes a starter roster and rubric. Its procedure would be: inventory languages and workspaces (pyproject.toml, package.json, go.mod, Cargo.toml, …), identify major subsystems by directory clustering and recent-touch frequency (git log --since=1.year --stat), detect orphaned subsystems (low recent-touch count + high code volume + high issue-title mention rate), enumerate GitHub Actions workflows and translate each into a run-ci-locally command, propose a specialist role per identified subsystem, and write the role-selection rubric from what it learned. It would ask the operator three or four questions — governance (who signs commits, who merges), release cadence, which languages matter, what the human doesn’t want delegated — and then render the .claude/ tree.
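The orphan-detection heuristic in that procedure can be made concrete. A sketch under invented scoring assumptions (the weights, caps, and threshold below are mine; a real skill would tune them against the repository):

```python
def orphan_score(recent_commits: int, loc: int, issue_mentions: int) -> float:
    """Higher means more likely orphaned: lots of code and issue traffic,
    little recent maintenance. Weights and caps are illustrative only."""
    neglect = 1.0 / (1.0 + recent_commits)   # few recent touches -> near 1.0
    volume = min(loc / 10_000, 1.0)          # cap large subsystems at 1.0
    demand = min(issue_mentions / 20, 1.0)   # cap issue chatter at 1.0
    return round(neglect * (0.5 * volume + 0.5 * demand), 3)

def flag_orphans(subsystems: dict, threshold: float = 0.3) -> list:
    """subsystems: name -> (recent_commits, loc, issue_mentions)."""
    return sorted(
        name for name, stats in subsystems.items()
        if orphan_score(*stats) >= threshold
    )
```

On MemMachine's numbers this would have flagged semantic memory: high code volume, plenty of issue mentions, and nobody committing to it.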
The key insight is that none of this needs to be generic. The skill generates a project-specific agent team by reading the project. A skeleton plus a good rubric plus the foundational repo-agnostic skills gets you 70% of the way; the operator adds the last 30% (specialists for weird internal subsystems, write-scope exceptions, taste calls on cadence).
I’m going to write that skill next. When I have it, I’ll post a follow-up with the script and a worked example on a different repository.
What I’d Do Differently
- Start the agent memory layer immediately. I built .claude/memory/ as plain append-only markdown per agent. It’s a great substrate, but I haven’t populated it yet. The first few real tasks should surface learnings (ruff rejects pattern X, installation-test flakes on Python 3.14, etc.) that go straight into .claude/memory/shared.md. An empty memory layer is a missed opportunity.
- Dogfood the product. MemMachine is a memory layer. The long-term version of .claude/memory/ should be a local MemMachine server that agents query at spawn time. That’s a stretch goal, not day-one work, but it’s the right arc.
- Weekly token-cost review. /mm:status shows a per-agent token tally for the week. I should read that report weekly and retire agents whose spend outpaces their value.
Artifacts
The full plan file is at ~/.claude/plans/floating-snuggling-unicorn.md in my local setup. Key artifacts from the build, in case you want the exact shapes:
- settings.local.json with the experimental flag, three hooks, and a broader permissions allowlist
- CLAUDE.md at repo root with repo map, commands, Karpathy guardrails, secrets hygiene, Agent Teams caveats, keybinding crib
- .claude/agents/*.md — 22 role definitions
- .claude/skills/*/SKILL.md — 19 skill packages
- .claude/commands/mm/*.md — 22 slash commands
- .claude/hooks/*.sh and .py — the four hooks
- .claude/todos/todo-*.md — 11 seeded TODOs pre-loaded so /mm:todo-list shows a real backlog from day one
Total cost of the build: a few hours of interactive session, one long plan file, and a lot of trust in the Karpathy guardrails.
If you’re reproducing this for your own project and get stuck on a specific decision, or if you want to compare notes on writing the agent-team-bootstrap skill, email me. I’m happy to trade design choices.


