second-brain/04_Topics/AI-Assisted_Coding_Harnesses.md

AI-Assisted Coding Productivity Harnesses

Research date: 2026-04-22
Research method: Four parallel subagent investigations + cross-review synthesis
Sources: Anthropic, OpenAI, GitHub, Meta Engineering, Stripe, METR, Uplevel, GitClear, Stack Overflow


Executive Summary

This note compares four distinct productivity harnesses for solo developers using multiple AI coding tools (Codex, Claude Code, OpenCode, Gemini CLI, OpenRouter, OpenClaw). It's based on research into what actually works, what fails, and how experienced developers avoid the traps of "vibe coding."

Key finding: The strongest developers are not "vibe coding harder." They're building a small operating system around the models: instruction files, scoped memory, background workers, verification hooks, separate review, and compact session handoffs.

Critical warning: METR's July 2025 study of 16 experienced open-source developers found AI use made them 19% slower on real tasks, even though they expected 24% speedup and still felt 20% faster afterward. The illusion of productivity is real.


The Four Harnesses

Harness 1: The Coding Agent Army

Concept: Run a small team of specialized AI workers in parallel. You are the dispatcher and reviewer, not the typist.

Tools & Roles:

  • OpenClaw: orchestrator, routing, memory glue, background jobs
  • Codex: backend and refactor agent (parallel background execution)
  • Claude Code: frontend/UI and architecture-heavy agent
  • OpenCode: test-fix and repo-wide cleanup agent
  • Gemini CLI: long-context reader, docs digestion, codebase summarizer
  • OpenRouter: model switchboard for cheap classification/planning/second opinions

Agent Lineup:

  1. Frontend agent — UI components, styling, interaction bugs
  2. Backend agent — API routes, DB logic, services
  3. Test agent — unit/integration/e2e tests, repros, CI fixes
  4. Docs agent — changelogs, migration notes, onboarding docs
  5. Architect agent (optional) — no direct edits, only plans and reviews

Memory Strategy: Each agent gets its own brief system prompt, task scratchpad, known-files list, and small memory file (agents/frontend.md, agents/backend.md, etc.). Shared inputs: PROJECT.md for architecture, TASK.md for the ticket, DECISIONS.md for accepted choices. Rule: no agent gets the whole repo context by default.
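The "no full repo by default" rule can be enforced mechanically when assembling an agent's context. A minimal sketch in Python, illustrative only (`build_context` is our name, not part of any tool above; the file layout mirrors the conventions in this note):

```python
from pathlib import Path

# Shared inputs every specialist receives, per the memory strategy above.
SHARED = ["PROJECT.md", "TASK.md", "DECISIONS.md"]

def build_context(agent: str, known_files: list[str], root: str = ".") -> str:
    """Assemble one agent's context: its own memory file, the shared
    inputs, and only the files on its known-files list. Missing files
    are skipped; the whole repo is never included."""
    base = Path(root)
    parts = []
    for rel in [f"agents/{agent}.md", *SHARED, *known_files]:
        p = base / rel
        if p.is_file():
            parts.append(f"## {rel}\n{p.read_text()}")
    return "\n\n".join(parts)
```

The point of the sketch is the allowlist: an agent sees its role memory plus an explicit file bundle, nothing else.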

Workflow Example — "Add teams feature with invite flow":

  1. OpenClaw parses ticket, splits into UI/API/tests/docs
  2. Architect agent drafts task graph (API first, UI second, tests alongside, docs last)
  3. Backend agent adds POST /teams, invite token model, permission checks
  4. Frontend agent builds create-team modal and invite screen
  5. Test agent writes API tests and invite flow Playwright test
  6. Docs agent updates README, env vars, admin notes
  7. Orchestrator rebases outputs, resolves overlaps, runs lint/test
  8. Human reviews final diff, not every micro-step

Strengths:

  • Fastest for medium-to-large scoped work
  • Great when tasks decompose cleanly
  • Lets different models do what they're best at
  • Reduces "one agent forgot half the ticket" failure

Weaknesses:

  • Agents step on the same files
  • Inconsistent naming or architecture across agents
  • Duplicated logic across layers
  • Orchestration overhead kills speed on small tasks
  • Weak dispatcher means chaos

Token Efficiency:

  • Use OpenRouter cheap models for triage/routing
  • Use Gemini CLI only for repo summarization or large-doc ingestion
  • Give each specialist a file bundle, not full repo
  • Persist role memories so you don't re-explain conventions every run
  • Reserve expensive models for synthesis or tricky patches

Quality Maintenance:

  • One canonical DECISIONS.md
  • One merge gate: lint, tests, typecheck, formatting
  • Architect agent reviews cross-cutting consistency
  • Test agent must validate every nontrivial change
  • Human signs off on schema, auth, and UX changes

Opinionated Rule: Never let all agents write directly to main. They work in isolated branches or patch outputs, then one orchestrator composes.


Harness 2: The Unified Context Stack

Concept: Use one primary agent with one shared project memory so the system stays coherent. Optimize for understanding, not parallelism.

Tools & Roles:

  • OpenClaw: primary shell, memory manager, execution layer
  • Codex: main coding engine (background tasks, sandboxes)
  • Gemini CLI: massive-context reader when needed
  • Claude Code: only when you need subagents or deep repo analysis
  • OpenCode: only for specific language/tool gaps

Memory Strategy: Single source of truth:

  • AGENTS.md or CLAUDE.md — project constitution, coding standards, build/test commands
  • .github/instructions/*.instructions.md — path-scoped rules (backend vs frontend)
  • notes/current-task.md — objective, changed files, failing tests, next step
  • notes/decisions.md — non-obvious decisions and why
  • session_handoff.md — last decisions, blockers, next move

Meta's advice: concise navigation beats giant docs. They recommend 25-35 line context files. Anthropic says keep CLAUDE.md under 200 lines — large memory files hurt adherence.
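Those line budgets are cheap to enforce with a check in CI or a pre-commit hook. Illustrative sketch (`check_memory_file` is our name; the 200-line default mirrors the Anthropic guidance quoted above):

```python
def check_memory_file(text: str, limit: int = 200) -> tuple[bool, int]:
    """Return (ok, line_count) for a memory file, ignoring blank lines
    so whitespace costs nothing against the budget."""
    count = sum(1 for line in text.splitlines() if line.strip())
    return count <= limit, count
```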

Workflow Example:

  1. Read AGENTS.md, plan.md, todo.md, session_handoff.md
  2. Inspect changed files / git diff
  3. Ask agent: "Identify files, invariants, commands, failure modes. Do not edit."
  4. Save result in short task brief
  5. Then implement
  6. Run tests/lint/typecheck
  7. Update session_handoff.md with decisions and next steps

Strengths:

  • Deep coherence — one model maintains full context
  • No orchestration overhead
  • Simple mental model
  • Best for complex architectural work requiring continuity

Weaknesses:

  • Single point of failure
  • No parallelization
  • Context window limits on large codebases
  • One tool's blind spots become your blind spots

Token Efficiency:

  • Use /clear between unrelated tasks
  • Use /compact with focus instructions
  • Prefer Sonnet for most work, Haiku for simple subagents
  • Prefer CLI tools over MCP when possible (MCP tool listings add context overhead)
  • Use hooks/skills to preprocess huge outputs before model sees them

Quality Maintenance:

  • Force tests, linters, typecheck, diff review into loop
  • Separate implementation from validation
  • Require task-specific acceptance criteria
  • Use hooks to reject bad writes or auto-run checks
  • Post-edit hook: run formatter/linter/tests on touched files
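A post-edit hook can be as simple as mapping touched files to check commands. Sketch only: the `CHECKS` table and the ruff/eslint/tsc commands are placeholder assumptions; substitute your repo's actual toolchain:

```python
import fnmatch

# Hypothetical mapping from file patterns to verification commands.
CHECKS = [
    ("*.py",  ["ruff check {file}", "pytest --quiet"]),
    ("*.ts",  ["eslint {file}", "tsc --noEmit"]),
    ("*.tsx", ["eslint {file}", "tsc --noEmit"]),
]

def post_edit_commands(touched: list[str]) -> list[str]:
    """Given the files an agent just edited, return the deduplicated
    list of commands a post-edit hook should run on them."""
    cmds: list[str] = []
    for f in touched:
        for pattern, templates in CHECKS:
            if fnmatch.fnmatch(f, pattern):
                for t in templates:
                    cmd = t.format(file=f)
                    if cmd not in cmds:
                        cmds.append(cmd)
    return cmds
```

Deduplication matters: editing five Python files should trigger one test run, not five.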

Harness 3: The Human-in-the-Loop Pipeline

Concept: AI generates, human validates at key gates. Structured workflow: spec → AI draft → human review → AI refine → human approve. Emphasis on quality over speed.

Tools & Roles:

  • Claude Code: planning, architecture, spec generation
  • Codex: implementation in isolated branch/worktree
  • OpenCode/Gemini: secondary review or specific gap filling
  • OpenClaw: orchestration, task state management, gate enforcement

Workflow:

  1. Spec Gate (human writes/approves):

    • Clear acceptance criteria
    • Test strategy
    • Files expected to change
    • Architecture invariants not to break
  2. AI Draft (Codex or Claude Code):

    • Implement to spec
    • Run tests
    • Produce structured self-report: problem, root cause, files touched, tests added, risks
  3. Review Gate (human + AI review agent):

    • Human: does this match the spec?
    • AI reviewer: "critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions"
    • Must pass both
  4. Refine Loop (AI fixes, human re-reviews):

    • Max 2-3 iterations
    • If still not passing, escalate to human rewrite
  5. Merge Gate (human only):

    • Final approval
    • Especially for auth, schema, UX changes

Memory Strategy:

  • Spec lives in specs/YYYY-MM-DD_feature.md
  • Review feedback lives in reviews/
  • Decision log in DECISIONS.md
  • Each iteration updates session_handoff.md
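Scaffolding the spec file keeps the Spec Gate cheap to open. A minimal sketch; `create_spec` and `SPEC_TEMPLATE` are hypothetical names, and the four sections mirror the gate's checklist above:

```python
import datetime
import pathlib

SPEC_TEMPLATE = """# Spec: {feature}

## Acceptance criteria
- ...

## Test strategy
- ...

## Files expected to change
- ...

## Invariants not to break
- ...
"""

def create_spec(feature: str, day: datetime.date,
                root: str = ".") -> pathlib.Path:
    """Create specs/YYYY-MM-DD_feature.md pre-filled with the four
    Spec Gate sections, ready for the human to fill in and approve."""
    slug = feature.lower().replace(" ", "_")
    path = pathlib.Path(root) / "specs" / f"{day.isoformat()}_{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(SPEC_TEMPLATE.format(feature=feature))
    return path
```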

Strengths:

  • Highest quality output
  • Human maintains architectural ownership
  • Catches AI misunderstandings early
  • Builds trust through verification

Weaknesses:

  • Slowest of the four
  • Requires human availability at gates
  • Can feel like "AI-assisted bureaucracy" if gates are too heavy
  • Risk: human becomes bottleneck

Token Efficiency:

  • Invest tokens in spec clarity upfront (saves rewrite tokens later)
  • Use cheap model for first draft, expensive model for review
  • Cancel refinement loops early if direction is wrong

Quality Maintenance:

  • PR template section: "what was verified, with which command, on which inputs"
  • If no test exists, agent must add one or explain why not
  • Stripe found agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.
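Stripe's scripts are not public, but the idea is easy to sketch: generate data that deliberately includes the edge cases agents skip. Hypothetical example for the invite flow from Harness 1 (`make_invites` is our name):

```python
import random
import string

def make_invites(n: int, seed: int = 0) -> list[dict]:
    """Generate realistic-looking invite records for tests, including
    edge cases verification often misses: expired tokens and, for
    n >= 2, a duplicate email address."""
    rng = random.Random(seed)  # seeded for reproducible test runs
    invites = []
    for i in range(n):
        token = "".join(
            rng.choices(string.ascii_lowercase + string.digits, k=16))
        invites.append({
            "email": f"user{i % max(1, n - 1)}@example.com",  # wraps -> dup
            "token": token,
            "expired": rng.random() < 0.2,
        })
    return invites
```

A test suite fed this data has to handle duplicates and expiry, which is exactly what a "looks green" verification run never exercises.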

Harness 4: The Minimalist Vibe Coder

Concept: Fewest tools possible, maximum leverage. One primary tool, others only for specific gaps. Emphasis on developer judgment and taste.

Tools:

  • One primary: Claude Code or Codex (pick one, know it deeply)
  • One backup: OpenRouter for cheap second opinions or different model access
  • OpenClaw: only for orchestration when you need background tasks

Workflow:

  1. Start with human-written plan (5-10 lines)
  2. AI implements in small chunks (one file or one function at a time)
  3. Human reviews immediately (don't batch)
  4. Run tests after every meaningful change
  5. Commit frequently (micro-commits)
  6. If AI goes off track, reset (/clear) and restate the plan

Memory Strategy:

  • One AGENTS.md in repo root
  • One TODO.md for current task
  • Git history is your memory (small commits, clear messages)
  • No complex orchestration, no handoff files

Strengths:

  • Lowest overhead
  • Fastest for small-to-medium tasks
  • Human maintains full context
  • No tool-bloat confusion
  • Best for experienced developers with strong taste

Weaknesses:

  • No parallelization
  • No background work
  • Harder for large, multi-file features
  • Requires strong human judgment to know when AI is wrong

Token Efficiency:

  • Most efficient — no orchestration tokens, no context replication
  • Only pay for actual coding
  • Reset context aggressively between tasks

Quality Maintenance:

  • Human is the quality filter
  • Frequent commits = easy rollback
  • Small changes = easy review
  • Strong test discipline

Comparative Analysis

| Dimension | Agent Army | Unified Context | Human-in-Loop | Minimalist |
| --- | --- | --- | --- | --- |
| Speed | Fastest for large work | Medium | Slowest | Fastest for small work |
| Quality | Medium (needs orchestrator) | High | Highest | High (human-dependent) |
| Complexity | High | Medium | Medium | Low |
| Token Cost | Highest | Medium | Medium | Lowest |
| Setup Time | High | Medium | Medium | None |
| Best For | Large features, migrations | Complex architecture | Safety-critical code | Daily dev, quick wins |
| Failure Mode | Orchestration chaos | Context window limits | Human bottleneck | Human gets lazy |

What Actually Makes You a 10x Engineer

The research was clear: output volume ≠ productivity. Here's what actually matters:

1. Mastering Memory (Layered External Memory)

"Mastering memory" is not about magical long-term recall. It's about building a layered external memory system the agent can reload cheaply and consistently.

The four layers:

| Layer | Purpose | Example Files |
| --- | --- | --- |
| Global | Personal preferences, cross-project conventions | `~/.config/AGENT.md` |
| Repo-wide | Project constitution, build/test commands | `AGENTS.md`, `CLAUDE.md` |
| Subsystem | Domain/path-specific rules | `packages/api/AGENTS.md`, `.cursor/rules/backend.mdc` |
| Session | Active task state, handoffs | `notes/current-task.md`, `session_handoff.md` |

Key principles:

  • Keep each layer small and specific (root files < 200 lines)
  • Use imports (@AGENTS.md) to avoid duplication
  • Move deterministic behavior into hooks/scripts, not prompts
  • Prefer retrieval/search over full-repo preload for large codebases

Meta's finding: Precompute context with specialized agents first, then let execution agents work from that map. They used 50+ specialized agents to build concise context artifacts and got ~40% fewer tool calls per task.

2. Warm-Starting Sessions

Strong pattern for starting a new session:

git status
git diff --stat
cat AGENTS.md CLAUDE.md
cat notes/current-task.md notes/session_handoff.md
rg -n "TODO|FIXME|HACK" .

Then ask: "Based on these files, what are we doing and what's the next step?" This gives the agent exactly the context it needs, nothing more.

3. Token Budget Management

Proven patterns:

  • Reset aggressively: Use /clear between unrelated tasks
  • Compact with focus: Use /compact with specific instructions ("keep only the auth-related context")
  • Scope instruction files: Keep root files short, move detailed rules into path-scoped files
  • Prefer CLI over MCP: MCP tool listings add context overhead; use CLI tools when possible
  • Model selection: Sonnet for most work, Haiku for simple subagents, Opus only for synthesis
  • Avoid re-feeding: Don't paste the same repo context into multiple tools

Anthropic's explicit recommendation: target under 200 lines per CLAUDE.md. Large files hurt adherence.

4. Verification Hooks (Not Trust)

AI writes, systems verify. Proven guardrails:

  • Force tests, linters, typecheck, diff review into the loop
  • Separate implementation from validation
  • Post-edit hook: run formatter/linter/tests on touched files
  • PR template: "what was verified, with which command, on which inputs"
  • If no test exists, agent must add one or explain why not

Stripe's finding: Agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.
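A merge gate can aggregate these checks instead of trusting the agent's self-report. Minimal sketch (`run_gate` is a hypothetical helper; feed it whatever commands your repo uses):

```python
import subprocess

def run_gate(commands: list[list[str]]) -> dict:
    """Run each verification command and collect failures; the gate
    passes only if every command exits 0."""
    failures = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failures.append({"cmd": cmd, "stderr": proc.stderr.strip()})
    return {"passed": not failures, "failures": failures}
```

The structured failure list doubles as input for the next refinement loop: the agent gets the stderr, not a vague "tests failed".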

5. Map First, Then Code

Before editing, ask: "Identify files, invariants, commands, failure modes. Do not edit." Save result in short task brief. Then implement. This prevents the most expensive failure mode: rewriting the same code 3 times because the agent misunderstood the architecture.

6. Isolate Noisy Work

Spawn subagents for tests, logs, docs search. Return only failing cases, stack traces, or summary. Don't flood the main thread with raw output.
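The filtering step can be sketched as a simple line filter, assuming pytest-style output; the marker list is a heuristic, not a spec, and `summarize_test_output` is our name:

```python
def summarize_test_output(raw: str, max_lines: int = 40) -> str:
    """Keep only failure-relevant lines from raw test output so the
    main thread sees stack traces, not thousands of passing lines."""
    markers = ("FAILED", "ERROR", "Traceback", "AssertionError")
    kept = [line for line in raw.splitlines()
            if any(m in line for m in markers)]
    if not kept:
        return "all checks passed"
    return "\n".join(kept[:max_lines])
```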

7. Review with a Different Agent

"Critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions." The reviewer/author separation catches hallucinated fixes and hidden damage.

8. Compact Session Handoffs

Store only: objective, touched files, commands run, current failure, next move. New session starts from that, not old chat. Use /clear between unrelated tasks.
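The handoff can be generated so it never bloats. Illustrative only (`write_handoff` is our name; the five fields are the ones listed above):

```python
def write_handoff(objective: str, touched: list[str], commands: list[str],
                  failure: str, next_move: str) -> str:
    """Render the five-field session handoff as markdown; a new session
    starts from this short brief instead of replaying the old chat."""
    return "\n".join([
        "# Session Handoff",
        f"## Objective\n{objective}",
        "## Touched files\n" + "\n".join(f"- {f}" for f in touched),
        "## Commands run\n" + "\n".join(f"- {c}" for c in commands),
        f"## Current failure\n{failure or 'none'}",
        f"## Next move\n{next_move}",
    ])
```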


Failure Modes to Avoid

The Illusion of Productivity

  • METR study: AI made experienced developers 19% slower on real tasks
  • Uplevel study: 41% more bugs in Copilot group, little productivity gain
  • GitClear 2025: copy/paste exceeded moved/refactored code for first time
  • Stack Overflow 2026: 84% using AI, only 29% trust it

Context Fragmentation

Same task split across Cursor, Copilot, Claude, ChatGPT — each with different partial context. Result: tool disagreement, prompt drift, lost provenance, duplicate review burden.

Vibe Coding Debt

Code that demos well but isn't built like a system:

  • Auth logic that "works" but is unsafe
  • Broad catch blocks and silent failure
  • Bloated schemas with no domain fit
  • Inconsistent architectural style across files
  • Generated comments masking weak reasoning

Token Waste Patterns

  • Re-feeding same repo context into several tools
  • Asking for large rewrites before pinning requirements
  • Generating full files when diff or function-level patch would do
  • Using chat models for validation instead of running tests
  • Bouncing between tools for "second opinions"

Cross-Tool Standard (AGENTS.md)

The emerging standard is hierarchical instruction files that work across tools:

repo/
├── AGENTS.md                    # universal/cross-tool project constitution
├── CLAUDE.md                    # imports AGENTS.md, adds Claude-specific notes
├── .cursor/
│   └── rules/
│       ├── backend.mdc           # path-scoped rules (auto-attached by glob)
│       └── frontend.mdc
├── .github/
│   ├── copilot-instructions.md  # repo-wide instructions
│   └── instructions/
│       ├── backend.instructions.md    # path-scoped (applyTo: "**/*.py")
│       └── typescript.instructions.md # path-scoped (applyTo: "**/*.ts")
├── packages/
│   ├── api/
│   │   └── AGENTS.md            # subsystem rules (nearest takes precedence)
│   └── web/
│       └── AGENTS.md
├── notes/
│   ├── current-task.md          # objective, files, failure, next step
│   ├── decisions.md             # non-obvious choices and why
│   └── session_handoff.md       # last decisions, blockers
├── specs/                       # human-approved specs
├── reviews/                     # AI review outputs
└── scripts/
    ├── verify.sh                # test + lint + typecheck
    └── pre-commit.sh            # auto-run checks

Why This Structure Works

  • Stable memory (repo-wide): AGENTS.md — coding standards, build commands, architecture invariants
  • Scoped memory (path-specific): .cursor/rules/*.mdc, .github/instructions/*.md — backend/frontend conventions
  • Volatile memory (session): notes/current-task.md, session_handoff.md — active task state
  • Retrieval over replay: New sessions load only the 20-line task brief, not 50k tokens of old chat

Tool-Specific File Formats

| File | Tool | Scope | Notes |
| --- | --- | --- | --- |
| `AGENTS.md` | Cross-tool | Repo-wide | Universal standard, keep < 200 lines |
| `CLAUDE.md` | Claude Code | Repo-wide | Imports AGENTS.md via `@AGENTS.md`. Keep < 200 lines. Large files hurt adherence. |
| `.cursorrules` | Cursor | Repo-wide | Legacy, being replaced by `.cursor/rules/*.mdc` |
| `.cursor/rules/*.mdc` | Cursor | Path-scoped | Auto-attached by glob. Can nest in subdirectories. |
| `.github/copilot-instructions.md` | Copilot | Repo-wide | One per repo |
| `.github/instructions/*.instructions.md` | Copilot | Path-scoped | Uses frontmatter: `applyTo: "**/*.ts"` |
| `AGENT.md` | Emerging | Any | Proposed standard with merge behavior (nearest wins) |

Example AGENTS.md (Minimal, Effective)

# AGENTS.md

## Commands
- Install: `pnpm install`
- Test: `pnpm test`
- Lint: `pnpm lint`
- Typecheck: `pnpm typecheck`

## Guardrails
- Never edit `src/generated/**`
- Prefer `rg` over slower recursive search
- Run tests relevant to changed files before finishing

## Architecture
- API handlers: `packages/api/src/handlers`
- UI components: `packages/web/src/components`
- Shared schemas: `packages/shared/src/schema`

## Workflow
- For multi-file changes, update `session_handoff.md`
- Record architectural decisions in `notes/decisions/`

Example session_handoff.md

# Session Handoff

## Last completed
- Migrated auth middleware to token refresh flow

## Decisions
- Keep REST externally, internal services move to RPC
- Do not delete old middleware until admin routes migrated

## Next
- Update admin route guards
- Add integration tests for expired-token refresh

## Blockers
- CI sandbox rate limit on auth provider

Path-Scoped Rules Example

.github/instructions/typescript.instructions.md:

---
applyTo: "**/*.ts,**/*.tsx"
---
- Use zod for runtime validation
- Prefer functional components
- No implicit any

.cursor/rules/backend.mdc:

---
description: Backend API conventions
globs: "packages/api/**/*.py"
---
- Use FastAPI dependency injection
- All handlers must have type hints
- Return Pydantic models, not dicts
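How a tool resolves which path-scoped rules attach to a file can be sketched with glob matching. Illustrative only: real tools have their own matchers (Python's `fnmatch` treats `**` the same as `*`), and `attached_rules` is our name; the comma-separated glob format mirrors Copilot's `applyTo` frontmatter:

```python
import fnmatch

def attached_rules(path: str, rules: dict[str, str]) -> list[str]:
    """Given a file path and {rule_name: comma-separated globs},
    return the names of rule files that attach to that path."""
    hits = []
    for name, globs in rules.items():
        for g in globs.split(","):
            if fnmatch.fnmatch(path, g.strip()):
                hits.append(name)
                break  # one matching glob is enough to attach
    return hits
```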

My Recommendation for Claudio

Given your preference for direct, high-leverage work and your experience level, I'd suggest a hybrid approach:

Default: Minimalist Vibe Coder (Harness 4)

  • One primary tool (Claude Code)
  • Small plans, immediate review, frequent commits
  • Lowest overhead, highest human agency

When a feature is large or cross-cutting: Switch to Unified Context Stack (Harness 2)

  • One agent, deep context
  • Map first, then code
  • Externalize memory in AGENTS.md and notes/

When you need overnight or parallel work: Spawn background agents via OpenClaw

  • But treat them as draft generators, not autonomous engineers
  • Morning review is mandatory
  • Never auto-merge

Avoid: The full Coding Agent Army unless you genuinely have orchestration time to invest. The overhead often exceeds the value for solo developers.

Never skip: Verification hooks. AI writes, tests verify, human approves. That's the real 10x pattern.


Sources