AI-Assisted Coding Productivity Harnesses
Research date: 2026-04-22
Research method: Four parallel subagent investigations + cross-review synthesis
Sources: Anthropic, OpenAI, GitHub, Meta Engineering, Stripe, METR, Uplevel, GitClear, Stack Overflow
Executive Summary
This note compares four distinct productivity harnesses for solo developers using multiple AI coding tools (Codex, Claude Code, OpenCode, Gemini CLI, OpenRouter, OpenClaw). It's based on research into what actually works, what fails, and how experienced developers avoid the traps of "vibe coding."
Key finding: The strongest developers are not "vibe coding harder." They're building a small operating system around the models: instruction files, scoped memory, background workers, verification hooks, separate review, and compact session handoffs.
Critical warning: METR's July 2025 study of 16 experienced open-source developers found AI use made them 19% slower on real tasks, even though they expected 24% speedup and still felt 20% faster afterward. The illusion of productivity is real.
The Four Harnesses
Harness 1: The Coding Agent Army
Concept: Run a small team of specialized AI workers in parallel. You are the dispatcher and reviewer, not the typist.
Tools & Roles:
- OpenClaw: orchestrator, routing, memory glue, background jobs
- Codex: backend and refactor agent (parallel background execution)
- Claude Code: frontend/UI and architecture-heavy agent
- OpenCode: test-fix and repo-wide cleanup agent
- Gemini CLI: long-context reader, docs digestion, codebase summarizer
- OpenRouter: model switchboard for cheap classification/planning/second opinions
Agent Lineup:
- Frontend agent — UI components, styling, interaction bugs
- Backend agent — API routes, DB logic, services
- Test agent — unit/integration/e2e tests, repros, CI fixes
- Docs agent — changelogs, migration notes, onboarding docs
- Architect agent (optional) — no direct edits, only plans and reviews
Memory Strategy:
Each agent gets its own brief system prompt, task scratchpad, known-files list, and small memory file (agents/frontend.md, agents/backend.md, etc.). Shared inputs: PROJECT.md for architecture, TASK.md for the ticket, DECISIONS.md for accepted choices. Rule: no agent gets the whole repo context by default.
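As a sketch of this strategy, one role memory file might look like the following (the file name matches the layout above; the component names and conventions are purely illustrative):

```md
# agents/frontend.md

## Role
UI components, styling, interaction bugs. No API or schema edits.

## Known files
- packages/web/src/components/
- packages/web/src/styles/

## Conventions learned
- Modals use the shared Dialog wrapper
- Every new component gets a Playwright smoke test
```

Keeping these files short means they can be reloaded at the start of every run without meaningful token cost.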
Workflow Example — "Add teams feature with invite flow":
- OpenClaw parses ticket, splits into UI/API/tests/docs
- Architect agent drafts task graph (API first, UI second, tests alongside, docs last)
- Backend agent adds `POST /teams`, invite token model, permission checks
- Frontend agent builds create-team modal and invite screen
- Test agent writes API tests and invite flow Playwright test
- Docs agent updates README, env vars, admin notes
- Orchestrator rebases outputs, resolves overlaps, runs lint/test
- Human reviews final diff, not every micro-step
Strengths:
- Fastest for medium-to-large scoped work
- Great when tasks decompose cleanly
- Lets different models do what they're best at
- Reduces "one agent forgot half the ticket" failure
Weaknesses:
- Agents step on the same files
- Inconsistent naming or architecture across agents
- Duplicated logic across layers
- Orchestration overhead kills speed on small tasks
- Weak dispatcher means chaos
Token Efficiency:
- Use OpenRouter cheap models for triage/routing
- Use Gemini CLI only for repo summarization or large-doc ingestion
- Give each specialist a file bundle, not full repo
- Persist role memories so you don't re-explain conventions every run
- Reserve expensive models for synthesis or tricky patches
Quality Maintenance:
- One canonical `DECISIONS.md`
- One merge gate: lint, tests, typecheck, formatting
- Architect agent reviews cross-cutting consistency
- Test agent must validate every nontrivial change
- Human signs off on schema, auth, and UX changes
Opinionated Rule: Never let all agents write directly to main. They work in isolated branches or patch outputs, then one orchestrator composes.
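Git worktrees are one way to enforce this rule mechanically: each agent gets its own checkout on its own branch, and the orchestrator merges afterwards. A sketch (the first two lines just create a throwaway demo repo; directory and branch names are illustrative):

```sh
# Demo setup: a throwaway repo (in practice, run inside your project).
cd "$(mktemp -d)" && git init -q .
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "init"

# One isolated worktree per agent, each on its own branch.
git worktree add ../wt-backend  -b agent/backend
git worktree add ../wt-frontend -b agent/frontend
git worktree add ../wt-tests    -b agent/tests

# Each agent edits only its own worktree; the orchestrator merges
# the agent/* branches and runs the shared verify gate afterwards.
git worktree list
```

Because worktrees share one object store, the orchestrator can diff and merge the agent branches without any copying between checkouts.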
Harness 2: The Unified Context Stack
Concept: Use one primary agent with one shared project memory so the system stays coherent. Optimize for understanding, not parallelism.
Tools & Roles:
- OpenClaw: primary shell, memory manager, execution layer
- Codex: main coding engine (background tasks, sandboxes)
- Gemini CLI: massive-context reader when needed
- Claude Code: only when you need subagents or deep repo analysis
- OpenCode: only for specific language/tool gaps
Memory Strategy: Single source of truth:
- `AGENTS.md` or `CLAUDE.md` — project constitution, coding standards, build/test commands
- `.github/instructions/*.instructions.md` — path-scoped rules (backend vs frontend)
- `notes/current-task.md` — objective, changed files, failing tests, next step
- `notes/decisions.md` — non-obvious decisions and why
- `session_handoff.md` — last decisions, blockers, next move
Meta's advice: concise navigation beats giant docs. They recommend 25-35 line context files. Anthropic says keep CLAUDE.md under 200 lines — large memory files hurt adherence.
Workflow Example:
- Read `AGENTS.md`, `plan.md`, `todo.md`, `session_handoff.md`
- Inspect changed files / git diff
- Ask agent: "Identify files, invariants, commands, failure modes. Do not edit."
- Save result in short task brief
- Then implement
- Run tests/lint/typecheck
- Update `session_handoff.md` with decisions and next steps
Strengths:
- Deep coherence — one model maintains full context
- No orchestration overhead
- Simple mental model
- Best for complex architectural work requiring continuity
Weaknesses:
- Single point of failure
- No parallelization
- Context window limits on large codebases
- One tool's blind spots become your blind spots
Token Efficiency:
- Use `/clear` between unrelated tasks
- Use `/compact` with focus instructions
- Prefer Sonnet for most work, Haiku for simple subagents
- Prefer CLI tools over MCP when possible (MCP tool listings add context overhead)
- Use hooks/skills to preprocess huge outputs before model sees them
Quality Maintenance:
- Force tests, linters, typecheck, diff review into loop
- Separate implementation from validation
- Require task-specific acceptance criteria
- Use hooks to reject bad writes or auto-run checks
- Post-edit hook: run formatter/linter/tests on touched files
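As a concrete sketch of a post-edit hook, Claude Code can run a verify script after every file edit via `.claude/settings.json`. The shape below follows Anthropic's hooks docs (linked in Sources), but the schema may evolve, so check your version; the script path is an assumption from the file layout used later in this note:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "./scripts/verify.sh" }
        ]
      }
    ]
  }
}
```

Because the hook runs deterministically outside the model, it enforces the check even when the agent forgets to.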
Harness 3: The Human-in-the-Loop Pipeline
Concept: AI generates, human validates at key gates. Structured workflow: spec → AI draft → human review → AI refine → human approve. Emphasis on quality over speed.
Tools & Roles:
- Claude Code: planning, architecture, spec generation
- Codex: implementation in isolated branch/worktree
- OpenCode/Gemini: secondary review or specific gap filling
- OpenClaw: orchestration, task state management, gate enforcement
Workflow:
1. Spec Gate (human writes/approves):
   - Clear acceptance criteria
   - Test strategy
   - Files expected to change
   - Architecture invariants not to break
2. AI Draft (Codex or Claude Code):
   - Implement to spec
   - Run tests
   - Produce structured self-report: problem, root cause, files touched, tests added, risks
3. Review Gate (human + AI review agent):
   - Human: does this match the spec?
   - AI reviewer: "critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions"
   - Must pass both
4. Refine Loop (AI fixes, human re-reviews):
   - Max 2-3 iterations
   - If still not passing, escalate to human rewrite
5. Merge Gate (human only):
   - Final approval
   - Especially for auth, schema, UX changes
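The refine loop's cap-and-escalate behavior can be sketched as a small shell wrapper; the verify and fix commands are placeholders for your own tooling:

```sh
# refine_gate: run the verify gate, let the agent retry up to 3 times,
# then escalate. $1 = verify command, $2 = agent-fix command (both
# placeholders for whatever your stack actually invokes).
refine_gate() {
  local verify="$1" fix="$2" max=3 i
  for i in $(seq 1 "$max"); do
    if "$verify"; then
      echo "gate passed on iteration $i"
      return 0
    fi
    echo "iteration $i failed; requesting agent refinement"
    "$fix"
  done
  echo "still failing after $max iterations: escalate to human rewrite"
  return 1
}
```

The hard iteration cap is the point: it converts "the agent keeps trying" into a bounded cost with a defined human escalation path.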
Memory Strategy:
- Spec lives in `specs/YYYY-MM-DD_feature.md`
- Review feedback lives in `reviews/`
- Decision log in `DECISIONS.md`
- Each iteration updates `session_handoff.md`
Strengths:
- Highest quality output
- Human maintains architectural ownership
- Catches AI misunderstandings early
- Builds trust through verification
Weaknesses:
- Slowest of the four
- Requires human availability at gates
- Can feel like "AI-assisted bureaucracy" if gates are too heavy
- Risk: human becomes bottleneck
Token Efficiency:
- Invest tokens in spec clarity upfront (saves rewrite tokens later)
- Use cheap model for first draft, expensive model for review
- Cancel refinement loops early if direction is wrong
Quality Maintenance:
- PR template section: "what was verified, with which command, on which inputs"
- If no test exists, agent must add one or explain why not
- Stripe found agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.
Harness 4: The Minimalist Vibe Coder
Concept: Fewest tools possible, maximum leverage. One primary tool, others only for specific gaps. Emphasis on developer judgment and taste.
Tools:
- One primary: Claude Code or Codex (pick one, know it deeply)
- One backup: OpenRouter for cheap second opinions or different model access
- OpenClaw: only for orchestration when you need background tasks
Workflow:
- Start with human-written plan (5-10 lines)
- AI implements in small chunks (one file or one function at a time)
- Human reviews immediately (don't batch)
- Run tests after every meaningful change
- Commit frequently (micro-commits)
- If AI goes off track, reset (`/clear`) and restate the plan
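The micro-commit discipline pays off when a chunk goes wrong, because rollback is a single command. A sketch (the first three lines create a throwaway demo repo; file names and messages are illustrative):

```sh
# Demo setup: a throwaway repo (in practice, run this in your project).
cd "$(mktemp -d)" && git init -q .
git config user.email dev@example.com && git config user.name dev

# One reviewed AI chunk -> one micro-commit.
echo "fn one" > notes.txt
git add notes.txt && git commit -q -m "notes: add fn one"

# The next chunk turns out to be a bad AI edit...
echo "bad edit" >> notes.txt
git add notes.txt && git commit -q -m "notes: bad edit"

# ...so rolling it back is one command, with history preserved.
git revert --no-edit HEAD
git log --oneline
```

`git revert` keeps the bad attempt in history for later inspection; `git reset --hard HEAD~1` is the destructive alternative when you want it gone entirely.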
Memory Strategy:
- One `AGENTS.md` in repo root
- One `TODO.md` for current task
- Git history is your memory (small commits, clear messages)
- No complex orchestration, no handoff files
Strengths:
- Lowest overhead
- Fastest for small-to-medium tasks
- Human maintains full context
- No tool-bloat confusion
- Best for experienced developers with strong taste
Weaknesses:
- No parallelization
- No background work
- Harder for large, multi-file features
- Requires strong human judgment to know when AI is wrong
Token Efficiency:
- Most efficient — no orchestration tokens, no context replication
- Only pay for actual coding
- Reset context aggressively between tasks
Quality Maintenance:
- Human is the quality filter
- Frequent commits = easy rollback
- Small changes = easy review
- Strong test discipline
Comparative Analysis
| Dimension | Agent Army | Unified Context | Human-in-Loop | Minimalist |
|---|---|---|---|---|
| Speed | Fastest for large work | Medium | Slowest | Fastest for small work |
| Quality | Medium (needs orchestrator) | High | Highest | High (human-dependent) |
| Complexity | High | Medium | Medium | Low |
| Token Cost | Highest | Medium | Medium | Lowest |
| Setup Time | High | Medium | Medium | None |
| Best For | Large features, migrations | Complex architecture | Safety-critical code | Daily dev, quick wins |
| Failure Mode | Orchestration chaos | Context window limits | Human bottleneck | Human gets lazy |
What Actually Makes You a 10x Engineer
The research was clear: output volume ≠ productivity. Here's what actually matters:
1. Mastering Memory (Layered External Memory)
"Mastering memory" is not about magical long-term recall. It's about building a layered external memory system the agent can reload cheaply and consistently.
The four layers:
| Layer | Purpose | Example Files |
|---|---|---|
| Global | Personal preferences, cross-project conventions | ~/.config/AGENT.md |
| Repo-wide | Project constitution, build/test commands | AGENTS.md, CLAUDE.md |
| Subsystem | Domain/path-specific rules | packages/api/AGENTS.md, .cursor/rules/backend.mdc |
| Session | Active task state, handoffs | notes/current-task.md, session_handoff.md |
Key principles:
- Keep each layer small and specific (root files < 200 lines)
- Use imports (`@AGENTS.md`) to avoid duplication
- Move deterministic behavior into hooks/scripts, not prompts
- Prefer retrieval/search over full-repo preload for large codebases
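The import principle in practice: a `CLAUDE.md` that stays tiny by pulling in the shared constitution. The `@path` import syntax is from Claude Code's memory docs (linked in Sources); the note lines are illustrative:

```md
@AGENTS.md

## Claude-specific notes
- Use plan mode before multi-file edits
- Spawn a subagent for log-heavy debugging
```

One file holds the truth; the tool-specific file only adds what that tool needs.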
Meta's finding: Precompute context with specialized agents first, then let execution agents work from that map. They used 50+ specialized agents to build concise context artifacts and got ~40% fewer tool calls per task.
2. Warm-Starting Sessions
Strong pattern for starting a new session:
```sh
git status
git diff --stat
cat AGENTS.md CLAUDE.md
cat notes/current-task.md notes/session_handoff.md
rg -n "TODO|FIXME|HACK" .
```
Then ask: "Based on these files, what are we doing and what's the next step?" This gives the agent exactly the context it needs, nothing more.
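The same warm-start can be wrapped in a reusable function so that missing files don't derail it (a sketch; the file list mirrors the memory layout used in this note, so adjust it to your repo):

```sh
# warm_start: print whichever standing context files exist, then repo state.
warm_start() {
  local f
  for f in AGENTS.md CLAUDE.md notes/current-task.md notes/session_handoff.md; do
    [ -f "$f" ] && { echo "=== $f ==="; cat "$f"; }
  done
  git status --short
  git diff --stat
  # TODO markers are useful context; ignore rg's nonzero exit on no matches.
  rg -n "TODO|FIXME|HACK" . 2>/dev/null || true
}
```

Paste the output into the new session as its opening context instead of replaying old chat.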
3. Token Budget Management
Proven patterns:
- Reset aggressively: use `/clear` between unrelated tasks
- Compact with focus: use `/compact` with specific instructions ("keep only the auth-related context")
- Scope instruction files: keep root files short, move detailed rules into path-scoped files
- Prefer CLI over MCP: MCP tool listings add context overhead; use CLI tools when possible
- Model selection: Sonnet for most work, Haiku for simple subagents, Opus only for synthesis
- Avoid re-feeding: Don't paste the same repo context into multiple tools
Anthropic's explicit recommendation: target under 200 lines per CLAUDE.md. Large files hurt adherence.
4. Verification Hooks (Not Trust)
AI writes, systems verify. Proven guardrails:
- Force tests, linters, typecheck, diff review into the loop
- Separate implementation from validation
- Post-edit hook: run formatter/linter/tests on touched files
- PR template: "what was verified, with which command, on which inputs"
- If no test exists, agent must add one or explain why not
Stripe's finding: Agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.
5. Map First, Then Code
Before editing, ask: "Identify files, invariants, commands, failure modes. Do not edit." Save result in short task brief. Then implement. This prevents the most expensive failure mode: rewriting the same code 3 times because the agent misunderstood the architecture.
6. Isolate Noisy Work
Spawn subagents for tests, logs, docs search. Return only failing cases, stack traces, or summary. Don't flood the main thread with raw output.
7. Review with a Different Agent
"Critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions." The reviewer/author separation catches hallucinated fixes and hidden damage.
8. Compact Session Handoffs
Store only: objective, touched files, commands run, current failure, next move. New session starts from that, not old chat. Use /clear between unrelated tasks.
Failure Modes to Avoid
The Illusion of Productivity
- METR study: AI made experienced developers 19% slower on real tasks
- Uplevel study: 41% more bugs in Copilot group, little productivity gain
- GitClear 2025: copy/paste exceeded moved/refactored code for first time
- Stack Overflow 2026: 84% using AI, only 29% trust it
Context Fragmentation
Same task split across Cursor, Copilot, Claude, ChatGPT — each with different partial context. Result: tool disagreement, prompt drift, lost provenance, duplicate review burden.
Vibe Coding Debt
Code that demos well but isn't built like a system:
- Auth logic that "works" but is unsafe
- Broad catch blocks and silent failure
- Bloated schemas with no domain fit
- Inconsistent architectural style across files
- Generated comments masking weak reasoning
Token Waste Patterns
- Re-feeding same repo context into several tools
- Asking for large rewrites before pinning requirements
- Generating full files when diff or function-level patch would do
- Using chat models for validation instead of running tests
- Bouncing between tools for "second opinions"
Recommended File Layout
Cross-Tool Standard (AGENTS.md)
The emerging standard is hierarchical instruction files that work across tools:
```
repo/
├── AGENTS.md                  # universal/cross-tool project constitution
├── CLAUDE.md                  # imports AGENTS.md, adds Claude-specific notes
├── .cursor/
│   └── rules/
│       ├── backend.mdc        # path-scoped rules (auto-attached by glob)
│       └── frontend.mdc
├── .github/
│   ├── copilot-instructions.md          # repo-wide instructions
│   └── instructions/
│       ├── backend.instructions.md      # path-scoped (applyTo: "**/*.py")
│       └── typescript.instructions.md   # path-scoped (applyTo: "**/*.ts")
├── packages/
│   ├── api/
│   │   └── AGENTS.md          # subsystem rules (nearest takes precedence)
│   └── web/
│       └── AGENTS.md
├── notes/
│   ├── current-task.md        # objective, files, failure, next step
│   ├── decisions.md           # non-obvious choices and why
│   └── session_handoff.md     # last decisions, blockers
├── specs/                     # human-approved specs
├── reviews/                   # AI review outputs
└── scripts/
    ├── verify.sh              # test + lint + typecheck
    └── pre-commit.sh          # auto-run checks
```
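The `scripts/verify.sh` gate from this layout might look like the following sketch. The gate runner is generic; the pnpm commands mirror the AGENTS.md example later in this note, so substitute your own stack:

```sh
# verify: run each gate command in order, fail fast, report what ran.
verify() {
  local step
  for step in "$@"; do
    echo ">>> $step"
    $step || { echo "FAILED: $step"; return 1; }
  done
  echo "all checks passed"
}

# Typical invocation (assumes the pnpm setup from AGENTS.md):
#   verify "pnpm lint" "pnpm typecheck" "pnpm test"
```

Failing fast keeps the agent's feedback loop tight: the first broken check is the one it should fix next.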
Why This Structure Works
- Stable memory (repo-wide): `AGENTS.md` — coding standards, build commands, architecture invariants
- Scoped memory (path-specific): `.cursor/rules/*.mdc`, `.github/instructions/*.md` — backend/frontend conventions
- Volatile memory (session): `notes/current-task.md`, `session_handoff.md` — active task state
- Retrieval over replay: new sessions load only the 20-line task brief, not 50k tokens of old chat
Tool-Specific File Formats
| File | Tool | Scope | Notes |
|---|---|---|---|
| `AGENTS.md` | Cross-tool | Repo-wide | Universal standard, keep < 200 lines |
| `CLAUDE.md` | Claude Code | Repo-wide | Imports AGENTS.md via `@AGENTS.md`. Keep < 200 lines; large files hurt adherence. |
| `.cursorrules` | Cursor | Repo-wide | Legacy, being replaced by `.cursor/rules/*.mdc` |
| `.cursor/rules/*.mdc` | Cursor | Path-scoped | Auto-attached by glob. Can nest in subdirectories. |
| `.github/copilot-instructions.md` | Copilot | Repo-wide | One per repo |
| `.github/instructions/*.instructions.md` | Copilot | Path-scoped | Uses frontmatter: `applyTo: "**/*.ts"` |
| `AGENT.md` | Emerging | Any | Proposed standard with merge behavior (nearest wins) |
Example AGENTS.md (Minimal, Effective)
```md
# AGENTS.md

## Commands
- Install: `pnpm install`
- Test: `pnpm test`
- Lint: `pnpm lint`
- Typecheck: `pnpm typecheck`

## Guardrails
- Never edit `src/generated/**`
- Prefer `rg` over slower recursive search
- Run tests relevant to changed files before finishing

## Architecture
- API handlers: `packages/api/src/handlers`
- UI components: `packages/web/src/components`
- Shared schemas: `packages/shared/src/schema`

## Workflow
- For multi-file changes, update `session_handoff.md`
- Record architectural decisions in `notes/decisions/`
```
Example session_handoff.md
```md
# Session Handoff

## Last completed
- Migrated auth middleware to token refresh flow

## Decisions
- Keep REST externally, internal services move to RPC
- Do not delete old middleware until admin routes migrated

## Next
- Update admin route guards
- Add integration tests for expired-token refresh

## Blockers
- CI sandbox rate limit on auth provider
```
Path-Scoped Rules Example
`.github/instructions/typescript.instructions.md`:

```md
---
applyTo: "**/*.ts,**/*.tsx"
---
- Use zod for runtime validation
- Prefer functional components
- No implicit any
```
`.cursor/rules/backend.mdc`:

```md
---
description: Backend API conventions
globs: "packages/api/**/*.py"
---
- Use FastAPI dependency injection
- All handlers must have type hints
- Return Pydantic models, not dicts
```
My Recommendation for Claudio
Given your preference for direct, high-leverage work and your experience level, I'd suggest a hybrid approach:
Default: Minimalist Vibe Coder (Harness 4)
- One primary tool (Claude Code)
- Small plans, immediate review, frequent commits
- Lowest overhead, highest human agency
When a feature is large or cross-cutting: Switch to Unified Context Stack (Harness 2)
- One agent, deep context
- Map first, then code
- Externalize memory in `AGENTS.md` and `notes/`
When you need overnight or parallel work: Spawn background agents via OpenClaw
- But treat them as draft generators, not autonomous engineers
- Morning review is mandatory
- Never auto-merge
Avoid: The full Coding Agent Army unless you genuinely have orchestration time to invest. The overhead often exceeds the value for solo developers.
Never skip: Verification hooks. AI writes, tests verify, human approves. That's the real 10x pattern.
Sources
- Anthropic Claude Code docs: https://code.claude.com/docs/en/memory, https://code.claude.com/docs/en/sub-agents, https://code.claude.com/docs/en/costs, https://code.claude.com/docs/en/hooks
- GitHub Copilot custom instructions: https://docs.github.com/en/copilot/how-tos/copilot-on-github/customize-copilot/add-custom-instructions/add-repository-instructions
- OpenAI Codex: https://developers.openai.com/codex/cloud
- Meta Engineering: https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/
- Stripe blog: https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations
- METR study (July 2025): https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- Uplevel: https://uplevelteam.com/blog/ai-for-developer-productivity
- GitClear 2025: https://www.gitclear.com/ai_assistant_code_quality_2025_research
- Stack Overflow (Feb 2026): https://stackoverflow.blog/2026/02/18/closing-the-developer-ai-trust-gap/
- Reddit anecdote: https://www.reddit.com/r/ExperiencedDevs/comments/1sskw4r/getting_more_calls_to_fix_ai_generated_codebases/