# AI-Assisted Coding Productivity Harnesses

> Research date: 2026-04-22
> Research method: Four parallel subagent investigations + cross-review synthesis
> Sources: Anthropic, OpenAI, GitHub, Meta Engineering, Stripe, METR, Uplevel, GitClear, Stack Overflow

---

## Executive Summary

This note compares four distinct productivity harnesses for solo developers using multiple AI coding tools (Codex, Claude Code, OpenCode, Gemini CLI, OpenRouter, OpenClaw). It's based on research into what actually works, what fails, and how experienced developers avoid the traps of "vibe coding."

**Key finding**: The strongest developers are not "vibe coding harder." They're building a small operating system around the models: instruction files, scoped memory, background workers, verification hooks, separate review, and compact session handoffs.

**Critical warning**: METR's July 2025 study of 16 experienced open-source developers found AI use made them **19% slower** on real tasks, even though they expected a 24% speedup and still *felt* 20% faster afterward. The illusion of productivity is real.

---

## The Four Harnesses

### Harness 1: The Coding Agent Army

**Concept**: Run a small team of specialized AI workers in parallel. You are the dispatcher and reviewer, not the typist.

**Tools & Roles**:
- **OpenClaw**: orchestrator, routing, memory glue, background jobs
- **Codex**: backend and refactor agent (parallel background execution)
- **Claude Code**: frontend/UI and architecture-heavy agent
- **OpenCode**: test-fix and repo-wide cleanup agent
- **Gemini CLI**: long-context reader, docs digestion, codebase summarizer
- **OpenRouter**: model switchboard for cheap classification/planning/second opinions

**Agent Lineup**:
1. **Frontend agent** — UI components, styling, interaction bugs
2. **Backend agent** — API routes, DB logic, services
3. **Test agent** — unit/integration/e2e tests, repros, CI fixes
4. **Docs agent** — changelogs, migration notes, onboarding docs
5. **Architect agent** (optional) — no direct edits, only plans and reviews

**Memory Strategy**:
Each agent gets its own brief system prompt, task scratchpad, known-files list, and small memory file (`agents/frontend.md`, `agents/backend.md`, etc.). Shared inputs: `PROJECT.md` for architecture, `TASK.md` for the ticket, `DECISIONS.md` for accepted choices. Rule: no agent gets the whole repo context by default.

**Workflow Example** — "Add teams feature with invite flow":
1. OpenClaw parses the ticket, splits it into UI/API/tests/docs
2. Architect agent drafts the task graph (API first, UI second, tests alongside, docs last)
3. Backend agent adds `POST /teams`, invite token model, permission checks
4. Frontend agent builds the create-team modal and invite screen
5. Test agent writes API tests and an invite-flow Playwright test
6. Docs agent updates README, env vars, admin notes
7. Orchestrator rebases outputs, resolves overlaps, runs lint/test
8. Human reviews the final diff, not every micro-step

**Strengths**:
- Fastest for medium-to-large scoped work
- Great when tasks decompose cleanly
- Lets different models do what they're best at
- Reduces the "one agent forgot half the ticket" failure mode

**Weaknesses**:
- Agents step on the same files
- Inconsistent naming or architecture across agents
- Duplicated logic across layers
- Orchestration overhead kills speed on small tasks
- A weak dispatcher means chaos

**Token Efficiency**:
- Use OpenRouter cheap models for triage/routing (see the sketch after this list)
- Use Gemini CLI only for repo summarization or large-doc ingestion
- Give each specialist a file bundle, not the full repo
- Persist role memories so you don't re-explain conventions every run
- Reserve expensive models for synthesis or tricky patches
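
As a concrete shape for that first bullet: OpenRouter exposes an OpenAI-compatible chat completions endpoint, so a triage call can be a plain `curl`. A minimal sketch; the model ID and the classification prompt are illustrative, not a recommendation:

```bash
# Cheap triage before any expensive agent wakes up.
# Model ID is illustrative; pick any low-cost model from openrouter.ai/models.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-8b-instruct",
    "messages": [
      {"role": "system", "content": "Classify this ticket as frontend, backend, tests, or docs. Reply with one word."},
      {"role": "user", "content": "Invite emails are not sent when a team admin adds a member."}
    ]
  }'
```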

**Quality Maintenance**:
- One canonical `DECISIONS.md`
- One merge gate: lint, tests, typecheck, formatting
- Architect agent reviews cross-cutting consistency
- Test agent must validate every nontrivial change
- Human signs off on schema, auth, and UX changes

**Opinionated Rule**: Never let all agents write directly to `main`. They work in isolated branches or patch outputs, then one orchestrator composes.
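
Git worktrees are one way to implement that rule without juggling clones. A sketch, with branch and directory names invented for the example:

```bash
# One worktree per agent: parallel edits never touch main or each other.
git worktree add ../proj-backend  -b agent/backend
git worktree add ../proj-frontend -b agent/frontend
git worktree add ../proj-tests    -b agent/tests

# ...each agent works in its own directory...

# The orchestrator composes the results one branch at a time:
git checkout -b feature/teams
git merge --no-ff agent/backend
git merge --no-ff agent/frontend   # overlapping files surface as conflicts here
git merge --no-ff agent/tests

git worktree remove ../proj-backend   # repeat for the others once merged
```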

---

### Harness 2: The Unified Context Stack

**Concept**: Use one primary agent with one shared project memory so the system stays coherent. Optimize for understanding, not parallelism.

**Tools & Roles**:
- **OpenClaw**: primary shell, memory manager, execution layer
- **Codex**: main coding engine (background tasks, sandboxes)
- **Gemini CLI**: massive-context reader when needed
- **Claude Code**: only when you need subagents or deep repo analysis
- **OpenCode**: only for specific language/tool gaps

**Memory Strategy**:
Single source of truth:
- `AGENTS.md` or `CLAUDE.md` — project constitution, coding standards, build/test commands
- `.github/instructions/*.instructions.md` — path-scoped rules (backend vs frontend; example below)
- `notes/current-task.md` — objective, changed files, failing tests, next step
- `notes/decisions.md` — non-obvious decisions and why
- `session_handoff.md` — last decisions, blockers, next move

Meta's advice: concise navigation beats giant docs. They recommend **25-35 line** context files. Anthropic says keep `CLAUDE.md` under **200 lines** — large memory files hurt adherence.
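
A path-scoped rule file, as a sketch. The `applyTo` front matter follows GitHub's repository custom instructions docs (cited under Sources); the glob and the rules themselves are hypothetical:

```markdown
---
applyTo: "src/server/**"
---
# Backend rules (hypothetical example)
- Validate all route input before it reaches the DB layer.
- New endpoints need an API test in `tests/api/` in the same PR.
- Run `./scripts/verify.sh` after any change under `src/server/`.
```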

**Workflow Example**:
1. Read `AGENTS.md`, `plan.md`, `todo.md`, `session_handoff.md`
2. Inspect changed files / git diff
3. Ask the agent: "Identify files, invariants, commands, failure modes. Do not edit."
4. Save the result in a short task brief
5. Then implement
6. Run tests/lint/typecheck
7. Update `session_handoff.md` with decisions and next steps

**Strengths**:
- Deep coherence — one model maintains full context
- No orchestration overhead
- Simple mental model
- Best for complex architectural work requiring continuity

**Weaknesses**:
- Single point of failure
- No parallelization
- Context window limits on large codebases
- One tool's blind spots become your blind spots

**Token Efficiency**:
- Use `/clear` between unrelated tasks
- Use `/compact` with focus instructions (example after this list)
- Prefer Sonnet for most work, Haiku for simple subagents
- Prefer CLI tools over MCP when possible (MCP tool listings add context overhead)
- Use hooks/skills to preprocess huge outputs before the model sees them
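
In Claude Code these are built-in slash commands, and `/compact` accepts optional focus instructions (per the Anthropic docs under Sources). A usage sketch, with the focus text invented:

```
/clear
/compact Keep the auth refactor decisions and the names of failing tests
```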

**Quality Maintenance**:
- Force tests, linters, typecheck, and diff review into the loop
- Separate implementation from validation
- Require task-specific acceptance criteria
- Use hooks to reject bad writes or auto-run checks
- Post-edit hook: run formatter/linter/tests on touched files

---

### Harness 3: The Human-in-the-Loop Pipeline

**Concept**: AI generates, human validates at key gates. Structured workflow: spec → AI draft → human review → AI refine → human approve. Emphasis on quality over speed.

**Tools & Roles**:
- **Claude Code**: planning, architecture, spec generation
- **Codex**: implementation in an isolated branch/worktree
- **OpenCode/Gemini**: secondary review or specific gap filling
- **OpenClaw**: orchestration, task state management, gate enforcement

**Workflow**:
1. **Spec Gate** (human writes/approves):
   - Clear acceptance criteria
   - Test strategy
   - Files expected to change
   - Architecture invariants not to break

2. **AI Draft** (Codex or Claude Code):
   - Implement to spec
   - Run tests
   - Produce a structured self-report: problem, root cause, files touched, tests added, risks

3. **Review Gate** (human + AI review agent):
   - Human: does this match the spec?
   - AI reviewer: "critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions"
   - Must pass both

4. **Refine Loop** (AI fixes, human re-reviews):
   - Max 2-3 iterations
   - If still not passing, escalate to a human rewrite

5. **Merge Gate** (human only):
   - Final approval
   - Especially for auth, schema, and UX changes

**Memory Strategy**:
- Spec lives in `specs/YYYY-MM-DD_feature.md` (skeleton below)
- Review feedback lives in `reviews/`
- Decision log in `DECISIONS.md`
- Each iteration updates `session_handoff.md`
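
A spec skeleton matching the sections named in the Spec Gate. The feature, dates, and file names are hypothetical, reusing the teams/invite example from Harness 1:

```markdown
# specs/2026-04-22_team-invites.md

## Acceptance criteria
- POST /teams creates a team and returns 201 with the team id
- Expired invite tokens return 410, never 200

## Test strategy
- API tests for create/invite/accept; one Playwright run of the invite screen

## Files expected to change
- src/server/routes/teams.ts, src/models/invite.ts, tests/api/teams.test.ts

## Invariants not to break
- No schema change to `users`; auth middleware stays untouched
```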

**Strengths**:
- Highest quality output
- Human maintains architectural ownership
- Catches AI misunderstandings early
- Builds trust through verification

**Weaknesses**:
- Slowest of the four
- Requires human availability at gates
- Can feel like "AI-assisted bureaucracy" if gates are too heavy
- Risk: the human becomes the bottleneck

**Token Efficiency**:
- Invest tokens in spec clarity upfront (saves rewrite tokens later)
- Use a cheap model for the first draft, an expensive model for review
- Cancel refinement loops early if the direction is wrong

**Quality Maintenance**:
- PR template section: "what was verified, with which command, on which inputs"
- If no test exists, the agent must add one or explain why not
- Stripe found agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.

---

### Harness 4: The Minimalist Vibe Coder

**Concept**: Fewest tools possible, maximum leverage. One primary tool, others only for specific gaps. Emphasis on developer judgment and taste.

**Tools**:
- **One primary**: Claude Code or Codex (pick one, know it deeply)
- **One backup**: OpenRouter for cheap second opinions or different model access
- **OpenClaw**: only for orchestration when you need background tasks

**Workflow**:
1. Start with a human-written plan (5-10 lines)
2. AI implements in small chunks (one file or one function at a time)
3. Human reviews immediately (don't batch)
4. Run tests after every meaningful change
5. Commit frequently (micro-commits; see the sketch after this list)
6. If the AI goes off track, reset (`/clear`) and restate the plan
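
Steps 3-5 as shell commands. `git add -p` keeps the review honest by forcing a look at every hunk; the commit message and the `verify.sh` gate script are invented for the example:

```bash
# Stage the agent's output hunk by hunk, verify, then micro-commit.
git add -p
./scripts/verify.sh          # test + lint + typecheck (see file layout below)
git commit -m "teams: add invite token model"

# Agent went off track? Drop the last micro-commit and restate the plan.
git reset --hard HEAD~1
```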

**Memory Strategy**:
- One `AGENTS.md` in the repo root
- One `TODO.md` for the current task
- Git history is your memory (small commits, clear messages)
- No complex orchestration, no handoff files

**Strengths**:
- Lowest overhead
- Fastest for small-to-medium tasks
- Human maintains full context
- No tool-bloat confusion
- Best for experienced developers with strong taste

**Weaknesses**:
- No parallelization
- No background work
- Harder for large, multi-file features
- Requires strong human judgment to know when the AI is wrong

**Token Efficiency**:
- Most efficient — no orchestration tokens, no context replication
- Only pay for actual coding
- Reset context aggressively between tasks

**Quality Maintenance**:
- Human is the quality filter
- Frequent commits = easy rollback
- Small changes = easy review
- Strong test discipline

---

## Comparative Analysis

| Dimension | Agent Army | Unified Context | Human-in-Loop | Minimalist |
|-----------|-----------|-----------------|---------------|------------|
| **Speed** | Fastest for large work | Medium | Slowest | Fastest for small work |
| **Quality** | Medium (needs orchestrator) | High | Highest | High (human-dependent) |
| **Complexity** | High | Medium | Medium | Low |
| **Token Cost** | Highest | Medium | Medium | Lowest |
| **Setup Time** | High | Medium | Medium | None |
| **Best For** | Large features, migrations | Complex architecture | Safety-critical code | Daily dev, quick wins |
| **Failure Mode** | Orchestration chaos | Context window limits | Human bottleneck | Human gets lazy |

---

## What Actually Makes You a 10x Engineer

The research was clear: output volume ≠ productivity. Here's what actually matters:

### 1. Mastering Memory (Not Hoarding Context)

The best pattern is **layered memory**, not chat-history hoarding:

- **Stable memory**: coding standards, commands, architecture invariants (in `AGENTS.md`)
- **Scoped memory**: subtree/domain rules (in `.github/instructions/*.md`)
- **Volatile memory**: current task summary, open decisions, blockers, next step (in `notes/current-task.md`)
- **Retrieval over replay**: load a 20-line task brief, not 50k tokens of old chat

**Meta's finding**: Precompute context with specialized agents first, then let execution agents work from that map. They used 50+ specialized agents to build concise context artifacts and got ~40% fewer tool calls per task.

### 2. Verification Hooks (Not Trust)

AI writes, systems verify. Proven guardrails:
- Force tests, linters, typecheck, and diff review into the loop
- Separate implementation from validation
- Post-edit hook: run formatter/linter/tests on touched files (config sketch below)
- PR template: "what was verified, with which command, on which inputs"
- If no test exists, the agent must add one or explain why not
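
A post-edit hook in Claude Code terms, since its hooks are the ones documented under Sources. This sketch lives in `.claude/settings.json` and follows the shape in the hooks docs (a `PostToolUse` event with a tool-name matcher); verify the current schema before relying on it, and note that `scripts/verify.sh` is the hypothetical gate script from the layout below:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "./scripts/verify.sh" }
        ]
      }
    ]
  }
}
```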

**Stripe's finding**: Agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.

### 3. Map First, Then Code

Before editing, ask: "Identify files, invariants, commands, failure modes. Do not edit." Save the result in a short task brief. Then implement. This prevents the most expensive failure mode: rewriting the same code three times because the agent misunderstood the architecture.

### 4. Isolate Noisy Work

Spawn subagents for tests, logs, and docs search. Return only failing cases, stack traces, or a summary. Don't flood the main thread with raw output.
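
Even without a subagent, a plain shell filter achieves the same compression. A sketch; the failure markers depend on your test runner, and the npm project here is hypothetical:

```bash
# Run the full suite, but hand the agent only the failures.
npm test 2>&1 | grep -E -A 5 "FAIL|✕|Error:" | head -100 > notes/failing-tests.txt
```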

### 5. Review with a Different Agent

"Critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions." The reviewer/author separation catches hallucinated fixes and hidden damage.

### 6. Compact Session Handoffs

Store only: objective, touched files, commands run, current failure, next move. A new session starts from that, not from old chat. Use `/clear` between unrelated tasks.
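
Those five fields, filled in as a hypothetical `session_handoff.md` (contents invented, continuing the invite example):

```markdown
Objective: expired invite tokens must return 410
Touched:   src/server/routes/teams.ts, src/models/invite.ts
Commands:  ./scripts/verify.sh, npx playwright test invite.spec.ts
Failure:   invite.spec.ts:42 expects 410, gets 404
Next:      check token expiry in acceptInvite() before the DB lookup
```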

---

## Failure Modes to Avoid

### The Illusion of Productivity
- **METR study**: AI made experienced developers 19% slower on real tasks
- **Uplevel study**: 41% more bugs in the Copilot group, little productivity gain
- **GitClear 2025**: copy/pasted code exceeded moved/refactored code for the first time
- **Stack Overflow 2026**: 84% of developers use AI, but only 29% trust it

### Context Fragmentation
Same task split across Cursor, Copilot, Claude, ChatGPT — each with different partial context. Result: tool disagreement, prompt drift, lost provenance, duplicate review burden.

### Vibe Coding Debt
Code that demos well but isn't built like a system:
- Auth logic that "works" but is unsafe
- Broad catch blocks and silent failure
- Bloated schemas with no domain fit
- Inconsistent architectural style across files
- Generated comments masking weak reasoning

### Token Waste Patterns
- Re-feeding the same repo context into several tools
- Asking for large rewrites before pinning down requirements
- Generating full files when a diff or function-level patch would do
- Using chat models for validation instead of running tests
- Bouncing between tools for "second opinions"

---

## Recommended File Layout

```
repo/
├── AGENTS.md                    # project constitution (keep < 200 lines)
├── .github/
│   ├── copilot-instructions.md
│   └── instructions/
│       ├── backend.instructions.md
│       └── frontend.instructions.md
├── notes/
│   ├── current-task.md          # objective, files, failure, next step
│   ├── decisions.md             # non-obvious choices and why
│   └── session_handoff.md       # last decisions, blockers
├── specs/                       # human-approved specs
├── reviews/                     # AI review outputs
└── scripts/
    ├── verify.sh                # test + lint + typecheck
    └── pre-commit.sh            # auto-run checks
```
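
The `verify.sh` referenced throughout might be as small as this. The npm scripts are placeholders for whatever your project's lint/typecheck/test commands actually are:

```bash
#!/usr/bin/env bash
# scripts/verify.sh -- the single merge gate: fail fast on the first broken check.
set -euo pipefail

npm run lint
npm run typecheck
npm test
echo "verify: all checks passed"
```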

---

## My Recommendation for Claudio

Given your preference for direct, high-leverage work and your experience level, I'd suggest a **hybrid approach**:

**Default**: Minimalist Vibe Coder (Harness 4)
- One primary tool (Claude Code)
- Small plans, immediate review, frequent commits
- Lowest overhead, highest human agency

**When a feature is large or cross-cutting**: Switch to the Unified Context Stack (Harness 2)
- One agent, deep context
- Map first, then code
- Externalize memory in `AGENTS.md` and `notes/`

**When you need overnight or parallel work**: Spawn background agents via OpenClaw
- But treat them as draft generators, not autonomous engineers
- Morning review is mandatory
- Never auto-merge

**Avoid**: The full Coding Agent Army, unless you genuinely have orchestration time to invest. The overhead often exceeds the value for solo developers.

**Never skip**: Verification hooks. AI writes, tests verify, human approves. That's the real 10x pattern.

---

## Sources

- Anthropic Claude Code docs: https://code.claude.com/docs/en/memory, https://code.claude.com/docs/en/sub-agents, https://code.claude.com/docs/en/costs, https://code.claude.com/docs/en/hooks
- GitHub Copilot custom instructions: https://docs.github.com/en/copilot/how-tos/copilot-on-github/customize-copilot/add-custom-instructions/add-repository-instructions
- OpenAI Codex: https://developers.openai.com/codex/cloud
- Meta Engineering: https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/
- Stripe blog: https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations
- METR study (July 2025): https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- Uplevel: https://uplevelteam.com/blog/ai-for-developer-productivity
- GitClear 2025: https://www.gitclear.com/ai_assistant_code_quality_2025_research
- Stack Overflow (Feb 2026): https://stackoverflow.blog/2026/02/18/closing-the-developer-ai-trust-gap/
- Reddit anecdote: https://www.reddit.com/r/ExperiencedDevs/comments/1sskw4r/getting_more_calls_to_fix_ai_generated_codebases/