second-brain/04_Topics/AI-Assisted_Coding_Harnesses.md

AI-Assisted Coding Productivity Harnesses

Research date: 2026-04-22
Research method: Four parallel subagent investigations + cross-review synthesis
Sources: Anthropic, OpenAI, GitHub, Meta Engineering, Stripe, METR, Uplevel, GitClear, Stack Overflow


Executive Summary

This note compares four distinct productivity harnesses for solo developers using multiple AI coding tools (Codex, Claude Code, OpenCode, Gemini CLI, OpenRouter, OpenClaw). It's based on research into what actually works, what fails, and how experienced developers avoid the traps of "vibe coding."

Key finding: The strongest developers are not "vibe coding harder." They're building a small operating system around the models: instruction files, scoped memory, background workers, verification hooks, separate review, and compact session handoffs.

Critical warning: METR's July 2025 study of 16 experienced open-source developers found AI use made them 19% slower on real tasks, even though they expected 24% speedup and still felt 20% faster afterward. The illusion of productivity is real.


The Four Harnesses

Harness 1: The Coding Agent Army

Concept: Run a small team of specialized AI workers in parallel. You are the dispatcher and reviewer, not the typist.

Tools & Roles:

  • OpenClaw: orchestrator, routing, memory glue, background jobs
  • Codex: backend and refactor agent (parallel background execution)
  • Claude Code: frontend/UI and architecture-heavy agent
  • OpenCode: test-fix and repo-wide cleanup agent
  • Gemini CLI: long-context reader, docs digestion, codebase summarizer
  • OpenRouter: model switchboard for cheap classification/planning/second opinions

Agent Lineup:

  1. Frontend agent — UI components, styling, interaction bugs
  2. Backend agent — API routes, DB logic, services
  3. Test agent — unit/integration/e2e tests, repros, CI fixes
  4. Docs agent — changelogs, migration notes, onboarding docs
  5. Architect agent (optional) — no direct edits, only plans and reviews

Memory Strategy: Each agent gets its own brief system prompt, task scratchpad, known-files list, and small memory file (agents/frontend.md, agents/backend.md, etc.). Shared inputs: PROJECT.md for architecture, TASK.md for the ticket, DECISIONS.md for accepted choices. Rule: no agent gets the whole repo context by default.
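The "no full repo by default" rule can be enforced mechanically when assembling an agent's context. A minimal sketch in Python, illustrative only (`build_context` is our name, not part of any tool above; the file layout mirrors the conventions in this note):

```python
from pathlib import Path

# Shared inputs every specialist receives, per the memory strategy above.
SHARED = ["PROJECT.md", "TASK.md", "DECISIONS.md"]

def build_context(agent: str, known_files: list[str], root: str = ".") -> str:
    """Assemble one agent's context: its own memory file, the shared
    inputs, and only the files on its known-files list. Missing files
    are skipped; the whole repo is never included."""
    base = Path(root)
    parts = []
    for rel in [f"agents/{agent}.md", *SHARED, *known_files]:
        p = base / rel
        if p.is_file():
            parts.append(f"## {rel}\n{p.read_text()}")
    return "\n\n".join(parts)
```

The point of the sketch is the allowlist: an agent sees its role memory plus an explicit file bundle, nothing else.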

Workflow Example — "Add teams feature with invite flow":

  1. OpenClaw parses ticket, splits into UI/API/tests/docs
  2. Architect agent drafts task graph (API first, UI second, tests alongside, docs last)
  3. Backend agent adds POST /teams, invite token model, permission checks
  4. Frontend agent builds create-team modal and invite screen
  5. Test agent writes API tests and invite flow Playwright test
  6. Docs agent updates README, env vars, admin notes
  7. Orchestrator rebases outputs, resolves overlaps, runs lint/test
  8. Human reviews final diff, not every micro-step

Strengths:

  • Fastest for medium-to-large scoped work
  • Great when tasks decompose cleanly
  • Lets different models do what they're best at
  • Reduces "one agent forgot half the ticket" failure

Weaknesses:

  • Agents step on the same files
  • Inconsistent naming or architecture across agents
  • Duplicated logic across layers
  • Orchestration overhead kills speed on small tasks
  • Weak dispatcher means chaos

Token Efficiency:

  • Use OpenRouter cheap models for triage/routing
  • Use Gemini CLI only for repo summarization or large-doc ingestion
  • Give each specialist a file bundle, not full repo
  • Persist role memories so you don't re-explain conventions every run
  • Reserve expensive models for synthesis or tricky patches

Quality Maintenance:

  • One canonical DECISIONS.md
  • One merge gate: lint, tests, typecheck, formatting
  • Architect agent reviews cross-cutting consistency
  • Test agent must validate every nontrivial change
  • Human signs off on schema, auth, and UX changes

Opinionated Rule: Never let all agents write directly to main. They work in isolated branches or patch outputs, then one orchestrator composes.


Harness 2: The Unified Context Stack

Concept: Use one primary agent with one shared project memory so the system stays coherent. Optimize for understanding, not parallelism.

Tools & Roles:

  • OpenClaw: primary shell, memory manager, execution layer
  • Codex: main coding engine (background tasks, sandboxes)
  • Gemini CLI: massive-context reader when needed
  • Claude Code: only when you need subagents or deep repo analysis
  • OpenCode: only for specific language/tool gaps

Memory Strategy: Single source of truth:

  • AGENTS.md or CLAUDE.md — project constitution, coding standards, build/test commands
  • .github/instructions/*.instructions.md — path-scoped rules (backend vs frontend)
  • notes/current-task.md — objective, changed files, failing tests, next step
  • notes/decisions.md — non-obvious decisions and why
  • session_handoff.md — last decisions, blockers, next move

Meta's advice: concise navigation beats giant docs. They recommend 25-35 line context files. Anthropic says keep CLAUDE.md under 200 lines — large memory files hurt adherence.
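Those line budgets are cheap to enforce with a check in CI or a pre-commit hook. Illustrative sketch (`check_memory_file` is our name; the 200-line default mirrors the Anthropic guidance quoted above):

```python
def check_memory_file(text: str, limit: int = 200) -> tuple[bool, int]:
    """Return (ok, line_count) for a memory file, ignoring blank lines
    so whitespace costs nothing against the budget."""
    count = sum(1 for line in text.splitlines() if line.strip())
    return count <= limit, count
```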

Workflow Example:

  1. Read AGENTS.md, plan.md, todo.md, session_handoff.md
  2. Inspect changed files / git diff
  3. Ask agent: "Identify files, invariants, commands, failure modes. Do not edit."
  4. Save result in short task brief
  5. Then implement
  6. Run tests/lint/typecheck
  7. Update session_handoff.md with decisions and next steps

Strengths:

  • Deep coherence — one model maintains full context
  • No orchestration overhead
  • Simple mental model
  • Best for complex architectural work requiring continuity

Weaknesses:

  • Single point of failure
  • No parallelization
  • Context window limits on large codebases
  • One tool's blind spots become your blind spots

Token Efficiency:

  • Use /clear between unrelated tasks
  • Use /compact with focus instructions
  • Prefer Sonnet for most work, Haiku for simple subagents
  • Prefer CLI tools over MCP when possible (MCP tool listings add context overhead)
  • Use hooks/skills to preprocess huge outputs before model sees them

Quality Maintenance:

  • Force tests, linters, typecheck, diff review into loop
  • Separate implementation from validation
  • Require task-specific acceptance criteria
  • Use hooks to reject bad writes or auto-run checks
  • Post-edit hook: run formatter/linter/tests on touched files
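A post-edit hook can be as simple as mapping touched files to check commands. Sketch only: the `CHECKS` table and the ruff/eslint/tsc commands are placeholder assumptions; substitute your repo's actual toolchain:

```python
import fnmatch

# Hypothetical mapping from file patterns to verification commands.
CHECKS = [
    ("*.py",  ["ruff check {file}", "pytest --quiet"]),
    ("*.ts",  ["eslint {file}", "tsc --noEmit"]),
    ("*.tsx", ["eslint {file}", "tsc --noEmit"]),
]

def post_edit_commands(touched: list[str]) -> list[str]:
    """Given the files an agent just edited, return the deduplicated
    list of commands a post-edit hook should run on them."""
    cmds: list[str] = []
    for f in touched:
        for pattern, templates in CHECKS:
            if fnmatch.fnmatch(f, pattern):
                for t in templates:
                    cmd = t.format(file=f)
                    if cmd not in cmds:
                        cmds.append(cmd)
    return cmds
```

Deduplication matters: editing five Python files should trigger one test run, not five.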

Harness 3: The Human-in-the-Loop Pipeline

Concept: AI generates, human validates at key gates. Structured workflow: spec → AI draft → human review → AI refine → human approve. Emphasis on quality over speed.

Tools & Roles:

  • Claude Code: planning, architecture, spec generation
  • Codex: implementation in isolated branch/worktree
  • OpenCode/Gemini: secondary review or specific gap filling
  • OpenClaw: orchestration, task state management, gate enforcement

Workflow:

  1. Spec Gate (human writes/approves):

    • Clear acceptance criteria
    • Test strategy
    • Files expected to change
    • Architecture invariants not to break
  2. AI Draft (Codex or Claude Code):

    • Implement to spec
    • Run tests
    • Produce structured self-report: problem, root cause, files touched, tests added, risks
  3. Review Gate (human + AI review agent):

    • Human: does this match the spec?
    • AI reviewer: "critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions"
    • Must pass both
  4. Refine Loop (AI fixes, human re-reviews):

    • Max 2-3 iterations
    • If still not passing, escalate to human rewrite
  5. Merge Gate (human only):

    • Final approval
    • Especially for auth, schema, UX changes

Memory Strategy:

  • Spec lives in specs/YYYY-MM-DD_feature.md
  • Review feedback lives in reviews/
  • Decision log in DECISIONS.md
  • Each iteration updates session_handoff.md
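Scaffolding the spec file keeps the Spec Gate cheap to open. A minimal sketch; `create_spec` and `SPEC_TEMPLATE` are hypothetical names, and the four sections mirror the gate's checklist above:

```python
import datetime
import pathlib

SPEC_TEMPLATE = """# Spec: {feature}

## Acceptance criteria
- ...

## Test strategy
- ...

## Files expected to change
- ...

## Invariants not to break
- ...
"""

def create_spec(feature: str, day: datetime.date,
                root: str = ".") -> pathlib.Path:
    """Create specs/YYYY-MM-DD_feature.md pre-filled with the four
    Spec Gate sections, ready for the human to fill in and approve."""
    slug = feature.lower().replace(" ", "_")
    path = pathlib.Path(root) / "specs" / f"{day.isoformat()}_{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(SPEC_TEMPLATE.format(feature=feature))
    return path
```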

Strengths:

  • Highest quality output
  • Human maintains architectural ownership
  • Catches AI misunderstandings early
  • Builds trust through verification

Weaknesses:

  • Slowest of the four
  • Requires human availability at gates
  • Can feel like "AI-assisted bureaucracy" if gates are too heavy
  • Risk: human becomes bottleneck

Token Efficiency:

  • Invest tokens in spec clarity upfront (saves rewrite tokens later)
  • Use cheap model for first draft, expensive model for review
  • Cancel refinement loops early if direction is wrong

Quality Maintenance:

  • PR template section: "what was verified, with which command, on which inputs"
  • If no test exists, agent must add one or explain why not
  • Stripe found agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.
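Stripe's scripts are not public, but the idea is easy to sketch: generate data that deliberately includes the edge cases agents skip. Hypothetical example for the invite flow from Harness 1 (`make_invites` is our name):

```python
import random
import string

def make_invites(n: int, seed: int = 0) -> list[dict]:
    """Generate realistic-looking invite records for tests, including
    edge cases verification often misses: expired tokens and, for
    n >= 2, a duplicate email address."""
    rng = random.Random(seed)  # seeded for reproducible test runs
    invites = []
    for i in range(n):
        token = "".join(
            rng.choices(string.ascii_lowercase + string.digits, k=16))
        invites.append({
            "email": f"user{i % max(1, n - 1)}@example.com",  # wraps -> dup
            "token": token,
            "expired": rng.random() < 0.2,
        })
    return invites
```

A test suite fed this data has to handle duplicates and expiry, which is exactly what a "looks green" verification run never exercises.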

Harness 4: The Minimalist Vibe Coder

Concept: Fewest tools possible, maximum leverage. One primary tool, others only for specific gaps. Emphasis on developer judgment and taste.

Tools:

  • One primary: Claude Code or Codex (pick one, know it deeply)
  • One backup: OpenRouter for cheap second opinions or different model access
  • OpenClaw: only for orchestration when you need background tasks

Workflow:

  1. Start with human-written plan (5-10 lines)
  2. AI implements in small chunks (one file or one function at a time)
  3. Human reviews immediately (don't batch)
  4. Run tests after every meaningful change
  5. Commit frequently (micro-commits)
  6. If AI goes off track, reset (/clear) and restate the plan

Memory Strategy:

  • One AGENTS.md in repo root
  • One TODO.md for current task
  • Git history is your memory (small commits, clear messages)
  • No complex orchestration, no handoff files

Strengths:

  • Lowest overhead
  • Fastest for small-to-medium tasks
  • Human maintains full context
  • No tool-bloat confusion
  • Best for experienced developers with strong taste

Weaknesses:

  • No parallelization
  • No background work
  • Harder for large, multi-file features
  • Requires strong human judgment to know when AI is wrong

Token Efficiency:

  • Most efficient — no orchestration tokens, no context replication
  • Only pay for actual coding
  • Reset context aggressively between tasks

Quality Maintenance:

  • Human is the quality filter
  • Frequent commits = easy rollback
  • Small changes = easy review
  • Strong test discipline

Comparative Analysis

| Dimension | Agent Army | Unified Context | Human-in-Loop | Minimalist |
| --- | --- | --- | --- | --- |
| Speed | Fastest for large work | Medium | Slowest | Fastest for small work |
| Quality | Medium (needs orchestrator) | High | Highest | High (human-dependent) |
| Complexity | High | Medium | Medium | Low |
| Token Cost | Highest | Medium | Medium | Lowest |
| Setup Time | High | Medium | Medium | None |
| Best For | Large features, migrations | Complex architecture | Safety-critical code | Daily dev, quick wins |
| Failure Mode | Orchestration chaos | Context window limits | Human bottleneck | Human gets lazy |

What Actually Makes You a 10x Engineer

The research was clear: output volume ≠ productivity. Here's what actually matters:

1. Mastering Memory (Layered External Memory)

"Mastering memory" is not about magical long-term recall. It's about building a layered external memory system the agent can reload cheaply and consistently.

The four layers:

| Layer | Purpose | Example Files |
| --- | --- | --- |
| Global | Personal preferences, cross-project conventions | `~/.config/AGENT.md` |
| Repo-wide | Project constitution, build/test commands | `AGENTS.md`, `CLAUDE.md` |
| Subsystem | Domain/path-specific rules | `packages/api/AGENTS.md`, `.cursor/rules/backend.mdc` |
| Session | Active task state, handoffs | `notes/current-task.md`, `session_handoff.md` |

Key principles:

  • Keep each layer small and specific (root files < 200 lines)
  • Use imports (@AGENTS.md) to avoid duplication
  • Move deterministic behavior into hooks/scripts, not prompts
  • Prefer retrieval/search over full-repo preload for large codebases

Meta's finding: Precompute context with specialized agents first, then let execution agents work from that map. They used 50+ specialized agents to build concise context artifacts and got ~40% fewer tool calls per task.

2. Warm-Starting Sessions

Strong pattern for starting a new session:

git status
git diff --stat
cat AGENTS.md CLAUDE.md
cat notes/current-task.md notes/session_handoff.md
rg -n "TODO|FIXME|HACK" .

Then ask: "Based on these files, what are we doing and what's the next step?" This gives the agent exactly the context it needs, nothing more.

3. Token Budget Management

Proven patterns:

  • Reset aggressively: Use /clear between unrelated tasks
  • Compact with focus: Use /compact with specific instructions ("keep only the auth-related context")
  • Scope instruction files: Keep root files short, move detailed rules into path-scoped files
  • Prefer CLI over MCP: MCP tool listings add context overhead; use CLI tools when possible
  • Model selection: Sonnet for most work, Haiku for simple subagents, Opus only for synthesis
  • Avoid re-feeding: Don't paste the same repo context into multiple tools

Anthropic's explicit recommendation: target under 200 lines per CLAUDE.md. Large files hurt adherence.

4. Verification Hooks (Not Trust)

AI writes, systems verify. Proven guardrails:

  • Force tests, linters, typecheck, diff review into the loop
  • Separate implementation from validation
  • Post-edit hook: run formatter/linter/tests on touched files
  • PR template: "what was verified, with which command, on which inputs"
  • If no test exists, agent must add one or explain why not

Stripe's finding: Agents often "passed" tasks by doing invalid verification. Better runs wrote scripts to generate realistic test data.
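A merge gate can aggregate these checks instead of trusting the agent's self-report. Minimal sketch (`run_gate` is a hypothetical helper; feed it whatever commands your repo uses):

```python
import subprocess

def run_gate(commands: list[list[str]]) -> dict:
    """Run each verification command and collect failures; the gate
    passes only if every command exits 0."""
    failures = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failures.append({"cmd": cmd, "stderr": proc.stderr.strip()})
    return {"passed": not failures, "failures": failures}
```

The structured failure list doubles as input for the next refinement loop: the agent gets the stderr, not a vague "tests failed".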

5. Map First, Then Code

Before editing, ask: "Identify files, invariants, commands, failure modes. Do not edit." Save result in short task brief. Then implement. This prevents the most expensive failure mode: rewriting the same code 3 times because the agent misunderstood the architecture.

6. Isolate Noisy Work

Spawn subagents for tests, logs, docs search. Return only failing cases, stack traces, or summary. Don't flood the main thread with raw output.
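The filtering step can be sketched as a simple line filter, assuming pytest-style output; the marker list is a heuristic, not a spec, and `summarize_test_output` is our name:

```python
def summarize_test_output(raw: str, max_lines: int = 40) -> str:
    """Keep only failure-relevant lines from raw test output so the
    main thread sees stack traces, not thousands of passing lines."""
    markers = ("FAILED", "ERROR", "Traceback", "AssertionError")
    kept = [line for line in raw.splitlines()
            if any(m in line for m in markers)]
    if not kept:
        return "all checks passed"
    return "\n".join(kept[:max_lines])
```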

7. Review with a Different Agent

"Critique this diff for maintainability, hidden coupling, missing tests, unsafe assumptions." The reviewer/author separation catches hallucinated fixes and hidden damage.

8. Compact Session Handoffs

Store only: objective, touched files, commands run, current failure, next move. New session starts from that, not old chat. Use /clear between unrelated tasks.
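The handoff can be generated so it never bloats. Illustrative only (`write_handoff` is our name; the five fields are the ones listed above):

```python
def write_handoff(objective: str, touched: list[str], commands: list[str],
                  failure: str, next_move: str) -> str:
    """Render the five-field session handoff as markdown; a new session
    starts from this short brief instead of replaying the old chat."""
    return "\n".join([
        "# Session Handoff",
        f"## Objective\n{objective}",
        "## Touched files\n" + "\n".join(f"- {f}" for f in touched),
        "## Commands run\n" + "\n".join(f"- {c}" for c in commands),
        f"## Current failure\n{failure or 'none'}",
        f"## Next move\n{next_move}",
    ])
```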


Failure Modes to Avoid

The Illusion of Productivity

  • METR study: AI made experienced developers 19% slower on real tasks
  • Uplevel study: 41% more bugs in Copilot group, little productivity gain
  • GitClear 2025: copy/paste exceeded moved/refactored code for first time
  • Stack Overflow 2026: 84% using AI, only 29% trust it

Context Fragmentation

Same task split across Cursor, Copilot, Claude, ChatGPT — each with different partial context. Result: tool disagreement, prompt drift, lost provenance, duplicate review burden.

Vibe Coding Debt

Code that demos well but isn't built like a system:

  • Auth logic that "works" but is unsafe
  • Broad catch blocks and silent failure
  • Bloated schemas with no domain fit
  • Inconsistent architectural style across files
  • Generated comments masking weak reasoning

Token Waste Patterns

  • Re-feeding same repo context into several tools
  • Asking for large rewrites before pinning requirements
  • Generating full files when diff or function-level patch would do
  • Using chat models for validation instead of running tests
  • Bouncing between tools for "second opinions"

Cross-Tool Standard (AGENTS.md)

The emerging standard is hierarchical instruction files that work across tools:

repo/
├── AGENTS.md                    # universal/cross-tool project constitution
├── CLAUDE.md                    # imports AGENTS.md, adds Claude-specific notes
├── .cursor/
│   └── rules/
│       ├── backend.mdc           # path-scoped rules (auto-attached by glob)
│       └── frontend.mdc
├── .github/
│   ├── copilot-instructions.md  # repo-wide instructions
│   └── instructions/
│       ├── backend.instructions.md    # path-scoped (applyTo: "**/*.py")
│       └── typescript.instructions.md # path-scoped (applyTo: "**/*.ts")
├── packages/
│   ├── api/
│   │   └── AGENTS.md            # subsystem rules (nearest takes precedence)
│   └── web/
│       └── AGENTS.md
├── notes/
│   ├── current-task.md          # objective, files, failure, next step
│   ├── decisions.md             # non-obvious choices and why
│   └── session_handoff.md       # last decisions, blockers
├── specs/                       # human-approved specs
├── reviews/                     # AI review outputs
└── scripts/
    ├── verify.sh                # test + lint + typecheck
    └── pre-commit.sh            # auto-run checks

Why This Structure Works

  • Stable memory (repo-wide): AGENTS.md — coding standards, build commands, architecture invariants
  • Scoped memory (path-specific): .cursor/rules/*.mdc, .github/instructions/*.md — backend/frontend conventions
  • Volatile memory (session): notes/current-task.md, session_handoff.md — active task state
  • Retrieval over replay: New sessions load only the 20-line task brief, not 50k tokens of old chat

Tool-Specific File Formats

| File | Tool | Scope | Notes |
| --- | --- | --- | --- |
| `AGENTS.md` | Cross-tool | Repo-wide | Universal standard, keep < 200 lines |
| `CLAUDE.md` | Claude Code | Repo-wide | Imports AGENTS.md via `@AGENTS.md`. Keep < 200 lines. Large files hurt adherence. |
| `.cursorrules` | Cursor | Repo-wide | Legacy, being replaced by `.cursor/rules/*.mdc` |
| `.cursor/rules/*.mdc` | Cursor | Path-scoped | Auto-attached by glob. Can nest in subdirectories. |
| `.github/copilot-instructions.md` | Copilot | Repo-wide | One per repo |
| `.github/instructions/*.instructions.md` | Copilot | Path-scoped | Uses frontmatter: `applyTo: "**/*.ts"` |
| `AGENT.md` | Emerging | Any | Proposed standard with merge behavior (nearest wins) |

Example AGENTS.md (Minimal, Effective)

# AGENTS.md

## Commands
- Install: `pnpm install`
- Test: `pnpm test`
- Lint: `pnpm lint`
- Typecheck: `pnpm typecheck`

## Guardrails
- Never edit `src/generated/**`
- Prefer `rg` over slower recursive search
- Run tests relevant to changed files before finishing

## Architecture
- API handlers: `packages/api/src/handlers`
- UI components: `packages/web/src/components`
- Shared schemas: `packages/shared/src/schema`

## Workflow
- For multi-file changes, update `session_handoff.md`
- Record architectural decisions in `notes/decisions/`

Example session_handoff.md

# Session Handoff

## Last completed
- Migrated auth middleware to token refresh flow

## Decisions
- Keep REST externally, internal services move to RPC
- Do not delete old middleware until admin routes migrated

## Next
- Update admin route guards
- Add integration tests for expired-token refresh

## Blockers
- CI sandbox rate limit on auth provider

Path-Scoped Rules Example

.github/instructions/typescript.instructions.md:

---
applyTo: "**/*.ts,**/*.tsx"
---
- Use zod for runtime validation
- Prefer functional components
- No implicit any

.cursor/rules/backend.mdc:

---
description: Backend API conventions
globs: "packages/api/**/*.py"
---
- Use FastAPI dependency injection
- All handlers must have type hints
- Return Pydantic models, not dicts
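How a tool resolves which path-scoped rules attach to a file can be sketched with glob matching. Illustrative only: real tools have their own matchers (Python's `fnmatch` treats `**` the same as `*`), and `attached_rules` is our name; the comma-separated glob format mirrors Copilot's `applyTo` frontmatter:

```python
import fnmatch

def attached_rules(path: str, rules: dict[str, str]) -> list[str]:
    """Given a file path and {rule_name: comma-separated globs},
    return the names of rule files that attach to that path."""
    hits = []
    for name, globs in rules.items():
        for g in globs.split(","):
            if fnmatch.fnmatch(path, g.strip()):
                hits.append(name)
                break  # one matching glob is enough to attach
    return hits
```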

My Recommendation for Claudio

Given your preference for direct, high-leverage work and your experience level, I'd suggest a hybrid approach:

Default: Minimalist Vibe Coder (Harness 4)

  • One primary tool (Claude Code)
  • Small plans, immediate review, frequent commits
  • Lowest overhead, highest human agency

When a feature is large or cross-cutting: Switch to Unified Context Stack (Harness 2)

  • One agent, deep context
  • Map first, then code
  • Externalize memory in AGENTS.md and notes/

When you need overnight or parallel work: Spawn background agents via OpenClaw

  • But treat them as draft generators, not autonomous engineers
  • Morning review is mandatory
  • Never auto-merge

Avoid: The full Coding Agent Army unless you genuinely have orchestration time to invest. The overhead often exceeds the value for solo developers.

Never skip: Verification hooks. AI writes, tests verify, human approves. That's the real 10x pattern.


Sources