# AI Coding Harness Patterns - 2026-04-22
## Executive summary

Two answers.

### 1. Should you set up a separate OpenClaw software-engineer agent with its own GitHub account to work overnight?

Yes, but only in a narrow form.

It is **not** a good pattern as a general autonomous software engineer. It **is** a good pattern as a **bounded overnight batch worker** or **constrained janitor bot** that only takes on small, sharply scoped, objectively testable tasks and opens draft pull requests for review.

The core issue is not whether the model can write code. It can. The real question is whether the harness turns that ability into useful leverage instead of review debt, security risk, and noisy PR spam.

### 2. What agentic coding patterns actually work?

The winning pattern is not "pick one magical coding agent." It is:

- clear task envelopes
- an explicit plan before edits for non-trivial work
- durable repo memory for stable conventions
- strict verification and review gates
- selective use of parallel agents
- optional routing across models, only after the basics are stable

If you want the short recommendation:

1. Start with a **spec-first single-agent loop**
2. Add **durable repo memory and reusable skills**
3. Add **selective manager + specialist subagents**
4. Add **router-based multi-model optimization** only when the rest is already working
5. Add an **overnight GitHub agent last**, after the loop is proven safe and boring

---
## Part I. Is the separate overnight GitHub engineer-agent a good pattern?

## Short answer

**Yes, under strict conditions. No, as a general autonomy pattern.**

The right mental model is:

- **bad model:** autonomous engineer
- **good model:** constrained overnight batch worker

It is productive when it does low-risk maintenance and hands you small reviewable PRs in the morning.
It is unproductive when it generates ambiguous, sprawling, or high-risk diffs that you have to mentally reconstruct from scratch.

---
## Where this pattern works

This works best for:

- narrow bug fixes
- tests
- docs
- lint/type cleanup
- repetitive mechanical refactors
- small devex improvements
- issues that a human could solve in roughly 30 to 90 minutes
- tasks with clear acceptance criteria and deterministic checks

It works especially well when:

- the repo already has decent tests
- CI is trustworthy
- issue quality is high
- coding conventions are documented
- changes are local and reversible

---
## Where this pattern fails

This pattern becomes bad fast when the issue queue includes:

- architecture work
- ambiguous product decisions
- broad refactors
- auth / billing / permissions / secrets
- CI security or GitHub Actions logic
- migrations
- infra / deployment
- concurrency or performance-critical code
- weakly tested codepaths
- highly tribal codebases

The hidden failure mode is not just wrong code. It is:

- plausible-looking diffs with hidden damage
- too many mediocre PRs
- review fatigue
- permission mistakes
- workflow security mistakes
- erosion of trust in the agent’s output

---
## The strict conditions required

You should only do this if all of the following are true:

### Scope control

- issues are pre-shaped into small, reviewable units
- each issue has explicit acceptance criteria
- task size is capped by files changed and/or a line budget
- forbidden directories and forbidden task classes are enforced

### Permission control

- the agent has its own GitHub identity
- it cannot push to protected branches
- it cannot merge
- it cannot deploy
- it cannot rotate secrets
- it cannot access more repos than necessary
- ideally it uses a low-privilege bot or app identity rather than a powerful personal account

### Execution control

- every run is time-boxed
- every run is budget-capped
- output is fully logged
- work happens in isolated branches or disposable environments
- the agent can open PRs, comments, and artifacts, but cannot perform irreversible actions

### Verification control

- deterministic CI is required
- lint, tests, and type checks are required where relevant
- the PR template requires a summary, tests run, limitations, and unresolved questions
- draft PRs are the default initially
- human review is always required

### Governance control

- there is a labeled queue of agent-safe issues
- there is a reviewer responsible for merge decisions
- there are stop conditions if quality drops

---
## Recommended rollout policy

If you try this, start tighter than you think you need to.

### Suggested initial policy

- max **1 PR per night per repo**
- **draft PRs only** at the start
- labeled **agent-safe** issues only
- forbidden zones: auth, billing, infra, secrets, permissions, CI security, migrations, concurrency
- small diff budget, for example roughly under 150 changed lines and under 5 files unless explicitly approved
- required human review before any merge
- no auto-merge
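A pre-review gate for this policy can be sketched in a few lines. The forbidden path prefixes and limits below are illustrative assumptions, not canonical values:

```python
# Hypothetical policy gate: flag agent PRs that exceed the diff budget or
# touch forbidden zones before a human even looks at them.
FORBIDDEN_PREFIXES = ("auth/", "billing/", "infra/", ".github/workflows/", "migrations/")
MAX_CHANGED_LINES = 150
MAX_CHANGED_FILES = 5

def within_policy(changed_files: dict[str, int]) -> tuple[bool, list[str]]:
    """changed_files maps path -> lines changed. Returns (ok, violations)."""
    violations = []
    if len(changed_files) > MAX_CHANGED_FILES:
        violations.append(f"too many files: {len(changed_files)} > {MAX_CHANGED_FILES}")
    total = sum(changed_files.values())
    if total > MAX_CHANGED_LINES:
        violations.append(f"diff too large: {total} > {MAX_CHANGED_LINES} lines")
    for path in changed_files:
        if path.startswith(FORBIDDEN_PREFIXES):
            violations.append(f"forbidden zone touched: {path}")
    return (not violations, violations)
```

Wiring this into CI as a required check keeps the budget enforced mechanically rather than by reviewer memory.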
### Suggested kill criteria

Pause or narrow the system if any of these happen too often:

- too many PRs closed unmerged
- too many PRs needing major rewrites
- a rising revert rate
- reviewer time per merged PR is too high
- repeated violations of scope boundaries
- noisy PR generation outpacing your review bandwidth

A practical threshold: if roughly 20 to 30 percent of PRs are getting rejected, reverted, or heavily rewritten, the system is probably not paying for itself.
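That threshold is easy to automate from PR outcome labels; the outcome vocabulary and the 25% cutoff (the midpoint of the 20-30% band) are assumptions you would tune:

```python
def should_pause(pr_outcomes: list[str], threshold: float = 0.25) -> bool:
    """pr_outcomes: one label per agent PR, e.g. 'merged', 'rejected',
    'reverted', or 'rewritten'. Pause the system when the share of bad
    outcomes reaches the threshold."""
    if not pr_outcomes:
        return False
    bad = sum(o in ("rejected", "reverted", "rewritten") for o in pr_outcomes)
    return bad / len(pr_outcomes) >= threshold
```

Running this weekly over the agent's recent PRs gives an objective stop condition instead of a vibes-based one.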
## My recommendation on the overnight agent

**Do it last, not first.**

I would not begin by creating a second OpenClaw agent and letting it roam GitHub overnight. That is backwards.

First prove that your coding loop can already do this reliably in a supervised way:

1. understand a bounded issue
2. propose a plan
3. change only the intended surface
4. run checks
5. explain what it changed
6. fail closed when uncertain

Once that loop is boring and reliable, then move it into GitHub PR automation, and only later schedule it unattended.

So the final answer is:

- **good pattern eventually:** yes
- **good first move now:** no

---
## Part II. The four harness approaches that actually work

Below are the four most credible operating models, ranked in the order I would recommend building them.

---

## Approach 1. Spec-first single-agent loop

### What it is

A single coding agent works in a disciplined sequence:

1. explore relevant code
2. propose a plan
3. name touched files and risks
4. implement
5. run checks
6. self-review
7. hand off for human review
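The sequence above can be sketched as a single driver function. The `agent` object and its method names (`explore`, `plan`, `implement`, `run_checks`, `self_review`) are illustrative assumptions, not any product's real API:

```python
# A minimal sketch of the seven-step spec-first loop as a driver around a
# hypothetical agent object.
def run_spec_first_loop(agent, task: str) -> dict:
    context = agent.explore(task)                 # 1. explore relevant code
    plan = agent.plan(task, context)              # 2-3. plan, touched files, risks
    if not plan.get("files") or plan.get("risk") == "high":
        # fail closed: escalate to a human instead of guessing
        return {"status": "escalate", "plan": plan}
    diff = agent.implement(plan)                  # 4. implement
    checks = agent.run_checks()                   # 5. lint / tests / type checks
    review = agent.self_review(diff, checks)      # 6. self-review
    return {                                      # 7. hand off for human review
        "status": "ready_for_human_review",
        "plan": plan, "diff": diff, "checks": checks, "review": review,
    }
```

The key design choice is the early exit: an empty file list or a high-risk plan stops the run before any edit is made.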
### Why it works

This is the best default because it minimizes chaos.
It keeps architectural control with you, keeps prompts compact, and produces diffs that are easier to review.

### Strengths

- best baseline maintainability
- low orchestration overhead
- easy to debug
- strong review quality
- good token efficiency compared to chaotic long chats

### Weaknesses

- not ideal for trivial fixes, because planning overhead can dominate
- limited throughput compared with parallel subagents
- still bounded by one context window

### Memory and token strategy

- keep prompt structure stable: goal, constraints, success criteria
- reference exact files, not whole-repo dumps
- store stable repo rules outside the prompt in durable files
- reset sessions when they become bloated

### When to use it

This should be your default for most serious coding work.

### Verdict

**Rank: #1**

This is the best first system to build because it teaches the right habits and exposes failures early.

---
## Approach 2. Durable-context harness with repo memory files and reusable skills

### What it is

You externalize stable instructions into repo-level memory and workflow artifacts. Examples:

- coding rules
- architecture constraints
- common commands
- PR checklist
- migration checklist
- review checklist
- task-specific reusable skills/playbooks

The point is to stop re-explaining the same things every session.
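One way to make such files pay off is to assemble them into a compact preamble at session start. The filenames and the character budget below are an assumed convention, not a required layout:

```python
from pathlib import Path

# Illustrative memory-file convention; any stable set of repo docs works.
MEMORY_FILES = ["CONVENTIONS.md", "ARCHITECTURE.md", "COMMANDS.md", "PITFALLS.md"]

def build_preamble(repo_root: str, max_chars: int = 8000) -> str:
    """Concatenate durable memory files into one prompt preamble,
    skipping missing files and truncating to a character budget so
    the preamble stays compact across sessions."""
    parts = []
    for name in MEMORY_FILES:
        path = Path(repo_root) / name
        if path.is_file():
            parts.append(f"## {name}\n{path.read_text().strip()}")
    return "\n\n".join(parts)[:max_chars]
```

Because the loader is dumb and deterministic, everything interesting lives in the files themselves, which is exactly where you want the maintenance effort to go.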
### Why it works

This is where compounding begins.
Instead of stuffing rules into prompts, you build a persistent operating system for the agent.
That improves consistency, reduces token waste, and preserves maintainability.

### Strengths

- best long-term token efficiency
- best repeatability
- reduces prompt drift
- captures tribal knowledge in reusable form
- improves onboarding of both humans and agents

### Weaknesses

- requires maintenance discipline
- stale or bad memory files can encode bad behavior repeatedly
- temptation to over-document everything

### Memory and token strategy

Only store stable, high-value information, such as:

- architecture decisions
- allowed commands
- coding standards
- test commands
- repo boundaries
- repeated pitfalls

Do **not** store transient debugging chatter or giant raw transcripts.

### When to use it

As soon as you have repeated workflows or repeated repo conventions.

### Verdict

**Rank: #2**

It is arguably the highest long-term-ROI layer, but it works best after the single-agent loop is already disciplined.

---
## Approach 3. Manager + specialist subagents

### What it is

One orchestrator manages multiple bounded workers, for example:

- reconnaissance agent
- implementation agent
- test-writing agent
- review or regression agent

Each worker gets a narrow brief and a small context packet.
The manager integrates outputs and applies the final gate.

### Why it works

Parallelism helps when the work is truly separable.
It is powerful for larger tasks, repo analysis, multi-option design, and review.

### Strengths

- highest throughput when decomposition is clean
- better division of labor
- natural place for review agents
- good for comparative research and implementation planning

### Weaknesses

- coordination overhead
- context drift between agents
- duplicate exploration if briefs are sloppy
- easy to create token waste
- easier to lose architectural coherence

### Memory and token strategy

- use narrow briefs
- pass summaries, not full transcripts
- keep a shared contract: scope, files, acceptance criteria, stop conditions
- avoid overlapping edit surfaces unless one manager owns the merge logic
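The shared contract can be made explicit as a small immutable record that every worker receives; the field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Brief:
    """One subagent's contract: what it may do and when it must stop."""
    scope: str                        # one-sentence task statement
    files: tuple[str, ...]            # the only files this worker may edit
    acceptance: tuple[str, ...]       # deterministic checks that must pass
    stop_conditions: tuple[str, ...] = ("uncertain", "out_of_scope")

    def allows_edit(self, path: str) -> bool:
        return path in self.files
```

Making the brief frozen means a worker cannot quietly widen its own scope mid-run; any change has to go back through the manager.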
### When to use it

Once you already have a strong baseline loop and clear task decomposition.

### Verdict

**Rank: #3**

Very powerful, but not a good first system. It amplifies both strengths and weaknesses.

---
## Approach 4. Router harness with cheap models for exploration and strong models for commit-worthy diffs

### What it is

You route different task types to different models. A typical split:

- cheap/fast model for search, repo mapping, summarization, logs, triage
- strong model for implementation and review of commit-worthy diffs
- optional alternate model for independent review

### Why it works

Not all work deserves premium tokens.
Routing can massively improve cost efficiency if the handoffs are clean.

### Strengths

- best cost/performance potential
- flexible across providers
- useful when model strengths shift quickly
- avoids wasting premium models on repo spelunking

### Weaknesses

- more harness complexity
- risk of lossy handoffs between models
- easy to optimize for price instead of code quality
- harder to debug when failures come from routing logic rather than model behavior

### Memory and token strategy

- summarize exploration into compact handoff packets
- use provider-specific caching/checkpointing where available
- keep routing logic explicit and debuggable
- route by task class, not by vibe
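Explicit routing can be as simple as a task-class table with a fail-closed default; the model names are placeholders, not recommendations:

```python
# Route by task class, not vibe: a plain dict is easy to audit and debug.
ROUTES = {
    "search": "cheap-fast-model",
    "repo_map": "cheap-fast-model",
    "summarize": "cheap-fast-model",
    "triage": "cheap-fast-model",
    "implement": "strong-model",
    "review_diff": "strong-model",
    "independent_review": "alternate-model",
}

def route(task_class: str) -> str:
    """Fail closed: an unknown task class gets the strong model rather
    than silently being sent to a cheap one."""
    return ROUTES.get(task_class, "strong-model")
```

Keeping the table in one place also makes it trivial to log which model handled which task class when you debug a routing failure.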
### When to use it

After your basic workflow, memory discipline, and verification stack are already solid.

### Verdict

**Rank: #4**

Useful, but I would add it later. It solves a real problem, but only after you’ve earned the complexity.

---
## Ranked recommendation

### First: spec-first single-agent loop

Because it is the simplest reliable foundation and the easiest to reason about.

### Second: durable repo memory and reusable skills

Because this is where long-term consistency and token efficiency compound.

### Third: manager + specialist subagents

Because selective parallelism becomes powerful once your baseline discipline is already strong.

### Fourth: router harness

Because it adds complexity for cost/performance gains that matter more later than early.

---
## Recommended stack shape

If I were designing your system, I would think in layers.

### Layer 1. Operator and orchestration surface

Use **OpenClaw** as the control plane.
It is a strong place to launch, route, supervise, and review work.
But it should not be the place where giant raw coding context accumulates forever.

### Layer 2. Primary disciplined coding loop

Use a strong coding agent in a spec-first loop. The important thing is less the brand and more the discipline:

- plan first
- narrow context
- explicit constraints
- exact success criteria
- deterministic checks

### Layer 3. Durable memory

Use repo files and reusable playbooks/skills to capture:

- architecture boundaries
- coding norms
- test commands
- task checklists
- forbidden moves

### Layer 4. Review and verification gate

Always require:

- tests
- lint/type checks where relevant
- a concise change explanation
- an unresolved-questions section
- human review of design quality, not just green CI

### Layer 5. Selective parallelism

Add subagents only where decomposition is clean:

- reconnaissance
- implementation
- testing
- review

### Layer 6. Cost/performance routing

Add multi-model routing only when you are feeling real cost or latency pain.

### Layer 7. Overnight GitHub agent

Add the scheduled GitHub worker last, after the above is already stable.

---
## What actually makes someone feel 10x here

Not raw code generation.

The real leverage comes from building a system that is better than your unaided default at:

- slicing work into reviewable units
- preserving intent and constraints across sessions
- keeping token usage under control
- making verification automatic
- catching regressions early
- reducing context rebuild cost
- preventing the same mistakes from recurring
- turning ambiguous ideas into explicit specs quickly

The "10x" feeling is usually the result of:

- less re-explaining
- less context loss
- less thrash
- less low-value coding
- more parallel reconnaissance
- better morning review packets

That is a harness design problem more than a model problem.

---
## Concrete implementation plan I would recommend

### Phase 1. Build the supervised baseline

Build a local or chat-triggered harness that can:

1. ingest a task
2. generate a short plan
3. identify touched files
4. implement in a branch/worktree
5. run checks
6. summarize changes and risks

Do this with one agent first.

### Phase 2. Add repo memory

Create durable files for:

- coding conventions
- architecture notes
- commands
- test procedures
- review checklist
- forbidden zones
- issue templates for agent-safe tasks

### Phase 3. Add review discipline

Require each agent run to produce:

- a change summary
- tests run
- unresolved questions
- a rollback note
- a self-critique

### Phase 4. Add selective subagents

Only after the first three phases are working, add bounded workers for:

- reconnaissance
- implementation
- test generation
- review

### Phase 5. Add GitHub PR automation

Make it human-invoked first. Have it open draft PRs only.

### Phase 6. Add unattended overnight scheduling

Only after human-invoked PR automation becomes boring and reliable. Start with one PR per night.

---
## Final recommendation

### On the second OpenClaw software-engineer agent

**Yes, but later, and with hard limits.**

Good version:

- low-privilege bot identity
- agent-safe issue queue
- small draft PRs
- deterministic CI
- strict forbidden zones
- human merge gate
- one PR per night to start

Bad version:

- broad autonomy
- vague issues
- architecture work
- sensitive repos or workflows
- direct merge/deploy power
- large overnight diff generation

### On the broader vibecoding harness

Build this in order:

1. **spec-first single-agent loop**
2. **durable repo memory + reusable skills**
3. **manager + specialist subagents**
4. **router harness**
5. **overnight GitHub agent**

That order gives you the best chance of gaining real leverage instead of building a very expensive PR noise machine.

---
## Sources

- OpenAI Codex best practices: https://developers.openai.com/codex/learn/best-practices
- OpenAI Codex CLI: https://developers.openai.com/codex/cli
- OpenAI Codex cloud: https://developers.openai.com/codex/cloud
- OpenAI Codex workflows: https://developers.openai.com/codex/workflows
- OpenAI on SWE-bench Verified contamination: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
- GitHub Copilot coding agent: https://docs.github.com/en/copilot/concepts/about-assigning-tasks-to-copilot
- GitHub Copilot agent risks and mitigations: https://docs.github.com/copilot/concepts/agents/coding-agent/risks-and-mitigations
- GitHub responsible use, code review: https://docs.github.com/en/copilot/responsible-use/code-review
- GitHub required status checks: https://docs.github.com/articles/about-required-status-checks
- GitHub CODEOWNERS: https://docs.github.com/articles/about-code-owners
- GitHub Actions workflow events: https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows
- GitHub Security Lab, preventing pwn requests: https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/
- GitHub merge queue: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/incorporating-changes-from-a-pull-request/merging-a-pull-request-with-a-merge-queue
- GitHub push protection: https://docs.github.com/en/code-security/secret-scanning/introduction/about-push-protection
- GitHub artifact attestations: https://docs.github.com/actions/concepts/security/artifact-attestations
- Anthropic Claude Code best practices: https://code.claude.com/docs/en/best-practices
- Anthropic Claude Code overview: https://code.claude.com/docs/en/overview
- OpenCode docs: https://opencode.ai/docs
- OpenRouter quickstart: https://openrouter.ai/docs/quickstart
- OpenClaw docs: https://docs.openclaw.ai
- Devin best practices: https://docs.devin.ai/enterprise/best-Practices
- Devin, working with Devin: https://docs.devin.ai/work-with-devin