AI Coding Harness Patterns - 2026-04-22

Executive summary

Two answers.

1. Should you set up a separate OpenClaw software-engineer agent with its own GitHub account to work overnight?

Yes, but only in a narrow form.

It is not a good pattern as a general autonomous software engineer. It is a good pattern as a bounded overnight batch worker or constrained janitor bot that only takes on small, sharply scoped, objectively testable tasks and opens draft pull requests for review.

The core issue is not whether the model can write code. It can. The real question is whether the harness turns that ability into useful leverage instead of review debt, security risk, and noisy PR spam.

2. What agentic coding patterns actually work?

The winning pattern is not "pick one magical coding agent." It is:

  • clear task envelopes
  • explicit plan before edit for non-trivial work
  • durable repo memory for stable conventions
  • strict verification and review gates
  • selective use of parallel agents
  • optional routing across models only after the basics are stable

If you want the short recommendation:

  1. Start with a spec-first single-agent loop
  2. Add durable repo memory and reusable skills
  3. Add selective manager + specialist subagents
  4. Add router-based multi-model optimization only when the rest is already working
  5. Add an overnight GitHub agent last, after the loop is proven safe and boring

Part I. Is the separate overnight GitHub engineer-agent a good pattern?

Short answer

Yes, under strict conditions. No, as a general autonomy pattern.

The right mental model is:

  • bad model: autonomous engineer
  • good model: constrained overnight batch worker

It is productive when it does low-risk maintenance and hands you small reviewable PRs in the morning. It is unproductive when it generates ambiguous, sprawling, or high-risk diffs that you have to mentally reconstruct from scratch.


Where this pattern works

This works best for:

  • narrow bug fixes
  • tests
  • docs
  • lint/type cleanup
  • repetitive mechanical refactors
  • small devex improvements
  • issues that a human could likely solve in roughly 30 to 90 minutes
  • tasks with clear acceptance criteria and deterministic checks

It works especially well when:

  • the repo already has decent tests
  • CI is trustworthy
  • issue quality is high
  • coding conventions are documented
  • changes are local and reversible

Where this pattern fails

This pattern becomes bad fast when the issue queue includes:

  • architecture work
  • ambiguous product decisions
  • broad refactors
  • auth / billing / permissions / secrets
  • CI security or GitHub Actions logic
  • migrations
  • infra / deployment
  • concurrency or performance-critical code
  • weakly tested codepaths
  • highly tribal codebases

The hidden failure mode is not just wrong code. It is:

  • plausible-looking diffs with hidden damage
  • too many mediocre PRs
  • review fatigue
  • permission mistakes
  • workflow security mistakes
  • erosion of trust in the agent's output

The strict conditions required

You should only do this if all of the following are true:

Scope control

  • issues are pre-shaped into small, reviewable units
  • each issue has explicit acceptance criteria
  • task size is capped by files changed and/or line budget
  • forbidden directories and forbidden task classes are enforced

Permission control

  • the agent has its own GitHub identity
  • it cannot push to protected branches
  • it cannot merge
  • it cannot deploy
  • it cannot rotate secrets
  • it cannot access more repos than necessary
  • ideally it uses a low-privilege bot or app identity rather than a powerful personal account

Execution control

  • every run is time-boxed
  • every run is budget-capped
  • output is fully logged
  • work happens in isolated branches or disposable environments
  • the agent can open PRs, comments, and artifacts, but not perform irreversible actions

Verification control

  • deterministic CI is required
  • lint, tests, and type checks are required where relevant
  • PR template requires summary, tests run, limitations, and unresolved questions
  • draft PRs are the default initially
  • human review is always required

Governance control

  • there is a labeled queue of agent-safe issues
  • there is a reviewer responsible for merge decisions
  • there are stop conditions if quality drops

If you try this, start tighter than you think necessary.

Suggested initial policy

  • max 1 PR per night per repo
  • draft PRs only at the start
  • labeled agent-safe issues only
  • forbidden zones: auth, billing, infra, secrets, permissions, CI security, migrations, concurrency
  • small diff budget, for example roughly under 150 changed lines and under 5 files unless explicitly approved
  • required human review before any merge
  • no auto-merge
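The policy above stays aspirational unless it is enforced mechanically. One way is to encode the limits as data and check every candidate change set before a PR is opened. A minimal sketch; the zone list, limits, and function names are illustrative, not from any specific tool:

```python
# Hypothetical policy gate: check a candidate change set before opening a PR.
FORBIDDEN_ZONES = ("auth/", "billing/", "infra/", "secrets/",
                   ".github/workflows/", "migrations/")

POLICY = {
    "max_changed_lines": 150,  # small diff budget
    "max_changed_files": 5,
    "max_prs_per_night": 1,
}

def violates_policy(changed_files: dict[str, int], prs_opened_tonight: int) -> list[str]:
    """Return human-readable violations; an empty list means the change may proceed.

    changed_files maps file path -> lines changed in that file.
    """
    violations = []
    if prs_opened_tonight >= POLICY["max_prs_per_night"]:
        violations.append("nightly PR budget exhausted")
    if len(changed_files) > POLICY["max_changed_files"]:
        violations.append(f"too many files changed: {len(changed_files)}")
    total = sum(changed_files.values())
    if total > POLICY["max_changed_lines"]:
        violations.append(f"diff too large: {total} lines")
    for path in changed_files:
        if path.startswith(FORBIDDEN_ZONES):  # str.startswith accepts a tuple of prefixes
            violations.append(f"forbidden zone touched: {path}")
    return violations
```

The point of returning a list rather than a boolean is that the nightly report can show exactly which rule was tripped, which matters when you are tuning the budget later.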

Suggested kill criteria

Pause or narrow the system if any of these happen too often:

  • too many PRs closed unmerged
  • too many PRs need major rewrite
  • revert rate rises
  • reviewer time per merged PR is too high
  • repeated violations of scope boundaries
  • noisy PR generation outpaces your review bandwidth

A practical threshold: if roughly 20 to 30 percent of PRs are getting rejected, reverted, or heavily rewritten, the system is probably not paying for itself.
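That threshold can be tracked mechanically from PR outcomes. A sketch of the waste-rate check implied above; the outcome labels and the 25 percent default are illustrative choices within the 20 to 30 percent band:

```python
def pr_waste_rate(prs: list[dict]) -> float:
    """Fraction of agent PRs that were rejected, reverted, or heavily rewritten."""
    if not prs:
        return 0.0
    wasted = sum(1 for pr in prs
                 if pr["outcome"] in {"closed_unmerged", "reverted", "major_rewrite"})
    return wasted / len(prs)

def should_pause(prs: list[dict], threshold: float = 0.25) -> bool:
    """Pause or narrow the overnight agent when the waste rate crosses the threshold."""
    return pr_waste_rate(prs) >= threshold
```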


My recommendation on the overnight agent

Do it last, not first.

I would not begin by creating a second OpenClaw agent and letting it roam GitHub overnight. That is backwards.

First prove that your coding loop can already do this reliably in a supervised way:

  1. understand a bounded issue
  2. propose a plan
  3. change only the intended surface
  4. run checks
  5. explain what it changed
  6. fail closed when uncertain

Once that loop is boring and reliable, then move it into GitHub PR automation, and only later schedule it unattended.
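The fail-closed step is the one most harnesses skip. The gating logic of the supervised loop can be sketched as a pure decision function over the artifacts the harness already has; the field names and statuses here are illustrative, not an API:

```python
def run_supervised(plan: dict, diff: dict, checks_passed: bool) -> dict:
    """Decide what to do with a supervised agent run's output.

    plan/diff are plain summaries the harness produced earlier; this function
    only encodes the gates: fail closed on uncertainty, reject scope creep,
    require passing checks before handing off for review.
    """
    if plan["open_questions"]:  # fail closed when uncertain: ask, do not guess
        return {"status": "needs_human", "questions": plan["open_questions"]}
    extra = sorted(set(diff["files"]) - set(plan["files"]))
    if extra:  # changed more than the intended surface
        return {"status": "scope_violation", "extra": extra}
    if not checks_passed:
        return {"status": "checks_failed"}
    return {"status": "ready_for_review"}
```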

So the final answer is:

  • good pattern eventually: yes
  • good first move now: no

Part II. The four harness approaches that actually work

Below are the four most credible operating models, ranked in the order I would recommend building them.


Approach 1. Spec-first single-agent loop

What it is

A single coding agent works in a disciplined sequence:

  1. explore relevant code
  2. propose a plan
  3. name touched files and risks
  4. implement
  5. run checks
  6. self-review
  7. hand off for human review
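Steps 2 and 3 produce a concrete plan artifact before any edit happens. One possible shape for it, with a reviewability check attached; all fields and defaults are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Plan artifact the agent must produce before editing (steps 2-3 above)."""
    goal: str
    files: list[str]                 # exact surfaces it intends to touch
    risks: list[str]
    acceptance_criteria: list[str]
    checks: list[str] = field(default_factory=lambda: ["lint", "tests"])

    def is_reviewable(self, max_files: int = 5) -> bool:
        """A plan is reviewable only if it is bounded and verifiable."""
        return 0 < len(self.files) <= max_files and bool(self.acceptance_criteria)
```

A plan that fails `is_reviewable` goes back for re-scoping instead of straight to implementation, which is exactly where this loop earns its keep.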

Why it works

This is the best default because it minimizes chaos. It keeps architectural control with you, keeps prompts compact, and produces diffs that are easier to review.

Strengths

  • best baseline maintainability
  • low orchestration overhead
  • easy to debug
  • strong review quality
  • good token efficiency compared to chaotic long chats

Weaknesses

  • not ideal for trivial fixes because planning overhead can dominate
  • limited throughput compared with parallel subagents
  • still bounded by one context window

Memory and token strategy

  • keep prompt structure stable: goal, constraints, success criteria
  • reference exact files, not whole-repo dumps
  • store stable repo rules outside the prompt in durable files
  • reset sessions when they become bloated
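The stable prompt skeleton from the first bullet can be a small builder so every session gets the same structure: goal, constraints, success criteria, and exact file references instead of whole-repo dumps. A sketch; the section headers are an illustrative convention:

```python
def build_task_prompt(goal: str, constraints: list[str],
                      success: list[str], files: list[str]) -> str:
    """Assemble the stable prompt structure used for every task, so the shape
    never drifts from session to session."""
    sections = [
        "## Goal\n" + goal,
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Success criteria\n" + "\n".join(f"- {s}" for s in success),
        "## Files in scope\n" + "\n".join(f"- {f}" for f in files),
    ]
    return "\n\n".join(sections)
```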

When to use it

This should be your default for most serious coding work.

Verdict

Rank: #1

This is the best first system to build because it teaches the right habits and exposes failures early.


Approach 2. Durable-context harness with repo memory files and reusable skills

What it is

You externalize stable instructions into repo-level memory and workflow artifacts. Examples:

  • coding rules
  • architecture constraints
  • common commands
  • PR checklist
  • migration checklist
  • review checklist
  • task-specific reusable skills/playbooks

The point is to stop re-explaining the same things every session.

Why it works

This is where compounding begins. Instead of stuffing rules into prompts, you build a persistent operating system for the agent. That improves consistency, reduces token waste, and preserves maintainability.

Strengths

  • best long-term token efficiency
  • best repeatability
  • reduces prompt drift
  • captures tribal knowledge in reusable form
  • improves onboarding of both humans and agents

Weaknesses

  • requires maintenance discipline
  • stale or bad memory files can encode bad behavior repeatedly
  • temptation to over-document everything

Memory and token strategy

Only store stable, high-value information such as:

  • architecture decisions
  • allowed commands
  • coding standards
  • test commands
  • repo boundaries
  • repeated pitfalls

Do not store transient debugging chatter or giant raw transcripts.

When to use it

As soon as you have repeated workflows or repeated repo conventions.

Verdict

Rank: #2

It is arguably the highest long-term ROI layer, but it works best after the single-agent loop is already disciplined.


Approach 3. Manager + specialist subagents

What it is

One orchestrator manages multiple bounded workers, for example:

  • reconnaissance agent
  • implementation agent
  • test-writing agent
  • review or regression agent

Each worker gets a narrow brief and a small context packet. The manager integrates outputs and applies the final gate.
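A minimal shape for that handoff: each worker receives only a narrow brief and returns a summary, never a transcript, and the manager applies the final gate. All names here are illustrative; `run_worker` stands in for whatever actually calls a model:

```python
from dataclasses import dataclass

@dataclass
class Brief:
    """Small context packet a worker receives: scope, files, acceptance
    criteria, and stop conditions."""
    role: str
    scope: str
    files: list[str]
    acceptance: list[str]
    stop_conditions: list[str]

def run_pipeline(briefs: list[Brief], run_worker) -> dict:
    """Manager loop: dispatch each bounded worker, collect summaries
    (not transcripts), and apply the final gate."""
    summaries = {}
    for brief in briefs:
        summaries[brief.role] = run_worker(brief)  # worker sees only its brief
    ok = all(s.get("status") == "ok" for s in summaries.values())
    return {"approved": ok, "summaries": summaries}
```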

Why it works

Parallelism helps when the work is truly separable. It is powerful for larger tasks, repo analysis, multi-option design, and review.

Strengths

  • highest throughput when decomposition is clean
  • better division of labor
  • natural place for review agents
  • good for comparative research and implementation planning

Weaknesses

  • coordination overhead
  • context drift between agents
  • duplicate exploration if briefs are sloppy
  • easy to create token waste
  • easier to lose architectural coherence

Memory and token strategy

  • use narrow briefs
  • pass summaries, not full transcripts
  • keep a shared contract: scope, files, acceptance criteria, stop conditions
  • avoid overlapping edit surfaces unless one manager owns the merge logic

When to use it

Once you already have a strong baseline loop and clear task decomposition.

Verdict

Rank: #3

Very powerful, but not a good first system. It amplifies both strengths and weaknesses.


Approach 4. Router harness with cheap models for exploration and strong models for commit-worthy diffs

What it is

You route different task types to different models. Typical split:

  • cheap/fast model for search, repo mapping, summarization, logs, triage
  • strong model for implementation and review of commit-worthy diffs
  • optional alternate model for independent review

Why it works

Not all work deserves premium tokens. Routing can massively improve cost efficiency if the handoffs are clean.

Strengths

  • best cost/performance potential
  • flexible across providers
  • useful when model strengths shift quickly
  • avoids wasting premium models on repo spelunking

Weaknesses

  • more harness complexity
  • risk of lossy handoffs between models
  • easy to optimize for price instead of code quality
  • harder to debug when failures come from routing logic rather than model behavior

Memory and token strategy

  • summarize exploration into compact handoff packets
  • use provider-specific caching/checkpointing where available
  • keep routing logic explicit and debuggable
  • route by task class, not vibe
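Routing by task class rather than vibe can be as blunt as a lookup table. The model names below are placeholders, not recommendations; the one design choice worth copying is that unknown classes fail toward the strong model, so a misroute degrades cost rather than code quality:

```python
# Hypothetical task-class router: cheap model for exploration, strong model
# for commit-worthy diffs, an alternate model for independent review.
ROUTES = {
    "search": "cheap-fast-model",
    "repo_map": "cheap-fast-model",
    "summarize": "cheap-fast-model",
    "triage": "cheap-fast-model",
    "implement": "strong-model",
    "review_diff": "independent-review-model",
}

def route(task_class: str) -> str:
    """Pick a model by task class; unknown classes default to the strong model."""
    return ROUTES.get(task_class, "strong-model")
```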

When to use it

After your basic workflow, memory discipline, and verification stack are already solid.

Verdict

Rank: #4

Useful, but I would add it later. It solves a real problem, but only after you've earned the complexity.


Ranked recommendation

First: spec-first single-agent loop

Because it is the simplest reliable foundation and the easiest to reason about.

Second: durable repo memory and reusable skills

Because this is where long-term consistency and token efficiency compound.

Third: manager + specialist subagents

Because selective parallelism becomes powerful once your baseline discipline is already strong.

Fourth: router harness

Because it adds complexity for cost/performance gains that matter more later than early.


If I were designing your system, I would think in layers.

Layer 1. Operator and orchestration surface

Use OpenClaw as the control plane. It is a strong place to launch, route, supervise, and review work. But it should not be the place where giant raw coding context accumulates forever.

Layer 2. Primary disciplined coding loop

Use a strong coding agent in a spec-first loop. The important thing is less the brand and more the discipline:

  • plan first
  • narrow context
  • explicit constraints
  • exact success criteria
  • deterministic checks

Layer 3. Durable memory

Use repo files and reusable playbooks/skills to capture:

  • architecture boundaries
  • coding norms
  • test commands
  • task checklists
  • forbidden moves

Layer 4. Review and verification gate

Always require:

  • tests
  • lint/type checks where relevant
  • concise change explanation
  • unresolved-question section
  • human review of design quality, not just green CI
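The template half of this gate can be enforced mechanically before a human ever looks at the PR. A sketch that checks a PR description for the mandatory sections; the header strings are an illustrative convention:

```python
REQUIRED_SECTIONS = ("## Summary", "## Tests run", "## Limitations",
                     "## Unresolved questions")

def missing_sections(pr_body: str) -> list[str]:
    """Return the required PR-template sections absent from the description.

    A complete template plus green CI is the floor, not the bar: a human still
    reviews design quality.
    """
    return [s for s in REQUIRED_SECTIONS if s not in pr_body]
```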

Layer 5. Selective parallelism

Add subagents only where decomposition is clean:

  • reconnaissance
  • implementation
  • testing
  • review

Layer 6. Cost/performance routing

Add multi-model routing only when you are feeling real cost or latency pain.

Layer 7. Overnight GitHub agent

Add the scheduled GitHub worker last, after the above is already stable.


What actually makes someone feel 10x here

Not raw code generation.

The real leverage comes from building a system that is better than your unaided default at:

  • slicing work into reviewable units
  • preserving intent and constraints across sessions
  • keeping token usage under control
  • making verification automatic
  • catching regressions early
  • reducing context rebuild cost
  • preventing the same mistakes from recurring
  • turning ambiguous ideas into explicit specs quickly

The "10x" feeling is usually the result of:

  • less re-explaining
  • less context loss
  • less thrash
  • less low-value coding
  • more parallel reconnaissance
  • better morning review packets

That is a harness design problem more than a model problem.


Concrete implementation plan I would recommend

Phase 1. Build the supervised baseline

Build a local or chat-triggered harness that can:

  1. ingest a task
  2. generate a short plan
  3. identify touched files
  4. implement in a branch/worktree
  5. run checks
  6. summarize changes and risks

Do this with one agent first.

Phase 2. Add repo memory

Create durable files for:

  • coding conventions
  • architecture notes
  • commands
  • test procedures
  • review checklist
  • forbidden zones
  • issue templates for agent-safe tasks

Phase 3. Add review discipline

Require each agent run to produce:

  • change summary
  • tests run
  • unresolved questions
  • rollback note
  • self-critique

Phase 4. Add selective subagents

Only after the first three are working, add bounded workers for:

  • reconnaissance
  • implementation
  • test generation
  • review

Phase 5. Add GitHub PR automation

Make it human-invoked first. Have it open draft PRs only.

Phase 6. Add unattended overnight scheduling

Only after human-invoked PR automation becomes boring and reliable. Start with one PR/night.


Final recommendation

On the second OpenClaw software-engineer agent

Yes, but later, and with hard limits.

Good version:

  • low-privilege bot identity
  • agent-safe issue queue
  • small draft PRs
  • deterministic CI
  • strict forbidden zones
  • human merge gate
  • one PR/night to start

Bad version:

  • broad autonomy
  • vague issues
  • architecture work
  • sensitive repos or workflows
  • direct merge/deploy power
  • large overnight diff generation

On the broader vibecoding harness

Build this in order:

  1. spec-first single-agent loop
  2. durable repo memory + reusable skills
  3. manager + specialist subagents
  4. router harness
  5. overnight GitHub agent

That order gives you the best chance of gaining real leverage instead of building a very expensive PR noise machine.

