AI Coding Harness Patterns - 2026-04-22

Executive summary

Two answers.

1. Should you set up a separate OpenClaw software-engineer agent with its own GitHub account to work overnight?

Yes, but only in a narrow form.

It is not a good pattern as a general autonomous software engineer. It is a good pattern as a bounded overnight batch worker or constrained janitor bot that only takes on small, sharply scoped, objectively testable tasks and opens draft pull requests for review.

The core issue is not whether the model can write code. It can. The real question is whether the harness turns that ability into useful leverage instead of review debt, security risk, and noisy PR spam.

2. What agentic coding patterns actually work?

The winning pattern is not "pick one magical coding agent." It is:

  • clear task envelopes
  • explicit plan before edit for non-trivial work
  • durable repo memory for stable conventions
  • strict verification and review gates
  • selective use of parallel agents
  • optional routing across models only after the basics are stable

If you want the short recommendation:

  1. Start with a spec-first single-agent loop
  2. Add durable repo memory and reusable skills
  3. Add selective manager + specialist subagents
  4. Add router-based multi-model optimization only when the rest is already working
  5. Add an overnight GitHub agent last, after the loop is proven safe and boring

Part I. Is the separate overnight GitHub engineer-agent a good pattern?

Short answer

Yes, under strict conditions. No, as a general autonomy pattern.

The right mental model is:

  • bad model: autonomous engineer
  • good model: constrained overnight batch worker

It is productive when it does low-risk maintenance and hands you small reviewable PRs in the morning. It is unproductive when it generates ambiguous, sprawling, or high-risk diffs that you have to mentally reconstruct from scratch.


Where this pattern works

This works best for:

  • narrow bug fixes
  • tests
  • docs
  • lint/type cleanup
  • repetitive mechanical refactors
  • small devex improvements
  • issues that a human could likely solve in roughly 30 to 90 minutes
  • tasks with clear acceptance criteria and deterministic checks

It works especially well when:

  • the repo already has decent tests
  • CI is trustworthy
  • issue quality is high
  • coding conventions are documented
  • changes are local and reversible

Where this pattern fails

This pattern becomes bad fast when the issue queue includes:

  • architecture work
  • ambiguous product decisions
  • broad refactors
  • auth / billing / permissions / secrets
  • CI security or GitHub Actions logic
  • migrations
  • infra / deployment
  • concurrency or performance-critical code
  • weakly tested codepaths
  • highly tribal codebases

The hidden failure mode is not just wrong code. It is:

  • plausible-looking diffs with hidden damage
  • too many mediocre PRs
  • review fatigue
  • permission mistakes
  • workflow security mistakes
  • erosion of trust in the agent's output

The strict conditions required

You should only do this if all of the following are true:

Scope control

  • issues are pre-shaped into small, reviewable units
  • each issue has explicit acceptance criteria
  • task size is capped by files changed and/or line budget
  • forbidden directories and forbidden task classes are enforced

Permission control

  • the agent has its own GitHub identity
  • it cannot push to protected branches
  • it cannot merge
  • it cannot deploy
  • it cannot rotate secrets
  • it cannot access more repos than necessary
  • ideally it uses a low-privilege bot or app identity rather than a powerful personal account

Execution control

  • every run is time-boxed
  • every run is budget-capped
  • output is fully logged
  • work happens in isolated branches or disposable environments
  • the agent can open PRs, comments, and artifacts, but not perform irreversible actions

Verification control

  • deterministic CI is required
  • lint, tests, and type checks are required where relevant
  • PR template requires summary, tests run, limitations, and unresolved questions
  • draft PRs are the default initially
  • human review is always required

Governance control

  • there is a labeled queue of agent-safe issues
  • there is a reviewer responsible for merge decisions
  • there are stop conditions if quality drops

If you try this, start tighter than you think necessary.

Suggested initial policy

  • max 1 PR per night per repo
  • draft PRs only at the start
  • labeled agent-safe issues only
  • forbidden zones: auth, billing, infra, secrets, permissions, CI security, migrations, concurrency
  • small diff budget, for example roughly under 150 changed lines and under 5 files unless explicitly approved
  • required human review before any merge
  • no auto-merge
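The policy above stays aspirational unless it is enforced mechanically. One way is to encode the limits as data and check every candidate change set before a PR is opened. A minimal sketch; the zone list, limits, and function names are illustrative, not from any specific tool:

```python
# Hypothetical policy gate: check a candidate change set before opening a PR.
FORBIDDEN_ZONES = ("auth/", "billing/", "infra/", "secrets/",
                   ".github/workflows/", "migrations/")

POLICY = {
    "max_changed_lines": 150,  # small diff budget
    "max_changed_files": 5,
    "max_prs_per_night": 1,
}

def violates_policy(changed_files: dict[str, int], prs_opened_tonight: int) -> list[str]:
    """Return human-readable violations; an empty list means the change may proceed.

    changed_files maps file path -> lines changed in that file.
    """
    violations = []
    if prs_opened_tonight >= POLICY["max_prs_per_night"]:
        violations.append("nightly PR budget exhausted")
    if len(changed_files) > POLICY["max_changed_files"]:
        violations.append(f"too many files changed: {len(changed_files)}")
    total = sum(changed_files.values())
    if total > POLICY["max_changed_lines"]:
        violations.append(f"diff too large: {total} lines")
    for path in changed_files:
        if path.startswith(FORBIDDEN_ZONES):  # str.startswith accepts a tuple of prefixes
            violations.append(f"forbidden zone touched: {path}")
    return violations
```

The point of returning a list rather than a boolean is that the nightly report can show exactly which rule was tripped, which matters when you are tuning the budget later.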

Suggested kill criteria

Pause or narrow the system if any of these happen too often:

  • too many PRs closed unmerged
  • too many PRs need major rewrite
  • revert rate rises
  • reviewer time per merged PR is too high
  • repeated violations of scope boundaries
  • noisy PR generation outpaces your review bandwidth

A practical threshold: if roughly 20 to 30 percent of PRs are getting rejected, reverted, or heavily rewritten, the system is probably not paying for itself.
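That threshold can be tracked mechanically from PR outcomes. A sketch of the waste-rate check implied above; the outcome labels and the 25 percent default are illustrative choices within the 20 to 30 percent band:

```python
def pr_waste_rate(prs: list[dict]) -> float:
    """Fraction of agent PRs that were rejected, reverted, or heavily rewritten."""
    if not prs:
        return 0.0
    wasted = sum(1 for pr in prs
                 if pr["outcome"] in {"closed_unmerged", "reverted", "major_rewrite"})
    return wasted / len(prs)

def should_pause(prs: list[dict], threshold: float = 0.25) -> bool:
    """Pause or narrow the overnight agent when the waste rate crosses the threshold."""
    return pr_waste_rate(prs) >= threshold
```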


My recommendation on the overnight agent

Do it last, not first.

I would not begin by creating a second OpenClaw agent and letting it roam GitHub overnight. That is backwards.

First prove that your coding loop can already do this reliably in a supervised way:

  1. understand a bounded issue
  2. propose a plan
  3. change only the intended surface
  4. run checks
  5. explain what it changed
  6. fail closed when uncertain

Once that loop is boring and reliable, then move it into GitHub PR automation, and only later schedule it unattended.
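The fail-closed step is the one most harnesses skip. The gating logic of the supervised loop can be sketched as a pure decision function over the artifacts the harness already has; the field names and statuses here are illustrative, not an API:

```python
def run_supervised(plan: dict, diff: dict, checks_passed: bool) -> dict:
    """Decide what to do with a supervised agent run's output.

    plan/diff are plain summaries the harness produced earlier; this function
    only encodes the gates: fail closed on uncertainty, reject scope creep,
    require passing checks before handing off for review.
    """
    if plan["open_questions"]:  # fail closed when uncertain: ask, do not guess
        return {"status": "needs_human", "questions": plan["open_questions"]}
    extra = sorted(set(diff["files"]) - set(plan["files"]))
    if extra:  # changed more than the intended surface
        return {"status": "scope_violation", "extra": extra}
    if not checks_passed:
        return {"status": "checks_failed"}
    return {"status": "ready_for_review"}
```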

So the final answer is:

  • good pattern eventually: yes
  • good first move now: no

Part II. The four harness approaches that actually work

Below are the four most credible operating models, ranked in the order I would recommend building them.


Approach 1. Spec-first single-agent loop

What it is

A single coding agent works in a disciplined sequence:

  1. explore relevant code
  2. propose a plan
  3. name touched files and risks
  4. implement
  5. run checks
  6. self-review
  7. hand off for human review
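Steps 2 and 3 produce a concrete plan artifact before any edit happens. One possible shape for it, with a reviewability check attached; all fields and defaults are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Plan artifact the agent must produce before editing (steps 2-3 above)."""
    goal: str
    files: list[str]                 # exact surfaces it intends to touch
    risks: list[str]
    acceptance_criteria: list[str]
    checks: list[str] = field(default_factory=lambda: ["lint", "tests"])

    def is_reviewable(self, max_files: int = 5) -> bool:
        """A plan is reviewable only if it is bounded and verifiable."""
        return 0 < len(self.files) <= max_files and bool(self.acceptance_criteria)
```

A plan that fails `is_reviewable` goes back for re-scoping instead of straight to implementation, which is exactly where this loop earns its keep.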

Why it works

This is the best default because it minimizes chaos. It keeps architectural control with you, keeps prompts compact, and produces diffs that are easier to review.

Strengths

  • best baseline maintainability
  • low orchestration overhead
  • easy to debug
  • strong review quality
  • good token efficiency compared to chaotic long chats

Weaknesses

  • not ideal for trivial fixes because planning overhead can dominate
  • limited throughput compared with parallel subagents
  • still bounded by one context window

Memory and token strategy

  • keep prompt structure stable: goal, constraints, success criteria
  • reference exact files, not whole-repo dumps
  • store stable repo rules outside the prompt in durable files
  • reset sessions when they become bloated
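The stable prompt skeleton from the first bullet can be a small builder so every session gets the same structure: goal, constraints, success criteria, and exact file references instead of whole-repo dumps. A sketch; the section headers are an illustrative convention:

```python
def build_task_prompt(goal: str, constraints: list[str],
                      success: list[str], files: list[str]) -> str:
    """Assemble the stable prompt structure used for every task, so the shape
    never drifts from session to session."""
    sections = [
        "## Goal\n" + goal,
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Success criteria\n" + "\n".join(f"- {s}" for s in success),
        "## Files in scope\n" + "\n".join(f"- {f}" for f in files),
    ]
    return "\n\n".join(sections)
```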

When to use it

This should be your default for most serious coding work.

Verdict

Rank: #1

This is the best first system to build because it teaches the right habits and exposes failures early.


Approach 2. Durable-context harness with repo memory files and reusable skills

What it is

You externalize stable instructions into repo-level memory and workflow artifacts. Examples:

  • coding rules
  • architecture constraints
  • common commands
  • PR checklist
  • migration checklist
  • review checklist
  • task-specific reusable skills/playbooks

The point is to stop re-explaining the same things every session.

Why it works

This is where compounding begins. Instead of stuffing rules into prompts, you build a persistent operating system for the agent. That improves consistency, reduces token waste, and preserves maintainability.

Strengths

  • best long-term token efficiency
  • best repeatability
  • reduces prompt drift
  • captures tribal knowledge in reusable form
  • improves onboarding of both humans and agents

Weaknesses

  • requires maintenance discipline
  • stale or bad memory files can encode bad behavior repeatedly
  • temptation to over-document everything

Memory and token strategy

Only store stable, high-value information such as:

  • architecture decisions
  • allowed commands
  • coding standards
  • test commands
  • repo boundaries
  • repeated pitfalls

Do not store transient debugging chatter or giant raw transcripts.

When to use it

As soon as you have repeated workflows or repeated repo conventions.

Verdict

Rank: #2

It is arguably the highest long-term ROI layer, but it works best after the single-agent loop is already disciplined.


Approach 3. Manager + specialist subagents

What it is

One orchestrator manages multiple bounded workers, for example:

  • reconnaissance agent
  • implementation agent
  • test-writing agent
  • review or regression agent

Each worker gets a narrow brief and a small context packet. The manager integrates outputs and applies the final gate.
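A minimal shape for that handoff: each worker receives only a narrow brief and returns a summary, never a transcript, and the manager applies the final gate. All names here are illustrative; `run_worker` stands in for whatever actually calls a model:

```python
from dataclasses import dataclass

@dataclass
class Brief:
    """Small context packet a worker receives: scope, files, acceptance
    criteria, and stop conditions."""
    role: str
    scope: str
    files: list[str]
    acceptance: list[str]
    stop_conditions: list[str]

def run_pipeline(briefs: list[Brief], run_worker) -> dict:
    """Manager loop: dispatch each bounded worker, collect summaries
    (not transcripts), and apply the final gate."""
    summaries = {}
    for brief in briefs:
        summaries[brief.role] = run_worker(brief)  # worker sees only its brief
    ok = all(s.get("status") == "ok" for s in summaries.values())
    return {"approved": ok, "summaries": summaries}
```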

Why it works

Parallelism helps when the work is truly separable. It is powerful for larger tasks, repo analysis, multi-option design, and review.

Strengths

  • highest throughput when decomposition is clean
  • better division of labor
  • natural place for review agents
  • good for comparative research and implementation planning

Weaknesses

  • coordination overhead
  • context drift between agents
  • duplicate exploration if briefs are sloppy
  • easy to create token waste
  • easier to lose architectural coherence

Memory and token strategy

  • use narrow briefs
  • pass summaries, not full transcripts
  • keep a shared contract: scope, files, acceptance criteria, stop conditions
  • avoid overlapping edit surfaces unless one manager owns the merge logic

When to use it

Once you already have a strong baseline loop and clear task decomposition.

Verdict

Rank: #3

Very powerful, but not a good first system. It amplifies both strengths and weaknesses.


Approach 4. Router harness with cheap models for exploration and strong models for commit-worthy diffs

What it is

You route different task types to different models. Typical split:

  • cheap/fast model for search, repo mapping, summarization, logs, triage
  • strong model for implementation and review of commit-worthy diffs
  • optional alternate model for independent review

Why it works

Not all work deserves premium tokens. Routing can massively improve cost efficiency if the handoffs are clean.

Strengths

  • best cost/performance potential
  • flexible across providers
  • useful when model strengths shift quickly
  • avoids wasting premium models on repo spelunking

Weaknesses

  • more harness complexity
  • risk of lossy handoffs between models
  • easy to optimize for price instead of code quality
  • harder to debug when failures come from routing logic rather than model behavior

Memory and token strategy

  • summarize exploration into compact handoff packets
  • use provider-specific caching/checkpointing where available
  • keep routing logic explicit and debuggable
  • route by task class, not vibe
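Routing by task class rather than vibe can be as blunt as a lookup table. The model names below are placeholders, not recommendations; the one design choice worth copying is that unknown classes fail toward the strong model, so a misroute degrades cost rather than code quality:

```python
# Hypothetical task-class router: cheap model for exploration, strong model
# for commit-worthy diffs, an alternate model for independent review.
ROUTES = {
    "search": "cheap-fast-model",
    "repo_map": "cheap-fast-model",
    "summarize": "cheap-fast-model",
    "triage": "cheap-fast-model",
    "implement": "strong-model",
    "review_diff": "independent-review-model",
}

def route(task_class: str) -> str:
    """Pick a model by task class; unknown classes default to the strong model."""
    return ROUTES.get(task_class, "strong-model")
```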

When to use it

After your basic workflow, memory discipline, and verification stack are already solid.

Verdict

Rank: #4

Useful, but I would add it later. It solves a real problem, but only after you've earned the complexity.


Ranked recommendation

First: spec-first single-agent loop

Because it is the simplest reliable foundation and the easiest to reason about.

Second: durable repo memory and reusable skills

Because this is where long-term consistency and token efficiency compound.

Third: manager + specialist subagents

Because selective parallelism becomes powerful once your baseline discipline is already strong.

Fourth: router harness

Because it adds complexity for cost/performance gains that matter more later than early.


If I were designing your system, I would think in layers.

Layer 1. Operator and orchestration surface

Use OpenClaw as the control plane. It is a strong place to launch, route, supervise, and review work. But it should not be the place where giant raw coding context accumulates forever.

Layer 2. Primary disciplined coding loop

Use a strong coding agent in a spec-first loop. The important thing is less the brand and more the discipline:

  • plan first
  • narrow context
  • explicit constraints
  • exact success criteria
  • deterministic checks

Layer 3. Durable memory

Use repo files and reusable playbooks/skills to capture:

  • architecture boundaries
  • coding norms
  • test commands
  • task checklists
  • forbidden moves

Layer 4. Review and verification gate

Always require:

  • tests
  • lint/type checks where relevant
  • concise change explanation
  • unresolved-question section
  • human review of design quality, not just green CI
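The template half of this gate can be enforced mechanically before a human ever looks at the PR. A sketch that checks a PR description for the mandatory sections; the header strings are an illustrative convention:

```python
REQUIRED_SECTIONS = ("## Summary", "## Tests run", "## Limitations",
                     "## Unresolved questions")

def missing_sections(pr_body: str) -> list[str]:
    """Return the required PR-template sections absent from the description.

    A complete template plus green CI is the floor, not the bar: a human still
    reviews design quality.
    """
    return [s for s in REQUIRED_SECTIONS if s not in pr_body]
```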

Layer 5. Selective parallelism

Add subagents only where decomposition is clean:

  • reconnaissance
  • implementation
  • testing
  • review

Layer 6. Cost/performance routing

Add multi-model routing only when you are feeling real cost or latency pain.

Layer 7. Overnight GitHub agent

Add the scheduled GitHub worker last, after the above is already stable.


What actually makes someone feel 10x here

Not raw code generation.

The real leverage comes from building a system that is better than your unaided default at:

  • slicing work into reviewable units
  • preserving intent and constraints across sessions
  • keeping token usage under control
  • making verification automatic
  • catching regressions early
  • reducing context rebuild cost
  • preventing the same mistakes from recurring
  • turning ambiguous ideas into explicit specs quickly

The "10x" feeling is usually the result of:

  • less re-explaining
  • less context loss
  • less thrash
  • less low-value coding
  • more parallel reconnaissance
  • better morning review packets

That is a harness design problem more than a model problem.


Concrete implementation plan I would recommend

Phase 1. Build the supervised baseline

Build a local or chat-triggered harness that can:

  1. ingest a task
  2. generate a short plan
  3. identify touched files
  4. implement in a branch/worktree
  5. run checks
  6. summarize changes and risks

Do this with one agent first.

Phase 2. Add repo memory

Create durable files for:

  • coding conventions
  • architecture notes
  • commands
  • test procedures
  • review checklist
  • forbidden zones
  • issue templates for agent-safe tasks

Phase 3. Add review discipline

Require each agent run to produce:

  • change summary
  • tests run
  • unresolved questions
  • rollback note
  • self-critique

Phase 4. Add selective subagents

Only after the first three are working, add bounded workers for:

  • reconnaissance
  • implementation
  • test generation
  • review

Phase 5. Add GitHub PR automation

Make it human-invoked first. Have it open draft PRs only.

Phase 6. Add unattended overnight scheduling

Only after human-invoked PR automation becomes boring and reliable. Start with one PR/night.


Final recommendation

On the second OpenClaw software-engineer agent

Yes, but later, and with hard limits.

Good version:

  • low-privilege bot identity
  • agent-safe issue queue
  • small draft PRs
  • deterministic CI
  • strict forbidden zones
  • human merge gate
  • one PR/night to start

Bad version:

  • broad autonomy
  • vague issues
  • architecture work
  • sensitive repos or workflows
  • direct merge/deploy power
  • large overnight diff generation

On the broader vibecoding harness

Build this in order:

  1. spec-first single-agent loop
  2. durable repo memory + reusable skills
  3. manager + specialist subagents
  4. router harness
  5. overnight GitHub agent

That order gives you the best chance of gaining real leverage instead of building a very expensive PR noise machine.

