
Prompts as Scaffolding: Code for Each Other

The conversation history vanishes. But the codebase persists. What if agents embedded prompts in comments—turning code into an inter-agent communication channel?

Vario aka Mnehmos

💡 The Ephemeral Problem

Multi-agent systems have a fundamental problem: context vanishes between conversations. The orchestrator hands off to a worker, the worker completes the task, and when the next worker picks it up... all that rich context is gone.

The Context Collapse

You've designed the perfect multi-agent workflow. An Orchestrator breaks down work. A Red Phase agent writes failing tests. A Green Phase agent implements. A Blue Phase agent refactors. Beautiful in theory.

But what actually happens?

REALITY OF MULTI-AGENT HANDOFFS
🔴 Red Phase: Writes tests with deep understanding of edge cases, noting why certain tests exist
📤 Handoff: "Tests written. Task complete."
🟢 Green Phase: Opens test file. No idea why the tests exist. Just makes them pass.
📤 Handoff: "Implementation done. All tests green."
🔵 Blue Phase: Opens both files. No history. No intent. Just... code.
🔥 Context collapsed at every handoff 🔥

The boomerang payload helps: it carries a summary and notes. But it's still external to the code. It lives in the conversation history. And conversation history is ephemeral, token-limited, and invisible to future agents in future sessions.

The Codebase Is Persistent. Use It.

The insight hit me while analyzing the multi-agent framework: the codebase is the only thing that survives between agent sessions. So why not use it as the communication channel?

Conversation History

Ephemeral. Token-limited. Lost between sessions. Invisible to agents that join later. Requires explicit handoff payloads that are easily forgotten.

The Codebase

Persistent. Git-tracked. Always available. Comments, TODOs, JSDoc—they're already part of the code. Future agents will read the code. They can read instructions too.

// The key insight:
// Instead of just writing code...
// Write code that INSTRUCTS the next agent.

// TODO(green-phase): This test expects retry logic with exponential backoff.
// TODO(green-phase): Max 3 retries, base delay 1000ms.
// CONTEXT: User complained about flaky network in issue #47.

test('should retry failed requests with exponential backoff', ...);

The Scaffolding Pattern

I call this Prompts as Scaffolding: using code comments as inter-agent communication that persists across sessions and survives context collapse.

The key is structured annotations that tell the next phase exactly what to do:

🔴 Red Phase → Green Phase

// TODO(green-phase): Implement validateToken function
// SIGNATURE: (token: string) => Promise<TokenPayload | null>
// REQUIREMENTS:
//   - Return null for invalid/expired tokens
//   - Parse JWT and extract payload
//   - Cache valid tokens for 5 minutes
// EDGE_CASES:
//   - Empty string should return null
//   - Malformed JWT should return null, not throw
// CONTEXT: This is security-critical. See auth ADR-003.

test('should validate and cache JWT tokens', async () => {
  // Test implementation...
});

🟢 Green Phase → Blue Phase

// REFACTOR(blue-phase): This cache implementation is naive.
// ISSUES:
//   - Memory leak: no eviction beyond TTL
//   - No cache warming strategy
//   - Consider using LRU cache library
// LEAVE_ALONE: Core JWT parsing is intentionally simple

const tokenCache = new Map<string, {payload: TokenPayload, expiresAt: number}>();

export async function validateToken(token: string): Promise<TokenPayload | null> {
  // Implementation that makes tests pass...
}

🔵 Blue Phase → Future Work

// EXTEND(feature): Add token refresh capability
// WHEN: User requests keep-me-logged-in feature
// APPROACH: Check expiry before validation, refresh if under 5min remaining
// BLOCKED_BY: Need refresh token endpoint from backend team

/**
 * Validates a JWT token and returns its payload.
 *
 * @security This is the auth boundary. All routes depend on this.
 * @perf Uses LRU cache with 1000-entry limit, 5-minute TTL.
 * @see ADR-003 for design decisions
 */
export async function validateToken(token: string): Promise<TokenPayload | null> {
  // Clean, refactored implementation...
}

An Annotation Taxonomy

Different annotations serve different purposes. Here's a taxonomy for inter-agent communication through code:

TODO(target-mode)

Action required. The target mode should do this.

CONTEXT:

Background information. Why this exists.

REFACTOR(target-mode)

Works, but needs improvement. Non-blocking.

REQUIREMENTS:

Specification of expected behavior. Testable.

EDGE_CASES:

Non-obvious inputs that must be handled.

LEAVE_ALONE:

Explicitly not for optimization. Has reasons.

BLOCKED_BY:

Cannot proceed until external dependency resolves.

EXTEND(feature)

Future enhancement point. Not for current scope.
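
Since every tag in the taxonomy follows the same `// TAG(target): text` shape, an orchestrator or tooling pass can extract them mechanically. Here is a minimal sketch of such a scanner; the function name, return shape, and the rule that only the tag's header line is captured (not indented continuation lines) are my assumptions, not part of the original framework:

```typescript
// A single extracted annotation. Only the tag line itself is captured;
// indented continuation lines (e.g. "//   - Return null...") are skipped.
interface Annotation {
  tag: string;           // e.g. "TODO", "REFACTOR", "EXTEND"
  target: string | null; // e.g. "green-phase"; null for bare tags like CONTEXT
  line: number;          // 1-based line number of the annotation
  text: string;          // text after the tag
}

// Matches "// TAG(target): text" and "// TAG: text" comment lines,
// using the taxonomy's eight tag names.
const ANNOTATION_RE =
  /^\s*\/\/\s*(TODO|CONTEXT|REFACTOR|REQUIREMENTS|EDGE_CASES|LEAVE_ALONE|BLOCKED_BY|EXTEND)(?:\(([^)]+)\))?:?\s*(.*)$/;

function scanAnnotations(source: string): Annotation[] {
  return source.split("\n").flatMap((lineText, i) => {
    const m = lineText.match(ANNOTATION_RE);
    return m ? [{ tag: m[1], target: m[2] ?? null, line: i + 1, text: m[3] }] : [];
  });
}
```

With a scanner like this, "dispatch the green phase" can mean "collect every `TODO(green-phase)` annotation and hand the list to the agent," rather than hoping the agent greps on its own.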

Combining with Boomerang Payloads

Prompts-as-scaffolding doesn't replace structured handoffs—it augments them. The boomerang payload carries the immediate summary; the code annotations carry the deep context.

COMPLETE HANDOFF PATTERN
1️⃣ Code Annotations (Persistent)
// Embedded in test files, source files, JSDoc
// Survives sessions, searchable, git-tracked
2️⃣ Boomerang Payload (Immediate)
{
  "type": "task-completed",
  "from": "red-phase",
  "to": "orchestrator",
  "summary": "Wrote 7 failing tests for validateToken",
  "files_changed": ["tests/auth/validateToken.test.ts"],
  "notes": "See TODO(green-phase) annotations for impl hints"
}
3️⃣ Orchestrator Decision
// Orchestrator reads boomerang, knows to dispatch green-phase
// Green-phase reads code, finds TODO(green-phase) annotations
// Full context survives even if orchestrator restarts
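
The payload half of this pattern benefits from being typed. Below is a sketch of the boomerang payload as a TypeScript interface; the field names mirror the example above, but the `Phase` union and the `parsePayload` validator are my assumptions about how an orchestrator might guard its inbox:

```typescript
// Assumed set of modes in the TDD workflow.
type Phase = "orchestrator" | "red-phase" | "green-phase" | "blue-phase";

// Shape of the boomerang payload from the example above.
interface BoomerangPayload {
  type: string;            // e.g. "task-completed"
  from: Phase;
  to: Phase;
  summary: string;
  files_changed: string[];
  notes?: string;          // e.g. pointer to TODO(green-phase) annotations
}

// Parse and minimally validate an incoming payload before dispatching.
function parsePayload(json: string): BoomerangPayload {
  const p = JSON.parse(json) as BoomerangPayload;
  if (!p.type || !p.from || !p.to || !p.summary || !Array.isArray(p.files_changed)) {
    throw new Error("malformed boomerang payload");
  }
  return p;
}
```

Validating at the boundary keeps a mangled handoff from silently dispatching the wrong phase; the durable context still lives in the code annotations either way.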

Why This Actually Works

This isn't just theory. There are structural reasons why prompts-as-scaffolding outperforms pure conversation-based handoffs:

1. Locality of Reference

The instructions are next to the code they describe. Green-phase doesn't need to remember what the orchestrator said—it's right there above the test it needs to implement.

2. Git History as Audit Trail

When annotations are removed (TODO completed), it shows in the diff. You can trace what guidance was given and when it was fulfilled. Conversation history can't do this.

3. Searchable Context

grep "TODO(blue-phase)" instantly shows all pending refactors. grep "BLOCKED_BY" shows all external dependencies. Try doing that with conversation history.

4. Human-Readable Too

These annotations help human developers just as much. Good comments are good comments. This isn't LLM-specific markup—it's just structured documentation.

Implementation: Mode Contracts

The pattern works best when it's part of the mode contract itself—not something agents "should" do, but something they're required to do.

Red Phase Contract

# Red Phase - TDD Test Writing Contract

1. Write failing tests that define expected behavior
2. ⚠️ FOR EACH TEST, INCLUDE:
   - TODO(green-phase) with implementation hints
   - REQUIREMENTS block listing expected behaviors  
   - EDGE_CASES block for non-obvious inputs
   - CONTEXT block explaining why this test exists

3. Test file becomes a SPECIFICATION DOCUMENT
   not just assertions

Green Phase Contract

# Green Phase - TDD Implementation Contract

1. Read all TODO(green-phase) annotations before coding
2. Implement EXACTLY what annotations specify
3. ⚠️ WHEN DONE:
   - Remove completed TODO(green-phase) annotations
   - Add REFACTOR(blue-phase) for known compromises
   - Add LEAVE_ALONE for intentional simplicity

4. If annotation is unclear: ask, don't guess

Blue Phase Contract

# Blue Phase - TDD Refactor Contract

1. Search for REFACTOR(blue-phase) annotations
2. Respect LEAVE_ALONE markers
3. ⚠️ WHEN DONE:
   - Remove completed REFACTOR annotations
   - Add EXTEND(feature) for future enhancements
   - Update JSDoc with @security, @perf, @see

4. Final code should have NO phase-specific TODOs
   Only EXTEND markers for future work
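
The "no phase-specific TODOs" clause is mechanically checkable. Here is a hypothetical sketch of a lint the Blue Phase (or CI) could run before closing out; the marker list and function name are assumptions that mirror the contracts above:

```typescript
// Phase-specific markers that must not survive the Blue Phase.
// EXTEND(feature) is deliberately absent: it is allowed in final code.
const PHASE_MARKERS: RegExp[] = [
  /TODO\(green-phase\)/,
  /TODO\(blue-phase\)/,
  /REFACTOR\(blue-phase\)/,
];

// Returns the patterns (as strings) still present in the given source text.
function findLeftoverMarkers(source: string): string[] {
  return PHASE_MARKERS.filter((re) => re.test(source)).map((re) => re.source);
}
```

A non-empty result fails the check, which turns the contract's final rule from a convention into an enforced invariant.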

The Codebase as Communication Medium

Multi-agent systems fail when they treat code as output and conversation as context. The code persists; the conversation evaporates.

Prompts-as-scaffolding inverts this: embed the context in the code. When agents write code, they're not just solving the current task—they're teaching the next agent how to continue their work.

The Pattern in One Sentence

Write code that instructs the next agent.

Red Phase writes tests that teach Green Phase what to implement.
Green Phase writes code that teaches Blue Phase what to refactor.
Blue Phase writes documentation that teaches Future Work what to extend.

The codebase becomes a persistent, searchable, git-tracked communication channel between agents across time. Context collapse becomes context accumulation.

And the best part? It's just good documentation. The techniques that help agents work together also help humans understand the code. There's no LLM-specific magic—just structured comments with a specific purpose.

This pattern emerged from building the Advanced Multi-Agent Framework and analyzing prompt engineering research at mnehmos.prompts.research.