Issue #19

The Orchestra: A Conductor Layer That Coordinates Multi-Agent Workflows

The AI Playbook · 16 min read · 3 prompts

You have agents that work. Each one does its job -- research, writing, analysis, monitoring. The safety net catches failures. The trust system lets your best agents run autonomously. Everything you built in Issues #13 through #18 is humming.

But your agents do not talk to each other.

Your research agent pulls market data every morning. Your writing agent drafts a newsletter every afternoon. The writing agent does not know what the research agent found. It uses yesterday's context because nobody told it to wait for today's data.

Your analysis agent scores 50 items and writes the results to a file. Your email agent sends the top 5. But the analysis agent ran late -- the scores were not ready when the email agent fired. So the email goes out with stale rankings. Nobody is wrong. The timing is.

This is the coordination problem. Individual agents are reliable. The system of agents is fragile. Not because any single agent fails, but because the dependencies between them are invisible and unmanaged.

The fix is a conductor -- a layer that knows which agents depend on which other agents, manages the sequence and timing, and ensures every agent starts with the right inputs from its upstream dependencies.


The Framework: Three Components

Component          | Purpose                             | Artifact
Dependency Map     | Declare what connects to what       | dependency_map.json
Workflow Executor  | Run agents in wave order            | run_summary.json
Handoff Validator  | Verify every agent-to-agent handoff | validation_report.json

Key insight: The dependency map is declarative -- it describes what IS. The executor is imperative -- it enforces the sequence. The validator is the quality gate between every handoff. Together, they turn a collection of independent agents into a coordinated system.


Component 1: Dependency Map

The dependency map is the conductor's score. Without it, you are hoping agents run in the right order. With it, the conductor knows exactly what to coordinate.

Dependency Mapper
You are a workflow architect. Your job is to analyze an existing set of agents and produce a dependency map that describes how they connect.

Read all agent configuration files in ~/agents/configs/.
Read all output manifests in ~/agents/output_manifest.json.
Read the incident log at ~/incidents/incident_log.jsonl for
timing-related failures.

For each agent, determine:

1. OUTPUTS: What files does this agent produce?
   - List every output file path
   - Note the expected refresh frequency
   - Note the format (JSON, markdown, HTML, CSV)

2. INPUTS: What files does this agent read before running?
   - List every input file path
   - Note whether the input is required or optional
   - Note the minimum freshness requirement ("must be from today",
     "must be less than 1 hour old", "any age is fine")

3. DEPENDENCIES: Which other agents produce this agent's inputs?
   - Map each input file to the agent that creates it
   - Flag circular dependencies (Agent A needs B, B needs A)
   - Flag orphan inputs (files this agent reads that no agent produces)

4. TIMING: Based on refresh frequencies and dependencies, what is
   the natural execution order?
   - Group agents into waves: Wave 1 has no dependencies on other
     agents. Wave 2 depends only on Wave 1 outputs. And so on.
   - Within each wave, agents can run in parallel.
   - Flag any agent whose schedule conflicts with its dependencies
     (e.g., runs at 8 AM but depends on an agent that runs at 9 AM)

Output (save to ~/workflows/dependency_map.json):
{
  "generated": "[ISO timestamp]",
  "agents": {
    "[agent_name]": {
      "outputs": [
        {"path": "...", "format": "...", "refresh": "..."}
      ],
      "inputs": [
        {
          "path": "...",
          "required": true/false,
          "min_freshness": "...",
          "produced_by": "[agent_name]" or null
        }
      ],
      "wave": [number],
      "estimated_runtime": "[duration]",
      "schedule": "[cron or interval]"
    }
  },
  "waves": {
    "1": ["agent_a", "agent_b"],
    "2": ["agent_c"],
    "3": ["agent_d", "agent_e"]
  },
  "circular_dependencies": [],
  "orphan_inputs": [],
  "timing_conflicts": []
}

Rules:
- Be exhaustive. Every file read and every file written must be mapped.
- If an agent reads a file but you cannot determine which agent writes
  it, mark it as an orphan input. Orphans are risks.
- Circular dependencies are blockers. Flag them prominently.
- The wave assignment must be deterministic -- same inputs always
  produce the same wave numbers.

The dependency map surfaces three things you probably did not know about your system: which agents secretly depend on each other (implicit connections through shared files), which inputs come from nowhere (orphans -- a reliability risk), and which schedules conflict with their dependencies (timing bombs).
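The wave grouping in step 4 is a topological leveling of the dependency graph: peel off every agent whose dependencies are all satisfied, call that a wave, and repeat. A minimal Python sketch of the idea (the `deps` structure and agent names are illustrative, not the map's exact schema):

```python
def assign_waves(deps):
    """Assign agents to waves by topological leveling.

    deps: dict mapping each agent to the set of agents it depends on.
    Returns (waves, circular): waves maps wave number to a sorted list
    of agents; circular lists agents that could not be placed because
    they sit on a dependency cycle.
    """
    remaining = {agent: set(upstream) for agent, upstream in deps.items()}
    waves, placed, wave_no = {}, set(), 0
    while remaining:
        wave_no += 1
        # An agent is ready when every dependency is already placed.
        # Sorting makes the assignment deterministic, per the rules above.
        ready = sorted(a for a, d in remaining.items() if d <= placed)
        if not ready:  # agents remain but none are ready: a cycle exists
            return waves, sorted(remaining)
        waves[wave_no] = ready
        placed.update(ready)
        for agent in ready:
            del remaining[agent]
    return waves, []
```

For example, `assign_waves({"fetch": set(), "score": {"fetch"}, "email": {"score"}})` yields three waves in pipeline order, and a two-agent cycle comes back in the `circular` list instead of a wave.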


Component 2: Workflow Executor

The executor is the conductor. It reads the dependency map, checks readiness, and triggers agents in wave order. It does not run the agents itself -- it calls their existing runners. Its job is sequencing and gating.

Workflow Executor
You are a workflow executor. Your job is to orchestrate a multi-agent
pipeline by running agents in the correct order based on dependencies.

Read the dependency map at ~/workflows/dependency_map.json.
Read the current run state at ~/workflows/run_state.json (or create
it if this is the first run).

Execute the workflow:

1. FOR EACH WAVE (starting from Wave 1):

   a. PRE-FLIGHT CHECK: For each agent in this wave, verify that
      all required inputs exist and meet freshness requirements.
      - If an input is missing or stale, check if the producing
        agent is in an earlier wave that already ran. If yes,
something went wrong -- log the failure and skip this agent.
      - If the producing agent has not run yet (wave ordering error),
        halt and report the dependency map is incorrect.

   b. PARALLEL EXECUTION: Start all agents in this wave that pass
      pre-flight. They can run simultaneously because none depend
      on each other.
      - Track start time for each agent.
      - Set a timeout based on the agent's estimated_runtime (2x
        the estimate as the hard limit).

   c. COMPLETION GATE: Wait for all agents in this wave to complete.
      - If an agent times out, mark it as FAILED and continue.
        Downstream agents that depend on it will fail pre-flight.
      - If an agent produces output but the output fails validation,
        mark it as FAILED.

   d. RECORD: Update run_state.json with results for this wave.

2. AFTER ALL WAVES COMPLETE:

   Produce a run summary (save to ~/workflows/run_summary_[DATE].json):
   {
     "run_id": "[UUID]",
     "started": "[ISO timestamp]",
     "completed": "[ISO timestamp]",
     "total_duration": "[duration]",
     "waves_executed": [number],
     "agents_succeeded": [count],
     "agents_failed": [count],
     "agents_skipped": [count],
     "failures": [
       {
         "agent": "[name]",
         "wave": [number],
         "reason": "TIMEOUT" or "PRE_FLIGHT_FAILED" or "VALIDATION_FAILED",
         "downstream_impact": ["list of agents that were skipped because
                                this agent failed"]
       }
     ]
   }

3. RETRY LOGIC:
   - Failed agents do NOT automatically retry in the same run.
   - After all waves complete, if any agents failed, produce a
     retry manifest at ~/workflows/retry_[RUN_ID].json listing
     which agents to retry and in what order.
   - Retries run as a separate workflow execution with only the
     failed agents and their downstream dependents.

Rules:
- Never run a downstream agent if its upstream dependency failed.
  The cascade of failure is explicit, not silent.
- Never modify the dependency map. If the map seems wrong, log
  the issue -- a human or the dependency mapper fixes it.
- If zero agents fail, the retry manifest is empty. This is the
  success case. Log it clearly.
- The executor is stateless between runs. All state lives in
  run_state.json.

Pro Tip

The retry manifest is key. Instead of re-running the entire pipeline when one agent fails, you re-run only the failed agent and its downstream dependents. A 7-agent pipeline where agent 2 fails becomes a 3-agent retry, not a full re-run. This saves time and tokens, and avoids redundant API calls to external services.
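Computing that retry set amounts to a reachability walk over the inverted dependency graph: start from the failed agents and collect everything downstream, transitively. A hedged sketch in Python (`deps` maps each agent to the set of agents it depends on; names are illustrative):

```python
def retry_set(deps, failed):
    """Return the agents a retry run must include: the failed agents
    plus every transitive downstream dependent."""
    # Invert the map: producer -> the consumers that read its outputs.
    consumers = {agent: set() for agent in deps}
    for agent, upstream in deps.items():
        for producer in upstream:
            consumers.setdefault(producer, set()).add(agent)
    # Walk downstream from each failure.
    to_retry, stack = set(failed), list(failed)
    while stack:
        for downstream in consumers.get(stack.pop(), ()):
            if downstream not in to_retry:
                to_retry.add(downstream)
                stack.append(downstream)
    return to_retry
```

In a linear chain a1 -> a2 -> a3 -> a4, a failure in a2 yields `{"a2", "a3", "a4"}` -- the 3-agent retry from the tip -- while agents on unrelated branches stay out of the manifest.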


Component 3: Handoff Validator

The validator sits between every agent-to-agent connection. It is the quality gate at the boundary -- ensuring Agent A's output is what Agent B expects before Agent B starts.

Handoff Validator
You are a handoff validator. Your job is to verify that an agent's
output meets the input requirements of every downstream consumer
before those consumers start.

Read the dependency map at ~/workflows/dependency_map.json.
Read the output that was just produced: [OUTPUT_FILE_PATH].
Read the downstream consumer requirements from the dependency map.

Validate the handoff:

1. FORMAT CHECK:
   - Is the output in the expected format (JSON, markdown, etc.)?
   - Does it parse without errors?
   - If JSON: does it match the expected schema? Check required
     fields, data types, array lengths.

2. FRESHNESS CHECK:
   - Is the output timestamp from the current run?
   - Does it meet the min_freshness requirement of every downstream
     consumer?

3. CONTENT CHECK:
   - Are the values within expected ranges?
   - Are there any null or empty fields that downstream agents
     require to be populated?
   - Is the data internally consistent? (e.g., if it contains a
     count field and an array, does the count match the array
     length?)

4. COMPATIBILITY CHECK:
   - For each downstream consumer, verify the output contains
     everything the consumer needs.
   - Flag any fields the consumer expects that are missing.
   - Flag any fields present in the output that no consumer uses
     (potential waste or drift).

Output (save to ~/workflows/validations/[AGENT]_[TIMESTAMP].json):
{
  "producer": "[agent_name]",
  "output_file": "[path]",
  "timestamp": "[ISO timestamp]",
  "consumers": ["agent_b", "agent_c"],
  "checks": {
    "format": "PASS" or "FAIL",
    "freshness": "PASS" or "FAIL",
    "content": "PASS" or "FAIL",
    "compatibility": {
      "agent_b": "PASS" or "FAIL",
      "agent_c": "PASS" or "FAIL"
    }
  },
  "overall": "PASS" or "FAIL",
  "failures": [
    {
      "check": "[which check]",
      "detail": "[what specifically failed]",
      "severity": "BLOCKING" or "WARNING",
      "consumer_impact": ["which consumers are affected"]
    }
  ]
}

Rules:
- BLOCKING failures prevent downstream agents from starting.
- WARNING failures are logged but do not block execution.
- A missing required field is always BLOCKING.
- A null optional field is WARNING.
- If all checks pass, the downstream agents are cleared to start.
- The validator runs between waves -- after Wave N completes and
  before Wave N+1 begins.

The four validation checks catch the most common handoff failures: wrong format (agent B expects JSON, agent A wrote markdown), stale data (agent A's output is from yesterday), bad content (nulls, out-of-range values, inconsistent counts), and missing fields (agent A dropped a field that agent B needs). Catching these at the boundary -- before the downstream agent starts -- prevents cascading garbage.
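Several of these checks are mechanical enough to sketch. A minimal Python version covering the format, freshness, and count-consistency checks for a JSON output (the function name and parameters are illustrative, not a prescribed API):

```python
import json
import os
import time

def validate_handoff(path, max_age_seconds, count_field=None, items_field=None):
    """Minimal handoff check for a JSON output file.

    Returns (overall, failures) where overall is "PASS" or "FAIL" and
    failures is a list of dicts shaped like the report above.
    """
    failures = []
    # FORMAT: the file must exist and parse as JSON.
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError) as exc:
        return "FAIL", [{"check": "format", "detail": str(exc),
                         "severity": "BLOCKING"}]
    # FRESHNESS: modification time must fall inside the consumer's window.
    age = time.time() - os.path.getmtime(path)
    if age > max_age_seconds:
        failures.append({"check": "freshness",
                         "detail": f"output is {age:.0f}s old",
                         "severity": "BLOCKING"})
    # CONTENT: a declared count must match the array it describes.
    if count_field and items_field:
        if data.get(count_field) != len(data.get(items_field, [])):
            failures.append({"check": "content",
                             "detail": f"{count_field} does not match "
                                       f"len({items_field})",
                             "severity": "BLOCKING"})
    overall = "FAIL" if any(f["severity"] == "BLOCKING" for f in failures) else "PASS"
    return overall, failures
```

Schema and compatibility checks need the downstream consumers' declared requirements from the dependency map, so they are omitted here; the shape is the same -- compare what the output contains against what each consumer says it needs.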


Wiring the Orchestra

Step 1: Map your dependencies. Run the dependency mapper against your existing agents. You will likely discover connections you did not know about -- agents reading files produced by other agents without any explicit coordination. You will also discover orphan inputs and timing conflicts. Fix those first.

Step 2: Assign waves. The dependency mapper produces wave assignments automatically. Review them. Wave 1 should be your data-fetching agents (no upstream dependencies). Wave 2 should be your analysis agents (consume raw data). Wave 3 should be your output agents (consume analyzed data). If your wave structure does not follow this pattern, your dependency graph may have an issue.

Step 3: Run a dry execution. Use the workflow executor with a dry-run flag that checks pre-flight conditions and logs what it would do, without actually starting any agents. This surfaces timing conflicts and stale-input problems before they cause real failures.
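The pre-flight portion of a dry run reduces to file-existence and modification-time checks, which you can run without triggering any agent. A minimal sketch, assuming freshness windows expressed in seconds (the `inputs` structure and agent names are illustrative):

```python
import os
import time

def preflight(agent, inputs, now=None):
    """Dry-run pre-flight: report, without running anything, whether
    each input exists and meets its freshness window.

    inputs: list of dicts like
      {"path": "...", "required": True, "max_age": 3600}
    Returns a list of human-readable problems; empty means clear to run.
    """
    now = now or time.time()
    problems = []
    for spec in inputs:
        path = spec["path"]
        if not os.path.exists(path):
            if spec.get("required", True):
                problems.append(f"{agent}: required input missing: {path}")
            continue  # a missing optional input is not a problem
        max_age = spec.get("max_age")
        if max_age is not None and now - os.path.getmtime(path) > max_age:
            problems.append(f"{agent}: stale input: {path} exceeds {max_age}s")
    return problems
```

Run this for every agent in wave order and log the combined problem list: that is the dry run's output, and every line in it is a failure you just avoided hitting in production.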

Step 4: Wire the handoff validators. Place a validator between every producer-consumer pair. The first run will likely surface schema mismatches and missing fields you did not know about. Good -- fix them now, not during a real failure at 3 AM.

Step 5: Run for real. Execute the full orchestrated workflow. Compare the results to your old approach (agents on independent schedules). You should see: fewer timing-related failures, faster end-to-end completion (parallel execution within waves), and clear visibility into what failed and why.


What the System Looks Like Running

6:00 AM -- Wave 1

Three data-fetching agents run in parallel: market data, news sentiment, and portfolio positions. All three complete in under 2 minutes. Handoff validator confirms all outputs are fresh, properly formatted, and contain the fields downstream agents expect.

6:03 AM -- Wave 2

Two analysis agents run in parallel: signal scoring (consumes market data + positions) and risk assessment (consumes market data + sentiment). Signal scorer finishes in 90 seconds. Risk assessor in 45 seconds. Both outputs validated.

6:05 AM -- Wave 3

Newsletter agent runs (consumes signal scores + risk assessment). Dashboard agent runs in parallel (consumes all upstream outputs). Newsletter draft ready by 6:08 AM. Dashboard refreshed by 6:07 AM.

6:08 AM -- Complete

Total pipeline time: 8 minutes. Every agent had fresh inputs. No stale data. No timing collisions. Run summary: 7 agents succeeded, 0 failed, 0 skipped.

Compare this to the old approach: agents on independent cron schedules, sometimes running before their inputs are ready, sometimes running with yesterday's data because the upstream agent was 5 minutes late. The orchestra does not make your agents faster -- it makes them reliable as a system.


What Could Go Wrong

  1. Over-orchestration. Not every agent needs to be in the dependency map. Agents that truly have no upstream or downstream connections -- a standalone monitoring bot, a one-off research agent -- should stay independent. The orchestra is for agents that depend on each other. Adding independent agents just creates unnecessary complexity.
  2. Wave bottlenecks. If one agent in Wave 2 takes 10 minutes and the rest take 30 seconds, Wave 3 waits for the slowest. Identify bottleneck agents and either optimize them or, if their output is only needed by a subset of Wave 3 agents, push them and that subset into later waves so the rest of the pipeline is not stuck waiting.
  3. Rigid sequencing. The dependency map describes what IS, not what MUST BE. If you restructure an agent's inputs, update the map. Stale dependency maps cause silent failures -- the executor skips agents because it thinks an input is missing when the agent no longer needs it. Run the dependency mapper monthly to keep the map current.
  4. Single point of failure. The executor itself can fail. Wire it into your safety net from Issue #18. If the executor crashes mid-run, the incident detector should catch it (no run summary produced), and the escalation router should handle it (re-run from the failed wave, not from the beginning).

The Bottom Line

Individual agent reliability is solved. Issues #13 through #18 gave you persistent agents, dashboards, quality gates, feedback loops, earned autonomy, and incident response. Each agent is solid on its own.

The orchestra solves the next problem: system reliability. When agents depend on each other -- and they always do, eventually -- you need explicit coordination. Not agents hoping their inputs are ready. Not cron schedules that overlap by coincidence. A conductor that knows the dependencies, enforces the sequence, validates every handoff, and gives you a clear picture of what happened and why.

8 min -- full 7-agent pipeline, orchestrated end-to-end

Wave-based parallel execution. Every agent starts with fresh, validated inputs. A single failure cascades visibly, not silently. The run summary tells you exactly what happened at a glance.

Try It This Week

Pick your three most connected agents -- the ones where Agent A's output feeds Agent B, and Agent B's output feeds Agent C. Write the dependency map for just those three agents by hand. List the outputs, the inputs, the freshness requirements, and the wave assignments.
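For reference, a hand-written map for three hypothetical agents (the names, paths, and schedules are placeholders -- substitute your own) might look like:

```json
{
  "agents": {
    "research": {
      "outputs": [{"path": "~/data/market.json", "format": "JSON", "refresh": "daily"}],
      "inputs": [],
      "wave": 1
    },
    "analysis": {
      "outputs": [{"path": "~/data/scores.json", "format": "JSON", "refresh": "daily"}],
      "inputs": [{"path": "~/data/market.json", "required": true,
                  "min_freshness": "must be from today", "produced_by": "research"}],
      "wave": 2
    },
    "newsletter": {
      "outputs": [{"path": "~/drafts/newsletter.md", "format": "markdown", "refresh": "daily"}],
      "inputs": [{"path": "~/data/scores.json", "required": true,
                  "min_freshness": "must be from today", "produced_by": "analysis"}],
      "wave": 3
    }
  },
  "waves": {"1": ["research"], "2": ["analysis"], "3": ["newsletter"]}
}
```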

Then ask yourself: in the last month, how many times did Agent B run before Agent A's output was ready? How many times did Agent C use stale data from Agent B? If the answer is more than zero, the orchestra is not optional -- it is the fix for a problem you are already having.

Reply with your three-agent dependency map and I will review the wave assignments and flag any timing conflicts you missed.

Next Issue: Issue #20

The Control Room

You have agents, dashboards, quality gates, feedback loops, autonomy tiers, a safety net, and an orchestra. But where do you go to see all of it at once? We will build a unified control room -- a single interface where you monitor every agent, every workflow, every incident, and every metric. The command center for your AI operation.
