The Colony Doesn't Self-Assemble: Why Agent Orchestration Is Design Work, Not Delegation

"A single ant or bee isn't smart, but their colonies are." Deborah Gordon

#I. The ant has no head for the goal

An ant has roughly 250,000 neurons. It cannot picture the nest. It has never seen the foraging trail as a whole, never reasoned about supply and demand, never held the goal "feed the colony" in its head, because it has no head for holding goals. And yet the colony solves shortest-path routing, load-balances its workforce, and reallocates labor when the world shifts, all without a manager. This is the fantasy we import the moment we say our agents will "figure it out": that intelligence is something the swarm finds on its own if we just point it at a goal and step back. But the ant's stupidity is load-bearing. The colony is smart because each ant runs a small, exactly-tuned rule against a carefully structured environment (pheromone gradients, brood-pile geometry, antennal contact rates), all of it refined over a hundred million years. Nobody got that for free. And when I open the production traces on our own orchestration layer, the lesson is the same, just written in retries and dead-letter queues instead of pheromone: the emergence we want is bought, line by line, in the structure we build around the agents, never in the agents alone.

I want to be precise about what I'm arguing here, because the field is awash in two opposite errors. The first is the romance: give five smart agents a goal and a chat channel, and watch coordination bloom. The second is the overcorrection: multi-agent is hype, just use one big model with a long prompt. Both are wrong, and they're wrong for the same reason. They both assume the intelligence lives in the agents. It doesn't. In a colony, and in a working orchestration system, the intelligence lives in the environment, the constraints, and the feedback loops: the substrate the agents act through. That substrate is the thing you build. That's the job. This is a field note about what that job actually looks like once you're past the demo.

#II. The fantasy, stated fairly

Let me steelman the dream, because it isn't stupid. Biology really does produce coordinated, adaptive, robust collective behavior out of agents that are individually dumb and that never see the global state. Deborah Gordon's decades of work on harvester ants show that colony-level decisions (how many ants forage today, when to switch tasks) emerge from nothing more than local interaction rates: an ant adjusts its behavior based on how often it bumps into other ants doing particular jobs (Gordon, 2010, Ant Encounters). There is no ant in charge. There is no foreman ant reading a dashboard. The regulation is real and it is genuinely decentralized.

And the LLM version of this dream now has real evidence behind it. Anthropic's write-up of their multi-agent research system reports that an orchestrator-plus-subagents architecture materially outperformed a single agent on open-ended research tasks, largely because parallel subagents could each burn through their own context window exploring a different branch of the problem [1]. More eyes, more parallel search, better coverage. The romance has a benchmark.

So when an engineer says "we'll just let the agents collaborate," they're gesturing at something that demonstrably works in nature and increasingly works in our systems. I'm not here to mock that intuition. I held it myself. I'm here to report what I found when I tried to ship it, and to name the part of the picture the romance leaves out.

#III. The field truth: it works when it's set correctly for the job and not otherwise

Here is the single observation that reorganized how I think about all of this. The systems that worked, worked because each agent was configured correctly for that specific job: its role boundary, its tools, its termination condition, the exact shape of what it was allowed to write back into the shared state. The systems that failed didn't fail because the models were dumb. They failed because the setup was wrong for the job at hand. Same models. Same framework. Different outcomes, entirely a function of how tightly and how appropriately each piece was specified.

I describe my own stance now as control, but not strict. Not strict in the sense of scripting every step; that kills the adaptivity that made you reach for multiple agents in the first place. But controlled in the sense that the space the agents move through is deliberately shaped: bounded roles, well-defined handoffs, a structured place to write intermediate work, and hard limits on what counts as "done." Loose behavior inside tight structure. That's the harvester-ant arrangement, restated. The ant's behavior is locally flexible; the rules and the environment it runs against are not arbitrary.
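To make "loose behavior inside tight structure" concrete, here is a minimal sketch of what a per-agent specification might look like. Everything in it (`AgentSpec`, `SharedState`, `run_agent`, the field names) is hypothetical and invented for illustration; it is not taken from any framework cited in this essay. The point is only that the role boundary, the write boundary, and the termination limit are enforced by the structure, while the step function stays free to do whatever it likes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical per-agent configuration: the tight structure."""
    role: str                      # what this agent is for
    allowed_tools: frozenset[str]  # tools it may call (declared, not enforced here)
    writable_keys: frozenset[str]  # shared-state keys it may write
    max_steps: int                 # hard limit on how long it may run


class SharedState:
    """Shared scratchpad that enforces each agent's write boundary."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    def write(self, spec: AgentSpec, key: str, value: object) -> None:
        if key not in spec.writable_keys:
            raise PermissionError(f"{spec.role} may not write {key!r}")
        self._data[key] = value

    def read(self, key: str) -> object:
        return self._data.get(key)


def run_agent(spec: AgentSpec, state: SharedState, step_fn) -> str:
    """Run step_fn until it reports done or the step budget runs out.

    step_fn(step_index) returns (key, value, done). How it decides what to
    return is deliberately unconstrained: that is the loose behavior.
    """
    for i in range(spec.max_steps):
        key, value, done = step_fn(i)
        state.write(spec, key, value)
        if done:
            return "done"
    return "budget_exhausted"


researcher = AgentSpec(
    role="researcher",
    allowed_tools=frozenset({"search"}),
    writable_keys=frozenset({"notes"}),
    max_steps=3,
)
state = SharedState()
status = run_agent(researcher, state, lambda i: ("notes", f"finding {i}", i == 1))
```

In this sketch the agent finishes on its second step and leaves its last finding under `notes`; an attempt to write any other key raises `PermissionError`, and an agent that never declares itself done is stopped by `max_steps` rather than by hope.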

For a while this felt like a personal heuristic. Then I went looking for whether the data agreed with me. It does, emphatically, and with more specificity than I expected.

#IV. The empirical turn: swarms fail at the seams, not at the synapses

The most important thing I read in this whole investigation was the Berkeley Sky Computing Lab's study, Why Do Multi-Agent LLM Systems Fail?, which builds an empirical taxonomy (they call it MAST, the Multi-Agent System Failure Taxonomy) from a large set of annotated failure traces across multiple frameworks [20]. Their finding is the thesis of this essay, arrived at independently and from the data side: multi-agent systems fail predominantly at specification and inter-agent coordination (under-defined or contradictory roles, agents that derail from the task, agents that terminate prematurely, conversations that lose the thread), not primarily at the level of raw model capability.

Read that again, because it's the whole game. The failures live in the design seams: the joins between agents, the definition of who does what, the conditions under which a step is finished. These are not properties of the models. They are properties of the structure you built around the models. An ant colony with badly tuned interaction rules doesn't produce a slightly worse colony; it produces a colony that spirals: the famous ant mill, where a circle of army ants follows each other's pheromone trail until they die of exhaustion, each ant individually obeying a perfectly reasonable local rule. Emergent, decentralized, self-organizing, and fatal. Emergence is not your friend by default. Emergence is just what structure produces, good structure or bad.

The operational sources rhyme with this exactly. Zylos's work on graceful degradation reports high failure rates in agent systems that lack the basic resilience machinery: circuit breakers, exponential backoff, bulkheads, self-healing state [2]. Claro Digital's error-recovery guide reports the inverse: high success rates once a stack of orchestration patterns is deliberately applied [3]. Put those two next to each other and you have the entire argument in two data points. The difference between a swarm that works and one that collapses is not the intelligence of the parts. It is the presence or absence of engineered coordination and recovery. You install resilience. It does not emerge.
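Two of the pieces of resilience machinery named above, the circuit breaker and exponential backoff, are small enough to sketch. What follows is a minimal illustration of the general patterns, not the implementation from any cited source; the class and function names, the thresholds, and the injectable `clock` and `sleep` parameters are all assumptions made so the example stays self-contained and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call instead of trying it."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(base=0.5, factor=2.0, cap=8.0, attempts=5):
    """Exponentially growing waits, capped so retries never stall forever."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def retry_with_backoff(fn, delays, sleep=time.sleep):
    """Try fn; after each failure wait the next delay, then make one final attempt."""
    for delay in delays:
        try:
            return fn()
        except CircuitOpenError:
            raise  # an open circuit is a signal to stop, not to retry
        except Exception:
            sleep(delay)
    return fn()  # final attempt; its exception propagates to the caller
```

A real deployment would usually add random jitter to the delays and keep one breaker per downstream dependency, so that a single failing tool is isolated rather than stalling every agent that shares the process.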

#V. Stigmergy, or: the colony writes on the world

Here is where biology stops being a metaphor and starts being an architecture diagram.

The mechanism that lets dumb ants coordinate without a manager has a name, given by the entomologist Pierre-Paul Grassé and formalized for engineers by Bonabeau, Dorigo, and Theraulaz in their 1999 book Swarm Intelligence: stigmergy. Coordination through a modified shared environment. An ant doesn't message another ant. It changes the world (it lays a pheromone trace), and the changed world changes the next ant's behavior. The trail is a shared, external, persistent medium. The intelligence is not in the ants; it is in the environment the ants read and write.
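A toy model makes the mechanism easier to see. The sketch below is a deliberately simplified, deterministic illustration of stigmergy, not the ant-colony optimization algorithms from Swarm Intelligence: ants never exchange messages, they only read and write a shared `PheromoneField`. The names, the evaporation rate, and the expected-value treatment of how ants split across paths are all assumptions made for the example.

```python
class PheromoneField:
    """The shared medium: agents read and write this, never each other."""

    def __init__(self, evaporation=0.5):
        self.evaporation = evaporation
        self.strength = {}  # path name -> pheromone level

    def deposit(self, path, amount):
        self.strength[path] = self.strength.get(path, 0.0) + amount

    def evaporate(self):
        # Old traces fade, so stale information stops steering behavior.
        for path in self.strength:
            self.strength[path] *= 1.0 - self.evaporation

    def best(self, candidates):
        return max(candidates, key=lambda p: self.strength.get(p, 0.0))


def simulate(path_lengths, ants=10, rounds=20, evaporation=0.5):
    """Ants split across paths in proportion to pheromone; short trips deposit more."""
    field = PheromoneField(evaporation)
    for path in path_lengths:
        field.deposit(path, 1.0)  # start with no preference
    for _ in range(rounds):
        total = sum(field.strength.values())
        shares = {p: field.strength[p] / total for p in path_lengths}
        field.evaporate()
        for path, length in path_lengths.items():
            # Expected number of ants on this path, each leaving 1/length.
            field.deposit(path, ants * shares[path] / length)
    return field


field = simulate({"short": 1.0, "long": 2.0})
preferred = field.best(["short", "long"])
```

No ant in this model compares the two paths. The preference for the shorter one accumulates in the field, through deposit and evaporation, which is the sense in which the intelligence lives in the environment.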

Now look at what every serious production orchestration architecture actually does, and you will see stigmergy with a write-ahead log.

Microsoft's multi-agent reference architecture formalizes a central orchestration layer (an Orchestrator, a classifier for intent routing, an agent registry, and a tool-access standard, MCP) sitting over persistent context that the agents read from and write to [4]. The agents don't hold the system's state in their heads any more than an ant holds the map of the trail. The state is externalized. State-persistence strategy guides make this explicit as a production requirement: sessions, memories, and run histories live in durable stores so a workflow can snapshot and rehydrate after a crash [5]. DBOS takes it to the logical conclusion: durable workflows whose state lives in Postgres and which auto-resume on a fresh instance after failure, replaying from persisted history [6]. That is a pheromone trail you can crash-recover. That is the colony's shared medium, made transactional.
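The checkpoint-and-replay idea can be sketched in a few lines. This is not the DBOS API or any cited product's implementation; it is a minimal illustration that uses Python's built-in `sqlite3` as a stand-in for Postgres, and the `DurableWorkflow` class, the `steps` table, and the step names are all hypothetical. Each completed step's result is committed before the workflow moves on, so a fresh instance can replay recorded steps instead of re-running them.

```python
import json
import sqlite3


class DurableWorkflow:
    """Record each step's result so a restarted run resumes where it stopped."""

    def __init__(self, conn, workflow_id):
        self.conn = conn
        self.workflow_id = workflow_id
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS steps ("
                "workflow_id TEXT, step TEXT, result TEXT, "
                "PRIMARY KEY (workflow_id, step))"
            )

    def step(self, name, fn):
        row = self.conn.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step = ?",
            (self.workflow_id, name),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # replay: reuse the recorded result
        result = fn()  # results must be JSON-serializable in this sketch
        with self.conn:  # commit the checkpoint atomically
            self.conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?)",
                (self.workflow_id, name, json.dumps(result)),
            )
        return result


conn = sqlite3.connect(":memory:")  # a real deployment would use a durable database
calls = []


def search():
    calls.append("search")
    return {"hits": 3}


def crash():
    raise RuntimeError("instance died")


first = DurableWorkflow(conn, "run-1")
first.step("search", search)
try:
    first.step("summarize", crash)  # simulated crash mid-workflow
except RuntimeError:
    pass

resumed = DurableWorkflow(conn, "run-1")  # a "fresh instance" with the same id
replayed = resumed.step("search", search)  # served from history, search() not re-run
summary = resumed.step("summarize", lambda: "done")
```

After the simulated crash, the resumed workflow gets the search result back from the store and only the unfinished step actually executes. The sketch does not cover a crash between running a step and recording it; that gap is what idempotency keys and compensation patterns exist to close.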

This is the concrete meaning of "control, but not strict." The control does not live in micromanaging each agent's next token. It lives in the substrate: the shared scratchpad, the vector layer, the event stream, the registry of who exists and what they're for [4][7]. You shape the world the agents write on, and you let their local behavior stay flexible. AWS's Strands and Bedrock orchestration patterns give this a working vocabulary: an explicit agent-graph topology the developer designs, with structured parallelism, retries, and rate controls; swarm patterns and agents-as-tools as deliberate, declared structures rather than hoped-for emergence [8][7]. You draw the graph. The graph is the nest.
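"You draw the graph" can be taken literally. The sketch below declares a small agent graph and runs it in dependency order using the standard library's `graphlib`. It does not use the Strands or Bedrock APIs; `run_graph`, the node names, and the lambda "agents" are hypothetical stand-ins, and the nodes run sequentially here even though independent branches could run in parallel.

```python
from graphlib import CycleError, TopologicalSorter


def run_graph(nodes, edges, state):
    """Run each node after its prerequisites, writing results into shared state.

    nodes: name -> fn(state) returning that node's output
    edges: name -> set of prerequisite node names
    """
    unknown = {dep for deps in edges.values() for dep in deps} - set(nodes)
    if unknown:
        raise ValueError(f"edges reference undeclared nodes: {sorted(unknown)}")
    graph = {name: edges.get(name, set()) for name in nodes}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("agent graph has a cycle") from exc
    for name in order:
        state[name] = nodes[name](state)  # each node sees only the shared state
    return order


nodes = {
    "plan": lambda s: ["ants", "bees"],
    "search_a": lambda s: f"results for {s['plan'][0]}",
    "search_b": lambda s: f"results for {s['plan'][1]}",
    "synthesize": lambda s: [s["search_a"], s["search_b"]],
}
edges = {
    "search_a": {"plan"},
    "search_b": {"plan"},
    "synthesize": {"search_a", "search_b"},
}
state = {}
order = run_graph(nodes, edges, state)
```

The topology is declared data, so it can be reviewed, versioned, and rejected before anything runs: a cycle (the ant mill, in graph form) is refused at load time instead of being discovered in production.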

And this is the part the romance skips: the substrate is not free. The same persistence guides that mandate externalized state note that distributed memory and shared context significantly increase token usage [5]. Every pheromone trail costs metabolism. Which brings me to the tension I refuse to smooth over.

#VI. The paradox I'm not going to resolve cheaply

So Anthropic shows a multi-agent system beating a single agent on research [1]. Good. But the same body of evidence shows that running many specialized agents in parallel, each with its own memory and context, significantly increases token consumption, cost, and latency [5][9][7]. Hybrid edge-cloud designs that try to claw back latency by delegating subtasks to the edge pay for that gain in coordination and decision overhead [9]. The hierarchical and swarm designs that give you resilience and coverage are the same designs that multiply your concurrency cost by running a fleet of models at once [10][7].

I'm not going to tell you the swarm always wins, because the data won't let me. More agents is not free, and more agents is not always better. Anthropic's own framing is that the multi-agent approach pays off on tasks where the work parallelizes and where the value of broader exploration exceeds the token cost: open-ended research, wide search. For a task that is fundamentally sequential, or cost-sensitive, or where a single well-prompted model with good tools suffices, the colony is the wrong instrument, and you will pay for the privilege of watching it coordinate.

This is the discipline the romance lacks: knowing when the colony is the right answer. An ant colony is a magnificent solution to foraging across a wide, uncertain landscape. It is an absurd solution to threading a needle. The architectural choice (centralized single orchestrator, hierarchical manager-and-workers, decentralized peers, or a hybrid) is not a matter of which is most impressive. The reference taxonomies are clear that enterprises end up with hybrids, and that the sober migration path is to start centralized for development speed and predictability, then introduce hierarchy or decentralization only as scale and resilience demands force your hand [11][12][10]. Centralized orchestrators give you consistent routing and clean lifecycle management; their cost is being a bottleneck and a single point of failure. You trade that away on purpose, when the job demands it, not because decentralization is fashionable.

#VII. The colony has an immune system. Yours needs one too.

A real colony does not just forage. It detects intruders, walls off contamination, removes its own dead, and rebalances labor when a section of the workforce is wiped out. The robustness is not a happy accident of having many ants; it is a set of specific, evolved feedback and defense behaviors. A production swarm needs the engineered equivalent, and this is precisely the machinery whose absence Zylos correlates with high failure rates [2].

Concretely, from the operational corpus, the immune system looks like this:

Resilience: Circuit breakers, exponential backoff, and bulkheads so a failing tool or a slow downstream agent can't cascade into a system-wide stall [13][2].
Reliability: Idempotency keys and compensation patterns so that a non-idempotent tool call (charging a card, sending an email) is safe to retry after a crash without doing the thing twice [7][14].
Persistence: Durable, replayable workflows that resume from persisted history rather than starting over, so a dead instance is a hiccup, not a data-loss event [6][14].
Governance: Supervisors, critics, and verification agents (a designed analog to the colony's quality-control behaviors), plus human-in-the-loop checkpoints at the high-stakes seams, to catch the hallucinations, deadlocks, and contradictory outputs that the failure taxonomies flag as the characteristic emergent pathologies [1][15][16][14].
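Of the four items above, the idempotency key is the easiest to show in code. The following is a minimal sketch of the general pattern, not the mechanism of any cited system: `IdempotentExecutor`, the key derivation, and the in-memory dictionary standing in for a durable store are all assumptions made for illustration.

```python
import hashlib
import json


class IdempotentExecutor:
    """Run a side effect at most once per key; repeats return the stored result."""

    def __init__(self):
        self._results = {}  # in-memory stand-in for a durable store

    @staticmethod
    def key_for(workflow_id, step, payload):
        # The same logical request always yields the same key, across retries.
        blob = json.dumps([workflow_id, step, payload], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def run(self, key, side_effect):
        if key in self._results:
            return self._results[key]  # already done: do not repeat the effect
        result = side_effect()
        self._results[key] = result
        return result


charges = []


def charge_card():
    charges.append(42)  # hypothetical non-idempotent effect
    return "charged"


executor = IdempotentExecutor()
key = executor.key_for("order-17", "charge", {"amount": 42})
first = executor.run(key, charge_card)
retry = executor.run(key, charge_card)  # a retry after a crash or timeout
```

The retry returns the recorded result and the card is charged once. A local record like this cannot protect against a crash between the effect and the write, which is why production systems generally pass the key to the downstream service and let it deduplicate.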

And none of it means anything if you can't see it. A colony senses itself through the very interaction rates that regulate it; your system senses itself through observability. The consensus from the observability sources is firm: you need OpenTelemetry-style traces, metrics, and execution logs spanning every LLM call, tool invocation, and agent step [16][17]. And your evaluation has to extend past raw accuracy to task-completion rate, tool-use correctness, latency, throughput, cost per task, and the specific emergent failure modes: deadlocks, handoff errors, contradictory outputs [18][16]. You cannot debug a swarm you cannot trace. You cannot tune interaction rates you cannot measure. The ant mill happens in the dark.
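The shape of a trace is simple enough to sketch without any dependency. The code below is not the OpenTelemetry API; it is a hypothetical, standard-library-only `Tracer` that records nested spans with a parent, a duration, a status, and free-form attributes, which is roughly the information a real tracing backend would collect for each LLM call, tool invocation, and agent step.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Minimal span recorder: every step leaves a record, including failed ones."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.spans = []   # finished spans, in completion order
        self._stack = []  # names of currently open spans

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "attrs": attrs,
            "status": "ok",
        }
        self._stack.append(name)
        start = self.clock()
        try:
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["attrs"]["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = self.clock() - start
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("task", agent="orchestrator"):
    with tracer.span("llm_call", tokens=120):
        pass  # a model call would go here
    try:
        with tracer.span("tool_call", tool="search"):
            raise TimeoutError("search timed out")
    except TimeoutError:
        pass  # the orchestrator recovers; the failed span is still recorded

failed = [s["name"] for s in tracer.spans if s["status"] == "error"]
```

With records like these, the metrics listed above (task-completion rate, tool-use correctness, latency, cost per task) become aggregations over spans rather than guesses.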

#VIII. Building the nest in the right order

If the intelligence lives in the substrate, then the engineering work is substrate work, and it has an order of operations. Pulling from the migration guidance and the buy-versus-build sources [11][19][2][14], here is the sequence I'd defend:

Specify: Prototype centralized, with simulated agents. Get the routing, the role definitions, and the shared-state shape right before you distribute anything. Specification first because specification is where the failures live [20].
Observe: Embed observability and versioning from day one, not as a later "ops" phase. A registry with semantic versions, aliases, and canary support means you can change an agent without changing the colony out from under itself [7][21][22]. Tracing from the first commit means you're never debugging blind.
Offload: Buy the foundation layers, build the domain logic. Take the models, the durable workflow engine (Temporal, Step Functions), the orchestration primitives, and the observability stack as bought infrastructure; spend your scarce design effort on the thing only you can build: the correct setup for your specific jobs [19][14][7].
Scale: Introduce hierarchy or decentralization only when scale and resilience force it. Not before. The hybrid is an earned destination, not a starting pose [11][12][10].
Defend: Draw the boundary and the defenses. Least-privilege access, data-access policies enforced at the source, PII redaction, immutable encrypted audit logs, and explicit compliance mappings for regulated work [23][24][25][26]. The colony has a wall and an immune system; a system touching real data needs both, by design.
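The registry mentioned in the Observe step (semantic versions, aliases, canary support) can be sketched as follows. This is an illustration of the idea only, not the registry from any cited architecture; `AgentRegistry`, the `stable` and `canary` alias names, and the injectable `rng` are hypothetical choices made for the example.

```python
import random


class AgentRegistry:
    """Versioned agents behind aliases, with optional canary routing."""

    def __init__(self):
        self._versions = {}  # (name, version) -> agent callable
        self._aliases = {}   # (name, alias) -> version

    def register(self, name, version, agent):
        self._versions[(name, version)] = agent

    def set_alias(self, name, alias, version):
        if (name, version) not in self._versions:
            raise KeyError(f"{name} {version} is not registered")
        self._aliases[(name, alias)] = version

    def resolve(self, name, canary_fraction=0.0, rng=random.random):
        """Return (version, agent), sending a fraction of calls to the canary."""
        alias = "stable"
        has_canary = (name, "canary") in self._aliases
        if has_canary and rng() < canary_fraction:
            alias = "canary"
        version = self._aliases[(name, alias)]
        return version, self._versions[(name, version)]


registry = AgentRegistry()
registry.register("summarizer", "1.2.0", lambda text: text[:10])
registry.register("summarizer", "1.3.0", lambda text: text[:20])
registry.set_alias("summarizer", "stable", "1.2.0")
registry.set_alias("summarizer", "canary", "1.3.0")

version, agent = registry.resolve("summarizer", canary_fraction=0.1)
```

Callers ask for a name, never a version, so promoting 1.3.0 is a single alias change and rolling it back is another. That is one way to change an agent without changing the colony out from under itself.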

Every one of these steps is substrate. None of them is "make the agents smarter." That's the tell.

#IX. What the job actually is

So let me close where the ants started us. We reach for multi-agent systems because we've seen what a colony can do: the genuine, benchmarked, not-imaginary fact that many coordinated agents can outperform one [1]. And then we make the category error: we assume the coordination is a property the agents have, rather than a property the environment imposes. We say "they'll figure it out" and walk away, and we get the ant mill: agents dutifully following each other's traces into exhaustion, every local step reasonable, the global outcome a slow, expensive death.

The honest reframe is this: agent orchestration is not the act of distributing tasks to smart things. It is the act of designing an environment, a set of constraints, and a web of feedback loops such that good collective behavior is what falls out of dumb local behavior. It is environment design. It is constraint design. It is feedback design. The failure data says so: the seams, not the synapses [20]. The architecture diagrams say so: externalized, durable, traceable state is the load-bearing wall [4][5][6]. The biology said so a hundred million years before any of us [Bonabeau et al., 1999; Gordon, 2010].

"Control, but not strict" turns out to be the whole philosophy compressed into four words. Shape the world tightly. Let the behavior inside it stay loose. Then and this is the discipline that separates an engineer from a hopeful have the restraint not to over-steer the thing you built room for, and the rigor to know when the colony was the wrong tool and a single agent would have threaded the needle.

The colony doesn't self-assemble. Nobody got it for free. You build the nest, line by line, in retries and dead-letter queues and versioned registries and traces, and then, if you built it right, you get to watch something that looks, briefly and expensively and earned, like magic.

#References

[1] Anthropic. Building a Multi-Agent Research System. https://anthropic.com/engineering/multi-agent-research-system

[2] Zylos. Graceful Degradation in AI Agent Systems. https://zylos.ai/research/2026-02-20-graceful-degradation-ai-agent-systems

[3] Claro Digital. Error Recovery in Multi-Agent AI Systems: A Guide. https://clarodigi.com/blog/error-recovery-multi-agent-ai-systems-guide

[4] Microsoft. Multi-Agent Reference Architecture. https://microsoft.github.io/multi-agent-reference-architecture/docs/reference-architecture/Reference-Architecture.html

[5] Indium. 7 State Persistence Strategies for AI Agents. https://indium.tech/blog/7-state-persistence-strategies-ai-agents-2026

[6] DBOS. Making AI Agents Fault-Tolerant on Google Cloud Run. https://dbos.dev/blog/making-ai-agents-fault-tolerant-on-google-cloud-run

[7] AWS Machine Learning Blog. Design Multi-Agent Orchestration with Reasoning Using Amazon Bedrock and Open-Source Frameworks. https://aws.amazon.com/blogs/machine-learning/design-multi-agent-orchestration-with-reasoning-using-amazon-bedrock-and-open-source-frameworks

[8] AWS Machine Learning Blog. Multi-Agent Collaboration Patterns with Strands Agents and Amazon Nova. https://aws.amazon.com/blogs/machine-learning/multi-agent-collaboration-patterns-with-strands-agents-and-amazon-nova

[9] arXiv:2504.00434. Hybrid Edge-Cloud Agent Orchestration. https://arxiv.org/html/2504.00434v1

[10] Gurusup. Agent Orchestration Patterns. https://gurusup.com/blog/agent-orchestration-patterns

[11] Coworker.ai. AI Agent Orchestration Platform. https://coworker.ai/blog/ai-agent-orchestration-platform

[12] IBM. AI Agent Orchestration. https://ibm.com/think/topics/ai-agent-orchestration

[13] AWS Machine Learning Blog. Advanced Fine-Tuning Techniques for Multi-Agent Orchestration Patterns from Amazon at Scale. https://aws.amazon.com/blogs/machine-learning/advanced-fine-tuning-techniques-for-multi-agent-orchestration-patterns-from-amazon-at-scale

[14] Temporal. Orchestrating Ambient Agents with Temporal. https://temporal.io/blog/orchestrating-ambient-agents-with-temporal

[15] arXiv:2501.06322. Multi-Agent Reasoning / Planning Survey. https://arxiv.org/html/2501.06322v1

[16] Groundcover. AI Agent Observability. https://groundcover.com/learn/observability/ai-agent-observability

[17] Redis. AI Agent Orchestration. https://redis.io/blog/ai-agent-orchestration

[18] JetBrains Blog. LLM Evaluation and AI Observability for Agent Monitoring. https://blog.jetbrains.com/pycharm/2026/05/llm-evaluation-and-ai-observability-for-agent-monitoring

[19] TrueFoundry. Multi-Agent Architecture. https://truefoundry.com/blog/multi-agent-architecture

[20] Cemri et al. (2025). Why Do Multi-Agent LLM Systems Fail? (MAST). https://arxiv.org/abs/2503.13657

[21] arXiv:2505.02133. Multi-Agent Software Engineering Pipelines. https://arxiv.org/html/2505.02133v1

[22] CIO. Why Versioning AI Agents Is the CIO's Next Big Challenge. https://cio.com/article/4056453/why-versioning-ai-agents-is-the-cios-next-big-challenge.html

[23] Google Cloud. 101 Real-World Generative AI Use Cases from Industry Leaders. https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders

[24] Obsidian Security. Security for AI Agents. https://obsidiansecurity.com/blog/security-for-ai-agents

[25] KuppingerCole. Agentic AI and Data Access Control. https://kuppingercole.com/blog/balaganski/agentic-ai-and-data-access-control

[26] MindStudio. AI Agent Security. https://mindstudio.ai/blog/ai-agent-security

Books & foundational texts:

Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press.
Gordon, D. M. (2010). Ant Encounters: Interaction Networks and Colony Behavior. Princeton University Press.
Mitchell, M. (2009). Complexity: A Guided Tour. Oxford University Press.

GenAI tools were used for research synthesis and citation formatting.