MAN+ESM is a decentralized event-driven platform that breaks the scalability-reliability-coordination triad limiting today's agentic frameworks. No central planner. No O(n²) message paths. Linear horizontal scaling to thousands of agents, with deterministic event replay and bounded memory by construction.
Every multi-agent architecture in production today optimizes for one or two of scalability, reliability, and coordination - and pays for it on the third. There is no current paradigm that holds all three at once. This is not a tooling gap; it's an architectural gap.
The numbers are conservative; they come from production teams attempting to deploy LangGraph, AutoGen, and CrewAI at enterprise scale.
Centralized orchestrators carry an O(n²) message burden. Context windows fill with irrelevant chatter. Compute scales with noise, not signal - and at 100 agents a single coordinator handles 10,000 message paths, pushing latency from seconds to minutes.
Without validation gates between stages, a single misclassified output multiplies downstream. Errors propagate before detection; trust degrades as cascades stack. Recovery becomes manual and post-hoc - and in regulated workflows, sometimes irreversible.
Synchronous consensus demands serial execution. Distributed agents wait on each other; network round-trips compound. Even ten agents can take 30+ seconds to converge - too slow for any sub-second use case, and getting worse with cluster size, not better.
The architectural question follows: can a single design satisfy all three pressures simultaneously, without trading one off against the others?
We are not arguing the existing frameworks are bad - they each get something right. We are arguing that none of them is positioned to deliver the triad simultaneously, and the reasons are structural, not implementation details.
Latency grows roughly linearly with agent count past ~50; the orchestrator becomes the bottleneck before reasoning ever starts.
Complex workflows that require shared context and validated handoffs collapse - each agent operates in its own bubble, with no consensus surface.
Cannot adapt to changing requirements or unpredictable agent behaviour - the graph is fixed at design time, the world is not.
Event-driven messaging, LLM-first routing, gossip-based peer discovery, and a small set of well-defined micro-protocols can satisfy all three pressures simultaneously - without trading any against the others.
The reliability advantage of decentralization is not anecdotal - it is mathematically demonstrable. A centralized system is bounded by its coordinator's reliability. A distributed system with N redundant paths converts moderately reliable components into highly reliable systems, by construction.
A 90%-reliable component, replicated three ways, yields a 99.9%-reliable system (1 - 0.1³ = 0.999) - without requiring perfect parts, provided the replicas fail independently. Critical for LLM-based components, where output reliability is fundamentally probabilistic.
Centralized: $\text{System Reliability} = \text{Coordinator Reliability} = p$
Distributed, with $N$ redundant paths: $\text{System Reliability} = 1 - (1 - p)^N$
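The redundancy arithmetic can be checked numerically. The following is a minimal sketch, assuming the redundant paths fail independently; the function names are illustrative and not part of any MAN+ESM API.

```python
def centralized_reliability(p: float) -> float:
    """A centralized system is only as reliable as its coordinator."""
    return p


def redundant_reliability(p: float, n: int) -> float:
    """Probability that at least one of n independent paths succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # All n paths must fail for the system to fail: (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"N={n}: {redundant_reliability(0.9, n):.5f}")
```

If failures are correlated (for example, every replica calls the same model with the same prompt), the real gain will be smaller than this idealized figure.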
A symmetric argument applies to memory: centralized state grows super-linearly with cluster complexity; distributed state grows linearly with agent count.
Two evaluations against the three dominant frameworks. First, task-level resolution rate on SWE-bench and T-bench. Second, internal infrastructure measurements - latency, memory, discovery - as the substrate scales from 10 to 1,000 agents.
Two task suites, ten tasks each. SWE-bench (Astropy & Django bug fixes) and T-bench (Fix Git, COBOL Modernization, plus medium-complexity tasks). Same model, same prompts, same problems. We measure task resolution rate.
Event-Sourced Memory isn't an optimization - it's a capability prerequisite. Without persistent, replayable working state, multi-step T-bench tasks fall over at the boundary between agent invocations. With ESM, working context survives those boundaries and tasks complete deterministically. The same architectural property that makes the system auditable makes it capable.
SWE-bench (10 tasks):

| Framework | Resolved | Rate |
|---|---|---|
| MAN+ESM Adya | 5 / 10 | 50% |
| MAN Adya | 5 / 10 | 50% |
| AutoGen | 2 / 10 | 20% |
| LangGraph | 1 / 10 | 10% |

T-bench (10 tasks):

| Framework | Resolved | Rate |
|---|---|---|
| MAN+ESM Adya | 10 / 10 | 100% |
| MAN Adya | 4 / 10 | 40% |
| AutoGen | 1 / 10 | 10% |
| LangGraph | 0 / 10 | 0% |
AutoGen is fastest but resolves 1 in 10. MAN+ESM is slowest by ~70 seconds and resolves 10 in 10. The trade-off favours reliability at the margin we care about.
Each suite runs 10 tasks - small N by design, since the goal was a same-model, same-prompt comparison across four frameworks rather than a full bench sweep. Larger sweeps with the full SWE-bench Lite and T-bench task universes will publish alongside the public release. The relative ordering of frameworks is consistent across both suites.
MAN+ESM stays in the 40-250 ms band across the full sweep; competing frameworks slow by well over an order of magnitude (roughly 45x at p50) between 10 and 100 agents and crash before reaching 500.
| Metric | MAN+ESM (Adya) | LangGraph (centralized) | AutoGen (isolated) | CrewAI (graph) |
|---|---|---|---|---|
| Latency & Coordination | ||||
| p50 latency - 10 agents | 42 ms | 180 ms | 320 ms | 410 ms |
| p50 latency - 100 agents | 95 ms | 8.2 s | 14 s | OOM |
| p50 latency - 1,000 agents | 240 ms | timeout | timeout | - |
| Coordination overhead | 3.8% | 41% | 28% | 36% |
| Memory Footprint | ||||
| Memory per agent | 18 MB | 340 MB | 220 MB | 280 MB |
| Context bloat factor | O(1) | O(n²) | O(n) | O(n²) |
| Discovery & Adaptation | ||||
| Capability discovery - 100 nodes | 280 ms gossip | 4.2 s | no native | redeploy |
| Add capability without restart | ||||
| Resilience | ||||
| Single-node failure recoverable | partial | |||
| Deterministic event replay | ||||
| Scale Ceiling | ||||
| Max stable cluster size | 5,000+ | ~50 | ~80 | ~30 |
Latency, memory, discovery, and cluster-size figures above are order-of-magnitude estimates from internal load tests against framework reference implementations on equivalent hardware (same model, same prompts, same task suite). Full methodology, hardware spec, prompt corpus, and reproducible notebooks publish alongside the paper. We will update this table with the audited values when the paper drops; the relative shape of the comparison is stable across our internal runs.
A single request enters the bus, gets routed by a Policy LLM, validated by a Critic, executed against tools, and emits signed receipts - all without a central orchestrator holding state. Every stage is a Kafka topic; every agent is a subscriber.
Agents organize across three specialized layers - but the layering is a separation of concern, not an upward escalation of authority. Any agent at any layer can make any of three decisions at runtime: decompose, execute, or forward.
Task decomposition, workflow orchestration, high-level coordination strategy. Stateless - every invocation operates on the current event context.
Domain-specific work - code, data analysis, document parsing, validation. Subscribes to capability topics. Parallel by default, serial only when explicitly required.
External system integration, format transformation between event schemas and external APIs, security perimeter. Legacy systems integrate as Gateway Agents wrapping their interfaces.
Every agent in MAN+ESM exposes the same three primitive capabilities. The agent inspects the incoming event, consults its capability registry and the live peer mesh, and selects which capability to invoke - dynamically, every time.
This is the structural difference from frameworks where roles are fixed at initialization (CrewAI) or bound to graph nodes (LangGraph): MAN+ESM agents adapt their behaviour to the task in front of them.
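The per-event choice between the three primitives can be sketched as a small decision function. This is a rule-based stand-in under stated assumptions: the actual system routes with a Policy LLM, and the `Task`, `AgentView`, and `decide` names below are hypothetical, not SDK types.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    DECOMPOSE = "decompose"
    EXECUTE = "execute"
    FORWARD = "forward"


@dataclass
class Task:
    capability: str  # capability the task needs, e.g. "csv.parse"
    subtasks: list[str] = field(default_factory=list)  # hypothetical known sub-capabilities


@dataclass
class AgentView:
    local_capabilities: set[str]  # what this agent can do itself
    peer_capabilities: dict[str, set[str]]  # peer id -> capabilities learned via discovery


def decide(task: Task, view: AgentView) -> tuple[Decision, list[str]]:
    """Pick one primitive for an incoming task.

    Returns the decision plus its targets: sub-capabilities for DECOMPOSE,
    candidate peer ids for FORWARD, or an empty list for EXECUTE.
    """
    if task.capability in view.local_capabilities:
        return Decision.EXECUTE, []
    if task.subtasks:
        return Decision.DECOMPOSE, list(task.subtasks)
    peers = sorted(
        peer for peer, caps in view.peer_capabilities.items()
        if task.capability in caps
    )
    if peers:
        return Decision.FORWARD, peers
    raise LookupError(f"no local or peer capability for {task.capability!r}")
```

The point of the sketch is that the decision is re-evaluated for every event against the current view of the mesh, rather than being fixed when the agent is created.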
Centralized registries put discovery latency on a single machine's critical path; the registry becomes the bottleneck before any reasoning starts. Gossip puts it on the network - and the network is faster than the registry, by design.
A gossip wave reaches the entire cluster in roughly log_k(n) rounds, where k is the fanout and each round is bounded by network round-trip time, not by a single coordinator's compute. At 100 agents with a fanout of 4, that's ~3.3 hops. At 1,000 agents, ~5 hops. At 10,000, ~6.6 hops. The registry-based alternative scales linearly with cluster size and with query volume, which compounds.
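The hop counts quoted above follow from the logarithm directly. A minimal sketch, assuming an idealized gossip model in which coverage grows by a factor of the fanout every round (real gossip has duplicate deliveries, so treat this as a lower bound); `gossip_rounds` is an illustrative name.

```python
import math


def gossip_rounds(n_agents: int, fanout: int) -> float:
    """Idealized number of rounds for a gossip wave to reach n_agents.

    Assumes each informed agent reaches `fanout` new peers per round,
    so the informed set grows by a factor of `fanout` each round.
    """
    if n_agents < 1 or fanout < 2:
        raise ValueError("need n_agents >= 1 and fanout >= 2")
    return math.log(n_agents) / math.log(fanout)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} agents, fanout 4: ~{gossip_rounds(n, 4):.1f} rounds")
```

Multiplying the cluster size by ten adds only about 1.7 rounds at a fanout of 4, which is the property the comparison with a linear-cost registry relies on.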
Each role is a stateless service subscribing to its capability topic. Roles compose via the bus; complex workflows are expressed by which roles publish and which subscribe - not by hand-wired graphs.
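Composition through publish and subscribe can be illustrated without Kafka. The sketch below is a synchronous in-memory stand-in, not the MAN+ESM runtime: the `InMemoryBus` class, the dict-shaped events, and the topic names are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable

Event = dict  # hypothetical event shape: {"topic": str, "payload": ...}
Handler = Callable[[Event], list]


class InMemoryBus:
    """Synchronous topic bus: handlers subscribe to topics and may emit follow-up events."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, event: Event) -> list[Event]:
        """Deliver an event and everything it triggers; return events in delivery order."""
        delivered: list[Event] = []
        queue = deque([event])
        while queue:
            current = queue.popleft()
            delivered.append(current)
            for handler in self._subscribers.get(current["topic"], []):
                queue.extend(handler(current))
        return delivered


def proposer(event: Event) -> list[Event]:
    # Stateless role: turns a new task into a draft.
    return [{"topic": "draft.ready", "payload": f"draft for: {event['payload']}"}]


def critic(event: Event) -> list[Event]:
    # Stateless role: approves any non-empty draft.
    verdict = "approved" if event["payload"] else "needs-fix"
    return [{"topic": "review.done", "payload": verdict}]


if __name__ == "__main__":
    bus = InMemoryBus()
    bus.subscribe("task.new", proposer)
    bus.subscribe("draft.ready", critic)
    for delivered_event in bus.publish({"topic": "task.new", "payload": "summarize Q3"}):
        print(delivered_event)
```

The workflow here exists only as the pair of subscriptions; neither role knows about the other. Note that this toy bus does not guard against handlers that publish in a cycle.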
Workflows do not need bespoke choreography. They compose from a small, well-defined set of standardized interaction patterns - each with formal guarantees, replay semantics, and consensus rules.
Iterative refinement loop. Proposer drafts; Critic evaluates against quality criteria; needs-fix feedback loops back. Quality gates enforced at each iteration.
Parallel sub-task execution with dependency-aware aggregation. Tasks fan out to capability topics; results merge back through deterministic reducers.
Decentralized bidding for dynamic resource allocation. Available agents bid; selection by best score, lowest cost, or fastest expected completion.
K-of-N validator approvals for high-risk decisions. Signed votes, immutable audit trails, deterministic tie-breaking. Configurable by risk tier.
Human-in-the-loop integration for governance gates. Confidence below threshold or risk above ceiling routes to a human approver before progression.
Proof-challenge mechanism for integrity checks. Independent agents replay computations and challenge results; disagreements trigger arbitration.
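Two of the patterns above, bidding and K-of-N approval, reduce to small deterministic rules. The sketch below is illustrative only: vote signing and audit logging are omitted, the ranking order (score, then cost, then agent id) is an assumption, and `Bid`, `select_winner`, and `quorum_status` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    agent_id: str
    score: float  # higher is better
    cost: float   # lower is better


def select_winner(bids: list[Bid]) -> Bid:
    """Best score wins; ties go to lower cost, then to the smaller agent id."""
    if not bids:
        raise ValueError("no bids received")
    return min(bids, key=lambda b: (-b.score, b.cost, b.agent_id))


def quorum_status(votes: dict[str, bool], k: int, n: int) -> str:
    """K-of-N approval over the votes received so far.

    Returns 'approved' once k validators approve, 'rejected' once k approvals
    can no longer be reached, and 'pending' otherwise.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if len(votes) > n:
        raise ValueError("more votes than validators")
    approvals = sum(1 for approved in votes.values() if approved)
    rejections = len(votes) - approvals
    if approvals >= k:
        return "approved"
    if rejections > n - k:
        return "rejected"
    return "pending"
```

Because both functions depend only on their inputs, replaying the same bids or votes always yields the same outcome, which is what makes the decision auditable after the fact.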
The "memory" half of MAN+ESM is not a vector store with overwrites. It's an append-only event log: every agent action, observation, and decision is a signed event. State is derived by replaying events through projections - which means state is reproducible, debuggable, and auditable, by construction.
Two memory tiers cooperate: episodic for long-term experience and learned preferences; scratchpad for task-local working context. Both are projections over the same event log.
Bug? Replay the event log up to the failure. Every state, every decision, exactly reproducible.
The log is the audit trail. Compliance teams query it directly; nothing is reconstructed after the fact.
Hash-chained events; each entry signed. Modifying history requires forging a chain - detectable on the next read.
Projections snapshot. Old events compact. Memory growth stays linear in active work, not lifetime work.
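The replay and tamper-detection properties can be demonstrated with a toy log. This is a minimal sketch, assuming SHA-256 hash chaining and a key/value projection; per-entry signatures, snapshots, and compaction are left out, and the `EventLog` class is illustrative rather than the ESM implementation.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    index: int
    payload: dict
    prev_hash: str
    entry_hash: str


def _digest(index: int, payload: dict, prev_hash: str) -> str:
    body = json.dumps({"index": index, "payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class EventLog:
    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self) -> None:
        self.entries: list[LogEntry] = []

    def append(self, payload: dict) -> LogEntry:
        """Append an event whose hash covers its payload and the previous entry's hash."""
        prev_hash = self.entries[-1].entry_hash if self.entries else self.GENESIS
        index = len(self.entries)
        entry = LogEntry(index, payload, prev_hash, _digest(index, payload, prev_hash))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = self.GENESIS
        for i, entry in enumerate(self.entries):
            if entry.index != i or entry.prev_hash != prev_hash:
                return False
            if entry.entry_hash != _digest(entry.index, entry.payload, entry.prev_hash):
                return False
            prev_hash = entry.entry_hash
        return True

    def replay(self, upto: Optional[int] = None) -> dict:
        """Project state by folding events in order; `upto` stops before that index."""
        state: dict = {}
        for entry in self.entries[:upto]:
            state[entry.payload["key"]] = entry.payload["value"]
        return state
```

Calling `replay(upto=k)` reconstructs the state exactly as it stood before event `k`, which is the debugging workflow described above, and `verify()` returns `False` as soon as any stored payload has been altered.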
Familiar Python ergonomics - declarative agents, declarative workflows. Same SDK shape researchers already know; the decentralized event-driven runtime underneath is the only thing that's actually new.

    from agent_sdk import Agent, Event, EventType


    class CSVPerceptor(Agent):
        capability = "csv.perceive"

        async def handle_event(self, event: Event) -> list[Event]:
            # parse user query -> structured intent
            intent = await self.parse(event.payload)
            # emit downstream - routing handled by the bus
            return [Event(
                event_type=EventType.PERCEIVED,
                task_id=event.task_id,
                payload={"intent": intent},
            )]


    import asyncio

    from agent_sdk import Workflow

    # declarative - no graph wiring, just steps
    workflow = Workflow(
        name="csv_analysis",
        steps=[
            "perceptor", "decomposer",
            "proposer", "safety",
            "executor", "critic",
        ],
    )


    async def main() -> None:
        # single call - routing & fan-out automatic
        # (await is only valid inside a coroutine, hence the wrapper)
        result = await workflow.execute(
            task_id="task-123",
            payload={"query": "Analyze Q3 sales by region"},
        )
        print(result)


    asyncio.run(main())

Mock agents and a synchronous event bus for unit and integration tests - no Kafka required for local dev.
Topic creation, ACL setup, and discovery announcement happen automatically when an agent boots.
Inspect topics, replay events, scaffold agents, run evals - all from man-cli.
The SDK, ESLL evaluation lab, and CLI are scheduled for public release in Q3 2026. Researchers and partners can request early access via the notify form.
What's already in production, what's in flight, and what comes next - laid out honestly, with quarterly granularity.
MAN+ESM stands on a long tradition: distributed systems literature, multi-agent protocols, event-sourcing patterns. The novelty is the synthesis, not the parts.
Full architectural treatment - three-layer model, gossip protocol, Flow ID observability, ESM, micro-protocols, reliability mathematics, comparative analysis. The benchmark suite and reproducible notebooks publish alongside the public release.
SDK, ESLL eval lab, CLI, reference agents, and benchmark notebooks. Public repo opens alongside the paper release.
Hardware specs, prompt corpus, framework reference implementations, latency & memory measurement scripts. Same task suite for every framework.
We are publishing the architecture, the methodology, and the measurements precisely because we want them stress-tested. If you find a flaw in the design or a fragility in the numbers, that is the most useful conversation we can have right now.