Research - Multi-Agent Network + Event-Sourced Memory

Multi-agent reasoning, without the orchestrator.

MAN+ESM is a decentralized event-driven platform that breaks the scalability-reliability-coordination triad limiting today's agentic frameworks. No central planner. No O(n^2) message paths. Linear horizontal scaling to thousands of agents, with deterministic event replay and bounded memory by construction.

Read the preprint View the benchmarks

Phases 1-3 shipped - Phases 4-7 in flight (Q1-Q2 2026) - 9 agent roles - 6 micro-protocols - Kafka backbone

Event-driven - no orchestrator
Gossip-based discovery
Linear scaling 1k+ agents
Sub-second coordination
Deterministic replay
Bounded memory
Schema-validated boundaries

The problem

Three pressures, no winner.

Every multi-agent architecture in production today optimizes for one or two of scalability, reliability, and coordination - and pays for it on the third. There is no current paradigm that holds all three at once. This is not a tooling gap; it's an architectural gap.

The figures in this section are conservative; they come from production teams attempting to deploy LangGraph, AutoGen, and CrewAI at enterprise scale.

Vertex 1 - Information noise

Scalability bottlenecks

Centralized orchestrators carry an O(n^2) message burden. Context windows fill with irrelevant chatter. Compute scales with noise, not signal - and at 100 agents a single coordinator handles 10,000 message paths, pushing latency from seconds to minutes.
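The 10,000 figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the O(n^2) model stated above, counting n x n sender/receiver combinations relayed by one coordinator; the function name is illustrative and not taken from any framework.

```python
def coordinator_paths(n_agents: int) -> int:
    """Message paths under the O(n^2) model: n senders times n receivers."""
    return n_agents * n_agents


for n in (10, 100, 1_000):
    print(f"{n:>5} agents -> {coordinator_paths(n):>9,} paths")
```

Under this model a tenfold increase in agents costs a hundredfold increase in paths, which is why the coordinator saturates long before the agents do.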

Vertex 2 - Reliability crisis

Hallucination cascades

Without validation gates between stages, a single misclassified output multiplies downstream. Errors propagate before detection; trust degrades as cascades stack. Recovery becomes manual and post-hoc - and in regulated workflows, sometimes irreversible.

Vertex 3 - Consensus paradox

Coordination latency

Synchronous consensus demands serial execution. Distributed agents wait on each other; network round-trips compound. Even ten agents can take 30+ seconds to converge - too slow for any sub-second use case, and getting worse with cluster size, not better.

The architectural question follows: can a single design satisfy all three pressures simultaneously, without trading one off against the others?

The architectural gap

Why the field's three dominant paradigms each fail

We are not arguing the existing frameworks are bad - they each get something right. We are arguing that none of them is positioned to deliver the triad simultaneously, and the reasons are structural, not implementation details.

LangGraph

Centralized orchestration

Strengths

Simple coordination model
Predictable execution flow
Easy debugging on small graphs

Weaknesses

Single point of failure (orchestrator)
O(n^2) message paths through coordinator
No horizontal scaling beyond one runtime

Why it fails

Latency grows faster than linearly with agent count past ~50; the orchestrator becomes the bottleneck before reasoning ever starts.

AutoGen

Isolated agents

Strengths

True horizontal scaling at the agent level
High individual fault tolerance
Embarrassingly parallel for independent tasks

Weaknesses

No shared goals or canonical context
Inconsistent memory across instances
Duplicated work, no coordination protocol

Why it fails

Complex workflows that require shared context and validated handoffs collapse - each agent operates in its own bubble, with no consensus surface.

CrewAI

Disconnected graphs

Strengths

Defined workflows, role-based clarity
Some parallelism within a crew
Distributed processing for known graphs

Weaknesses

Static, rigid graph structure
No real-time consensus mechanism
Breaks on a single agent failure

Why it fails

Cannot adapt to changing requirements or unpredictable agent behaviour - the graph is fixed at design time, the world is not.

The hypothesis

Event-driven messaging, LLM-first routing, gossip-based peer discovery, and a small set of well-defined micro-protocols can satisfy all three pressures simultaneously - without trading any against the others.

This is the bet. The rest of this page is the evidence.

The reliability math

Why distributed redundancy compounds rather than averages.

The reliability advantage of decentralization is not anecdotal - it is mathematically demonstrable. A centralized system is bounded by its coordinator's reliability. A distributed system with N redundant paths converts moderately reliable components into highly reliable systems, by construction.

A 90%-reliable component, replicated three ways, yields a 99.9%-reliable system, provided the replicas fail independently - without requiring perfect parts. This is critical for LLM-based components, where output reliability is fundamentally probabilistic.

Centralized - capped

System Reliability
  = Coordinator Reliability (p)

at p = 0.90

90%

Distributed - compounds

System Reliability
  = 1 - (1 - p)^N

at p = 0.90, N = 3

99.9%
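The two formulas above can be evaluated directly. The sketch below assumes that replicas fail independently, which is what the 1 - (1 - p)^N form requires; correlated failures (for example, the same model failing the same way on the same prompt) would reduce the gain. Function names are illustrative.

```python
def centralized_reliability(p: float) -> float:
    """A centralized system is capped by its coordinator's reliability."""
    return p


def distributed_reliability(p: float, n: int) -> float:
    """Probability that at least one of n independent replicas succeeds."""
    return 1.0 - (1.0 - p) ** n


for n in (1, 2, 3, 5):
    print(f"N={n}: {distributed_reliability(0.90, n):.5f}")
```

At p = 0.90 the first replica contributes the most; each further replica removes another factor of ten from the failure probability.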

A symmetric argument applies to memory: centralized state grows super-linearly with cluster complexity; distributed state grows linearly with agent count.

Benchmarks

The numbers, side by side.

Two evaluations against the three dominant frameworks. First, task-level resolution rate on SWE-bench and T-bench. Second, internal infrastructure measurements - latency, memory, discovery - as the substrate scales from 10 to 1,000 agents.

01 / Task-level evaluation

Where the memory layer earns its name.

Two task suites, ten tasks each. SWE-bench (Astropy & Django bug fixes) and T-bench (Fix Git, COBOL Modernization, plus medium-complexity tasks). Same model, same prompts, same problems. We measure task resolution rate.

The ESM uplift - T-bench

40% -> 100%

MAN alone resolves 4 of 10 tasks. With ESM as the memory substrate, MAN+ESM resolves all 10.

Event-Sourced Memory isn't an optimization - it's a capability prerequisite. Without persistent, replayable working state, multi-step T-bench tasks fall over at the boundary between agent invocations. With ESM, working context survives those boundaries and tasks complete deterministically. The same architectural property that makes the system auditable makes it capable.

SWE-bench

Astropy - Django bug fixes

10 tasks

Framework	Resolved	Rate
MAN+ESM (Adya)	5 / 10	50%
MAN (Adya)	5 / 10	50%
AutoGen	2 / 10	20%
LangGraph	1 / 10	10%

MAN frameworks resolve half the suite; the closest competitor lands one in five.

T-bench

Fix Git - COBOL - mixed complexity

10 tasks

Framework	Resolved	Rate
MAN+ESM (Adya)	10 / 10	100%
MAN (Adya)	4 / 10	40%
AutoGen	1 / 10	10%
LangGraph	0 / 10	0%

MAN+ESM clears the suite; LangGraph resolves nothing.

Mean execution time

T-bench - per-task wall clock

Framework	Mean time per task
MAN+ESM	220.0 s
MAN	200.0 s
LangGraph	180.0 s
AutoGen	151.3 s

AutoGen is fastest but resolves 1 in 10. MAN+ESM is slowest, roughly 70 seconds per task behind AutoGen, and resolves 10 in 10. The trade-off favours reliability at the margin we care about.
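One way to read the trade-off is wall-clock time per resolved task. This derived metric is not reported in the source; the sketch assumes the mean per-task time applies to all ten T-bench tasks whether or not they were resolved, and uses only the figures quoted above.

```python
import math

# (mean seconds per task, tasks resolved out of 10), from the T-bench figures above
RESULTS = {
    "MAN+ESM": (220.0, 10),
    "MAN": (200.0, 4),
    "LangGraph": (180.0, 0),
    "AutoGen": (151.3, 1),
}


def seconds_per_resolved_task(mean_s: float, resolved: int, total: int = 10) -> float:
    """Total suite wall clock divided by the number of tasks actually resolved."""
    if resolved == 0:
        return math.inf  # nothing resolved, so no finite cost per success
    return mean_s * total / resolved


for name, (mean_s, resolved) in RESULTS.items():
    print(f"{name:<10} {seconds_per_resolved_task(mean_s, resolved):>8.1f} s per resolved task")
```

On this reading the slowest framework per task is the cheapest per successful outcome, since every second it spends ends in a resolved task.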

Scope

Each suite runs 10 tasks - small N by design, since the goal was a same-model, same-prompt comparison across four frameworks rather than a full bench sweep. Larger sweeps with the full SWE-bench Lite and T-bench task universes will publish alongside the public release. The relative ordering of frameworks is consistent across both suites.

02 / Infrastructure & scaling (preliminary internal measurements)

~85x faster - p50 coordination latency at 100 agents vs. LangGraph

~18x leaner - Memory per agent vs. graph-based frameworks

~100x larger - Maximum stable cluster size, agents under SLA

[Figure: p50 coordination latency vs. cluster size. Y-axis on a log scale, lower is better. Series: MAN+ESM, LangGraph, AutoGen, CrewAI.]

MAN+ESM stays in the 40-250 ms band across the full sweep; competing frameworks slow by roughly 45x between 10 and 100 agents (180 ms to 8.2 s for LangGraph, 320 ms to 14 s for AutoGen) and crash before reaching 500.

Metric	MAN+ESM (Adya)	LangGraph (centralized)	AutoGen (isolated)	CrewAI (graph)
Latency & Coordination
p50 latency - 10 agents	42 ms	180 ms	320 ms	410 ms
p50 latency - 100 agents	95 ms	8.2 s	14 s	OOM
p50 latency - 1,000 agents	240 ms	timeout	timeout	-
Coordination overhead	3.8%	41%	28%	36%
Memory Footprint
Memory per agent	18 MB	340 MB	220 MB	280 MB
Context bloat factor	O(1)	O(n^2)	O(n)	O(n^2)
Discovery & Adaptation
Capability discovery - 100 nodes	280 ms gossip	4.2 s	no native	redeploy
Add capability without restart
Resilience
Single-node failure recoverable			partial
Deterministic event replay
Scale Ceiling
Max stable cluster size	5,000+	~50	~80	~30

Preliminary numbers - methodology forthcoming

Latency, memory, discovery, and cluster-size figures above are order-of-magnitude estimates from internal load tests against framework reference implementations on equivalent hardware (same model, same prompts, same task suite). Full methodology, hardware spec, prompt corpus, and reproducible notebooks publish alongside the paper. We will update this table with the audited values when the paper drops; the relative shape of the comparison is stable across our internal runs.

Architecture

The intelligence loop, end to end.

A single request enters the bus, gets routed by a Policy LLM, validated by a Critic, executed against tools, and emits signed receipts - all without a central orchestrator holding state. Every stage is a Kafka topic; every agent is a subscriber.

Principle 1

Decentralization

Independent stateless agents, capability-announced peer discovery, no single point of failure.

Principle 2

Contracts first

Avro schema registry as canonical source. Versioned, type-safe inter-agent messages.

Principle 3

Fail-closed rails

Schema validation at every boundary. ACL enforcement. SLA & budget gates.
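A fail-closed boundary can be sketched as a validator that rejects anything it cannot positively confirm. The platform names Avro as its schema source; the example below deliberately uses a plain dictionary as a stand-in, and `TASK_SCHEMA`, `BoundaryRejected`, and `validate_fail_closed` are hypothetical names for illustration only.

```python
from typing import Any

# Hypothetical stand-in for a schema registry entry (the source names Avro for this role).
TASK_SCHEMA: dict[str, type] = {"task_id": str, "payload": dict, "budget_usd": float}


class BoundaryRejected(Exception):
    """Raised when a message fails validation; nothing is forwarded."""


def validate_fail_closed(message: dict[str, Any], schema: dict[str, type]) -> dict[str, Any]:
    """Return the message only if every field is present, known, and correctly typed."""
    missing = schema.keys() - message.keys()
    unknown = message.keys() - schema.keys()
    if missing or unknown:
        raise BoundaryRejected(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for field, expected in schema.items():
        if not isinstance(message[field], expected):
            raise BoundaryRejected(f"{field}: expected {expected.__name__}")
    return message
```

The design choice is that the default outcome is rejection: unknown fields are treated as errors rather than silently passed downstream.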

Principle 4

Decentralized reasoning

Policy LLM for routing, Critic LLM for validation, confidence-thresholded escalation.

Principle 5

Production-grade lifecycle

OpenTelemetry, eval harness, HITL, governance, chargeback, fault tolerance.

The coordination model

Three layers, no hierarchy.

Agents organize across three specialized layers - but the layering is a separation of concerns, not an upward escalation of authority. Any agent at any layer can make any of three decisions at runtime: decompose, execute, or forward.

Layer 1

Decision Layer

Primary agents

Task decomposition, workflow orchestration, high-level coordination strategy. Stateless - every invocation operates on the current event context.

Owns

decomposition, routing strategy, inter-layer comms, global optimization

Layer 2

Execution Layer

Specialist agents

Domain-specific work - code, data analysis, document parsing, validation. Subscribes to capability topics. Parallel by default, serial only when explicitly required.

Owns

domain processing, parallel execution, result synthesis, validation

Layer 3

Interface Layer

Gateway agents

External system integration, format transformation between event schemas and external APIs, security perimeter. Legacy systems integrate as Gateway Agents wrapping their interfaces.

Owns

auth boundaries, format translation, perimeter security, legacy adaptation

The runtime decision

Roles are not assigned. Agents choose at runtime.

Every agent in MAN+ESM exposes the same three primitive capabilities. The agent inspects the incoming event, consults its capability registry and the live peer mesh, and selects which capability to invoke - dynamically, every time.

decompose

break into sub-tasks

execute

handle directly

forward

route to a more capable peer

This is the structural difference from frameworks where roles are fixed at initialization (CrewAI) or bound to graph nodes (LangGraph): MAN+ESM agents adapt their behaviour to the task in front of them.
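The three-way choice can be illustrated with a small rule-based stand-in. The source describes LLM-first routing through a Policy LLM, so the fixed rules below are an assumption made purely to show the shape of the decision; `IncomingEvent`, `subtask_count`, and `choose` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    DECOMPOSE = "decompose"
    EXECUTE = "execute"
    FORWARD = "forward"


@dataclass(frozen=True)
class IncomingEvent:
    required_capability: str
    subtask_count: int  # hypothetical hint: how many separable parts the task has


def choose(event: IncomingEvent, own: set[str], peers: set[str]) -> Decision:
    """Pick one of the three primitives for this event, evaluated fresh on every call."""
    if event.subtask_count > 1:
        return Decision.DECOMPOSE
    if event.required_capability in own:
        return Decision.EXECUTE
    if event.required_capability in peers:
        return Decision.FORWARD
    raise LookupError(f"no agent announces {event.required_capability!r}")
```

Because the function takes the live peer set as an argument, the same agent can execute a task now and forward an identical one later if the mesh changes.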

Discovery - the gossip protocol

How agents find each other in O(log n) hops.

Centralized registries put discovery latency on a single machine's critical path; the registry becomes the bottleneck before any reasoning starts. Gossip puts it on the network - and the network is faster than the registry, by design.

The competition

Centralized registry lookup

~4.2 s @ 100

Every discovery query funnels through one machine

Registry crash = entire fleet loses peer awareness

Latency grows linearly with cluster size

Our approach

Gossip-based propagation

280 ms @ 100

Capability announcements propagate peer-to-peer in log time

No single failure point - kill any node, the mesh reroutes

TTL-based expiry - stale capabilities self-evict
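TTL-based self-eviction can be sketched as a local table whose entries lapse unless a peer re-announces them. This is an assumption-laden illustration, not the platform's data structure: `CapabilityTable` and its methods are hypothetical, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable


class CapabilityTable:
    """Local view of peer capabilities; entries expire unless re-announced."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[tuple[str, str], float] = {}

    def announce(self, peer_id: str, capability: str) -> None:
        """Record or refresh an announcement; refreshing extends the deadline."""
        self._expires_at[(peer_id, capability)] = self._clock() + self._ttl

    def peers_for(self, capability: str) -> list[str]:
        """Return live peers for a capability, evicting stale entries on read."""
        now = self._clock()
        self._expires_at = {k: t for k, t in self._expires_at.items() if t > now}
        return sorted(p for (p, c) in self._expires_at if c == capability)
```

A crashed peer needs no explicit deregistration here: it simply stops announcing, and its entries age out.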

The math

Why O(log n) wins at scale

A gossip wave reaches the entire cluster in roughly log_k(n) rounds, where each round is bounded by network round-trip time, not by a single coordinator's compute. At 100 agents with a fanout of 4, that's ~3.3 hops. At 1,000 agents, ~5 hops. At 10,000, ~6.6 hops. The registry-based alternative scales linearly with cluster size and with query volume, which compounds.
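The hop counts quoted above follow from the log_k(n) approximation and can be reproduced in a few lines. The sketch is idealized: it assumes every message reaches a not-yet-informed peer, whereas real epidemic protocols need somewhat more rounds because some messages land on peers that already have the update. The function name is illustrative.

```python
import math


def gossip_rounds(n_agents: int, fanout: int) -> float:
    """Idealized rounds for a gossip wave to cover n agents: log base fanout of n."""
    return math.log(n_agents) / math.log(fanout)


for n in (100, 1_000, 10_000):
    print(f"{n:>6} agents, fanout 4 -> {gossip_rounds(n, 4):.1f} rounds")
```

Each tenfold increase in cluster size adds about 1.7 rounds at a fanout of 4, which is the practical meaning of logarithmic growth here.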

The capability palette

Nine specialized agent roles, collaborating autonomously.

Each role is a stateless service subscribing to its capability topic. Roles compose via the bus; complex workflows are expressed by which roles publish and which subscribe - not by hand-wired graphs.

Perceptor

Ingestion & parsing

Natural-language understanding
Multimodal content extraction
Data structure discovery

Decomposer

Task planning

Complex task breakdown
Parallel sub-task generation
Self-consistency (K-sampling)

Proposer

Content generation

Code generation & content creation
Solution proposal
Memory-enhanced generation

Critic

Validation & quality

Output validation & assessment
Needs-fix loops with feedback
Confidence scoring

Safety

Governance & compliance

PII detection & redaction
Tool allowlist enforcement
Regional routing compliance

Executor

Tool execution

Secure code execution (sandboxed)
API calls with signed receipts
Side-effect verification

Scheduler

Resource management

Budget & quota enforcement
Priority calculation
Rate limiting & fairness

Observer

Telemetry & monitoring

Timeline reconstruction
Anomaly detection
Performance & cost tracking

Ledger

Audit & provenance

Immutable append-only log
Lineage queries
Receipt verification

Reasoning roles, Governance roles, Execution roles, Operations roles

Micro-protocols

Six interaction patterns, composable everywhere.

Workflows do not need bespoke choreography. They compose from a small, well-defined set of standardized interaction patterns - each with formal guarantees, replay semantics, and consensus rules.

iterative

Propose-Critique

Iterative refinement loop. Proposer drafts; Critic evaluates against quality criteria; needs-fix feedback loops back. Quality gates enforced at each iteration.

Use - code & content generation
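The Propose-Critique loop can be sketched as a bounded retry with a quality gate. The callables stand in for the Proposer and Critic roles; the 0.9 threshold, the three-round cap, and the toy proposer and critic are assumptions made for illustration.

```python
from typing import Callable, Optional


def propose_critique(
    propose: Callable[[Optional[str]], str],
    critique: Callable[[str], tuple[float, str]],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> tuple[str, float, int]:
    """Loop until the critic's score clears the gate or the rounds run out."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        draft = propose(feedback)
        score, feedback = critique(draft)
        if score >= threshold:
            return draft, score, round_no
    raise RuntimeError(f"quality gate not met after {max_rounds} rounds")


def toy_propose(feedback: Optional[str]) -> str:
    # First draft has a bug; any feedback triggers the corrected draft.
    return "def add(a, b): return a + b" if feedback else "def add(a, b): return a - b"


def toy_critique(draft: str) -> tuple[float, str]:
    return (0.95, "") if "a + b" in draft else (0.2, "add() subtracts; use +")


draft, score, rounds = propose_critique(toy_propose, toy_critique)
```

Raising on exhaustion, rather than returning the last draft, keeps the loop consistent with a fail-closed posture.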

parallel

Decompose-Merge

Parallel sub-task execution with dependency-aware aggregation. Tasks fan out to capability topics; results merge back through deterministic reducers.

Use - multi-step reasoning
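Fan-out with a deterministic reducer can be sketched with `asyncio.gather`. The example ignores inter-task dependencies, which the pattern above also handles, and the sub-task body is a placeholder; function names and the toy result are assumptions.

```python
import asyncio


async def run_subtask(region: str) -> tuple[str, int]:
    await asyncio.sleep(0)  # placeholder for real work on a capability topic
    return region, len(region)  # toy result


def merge(results: list[tuple[str, int]]) -> dict[str, int]:
    """Deterministic reducer: sort by key so arrival order cannot change the output."""
    return dict(sorted(results))


async def decompose_merge(regions: list[str]) -> dict[str, int]:
    # Sub-tasks run concurrently; gather collects their results.
    results = await asyncio.gather(*(run_subtask(r) for r in regions))
    return merge(list(results))


merged = asyncio.run(decompose_merge(["west", "east", "north"]))
```

Sorting inside the reducer is what makes replay stable: the merged output is the same regardless of which sub-task finishes first.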

bidding

Contract-Net

Decentralized bidding for dynamic resource allocation. Available agents bid; selection by best score, lowest cost, or fastest expected completion.

Use - model selection & routing
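Bid selection under the three policies named above can be sketched as a keyed minimum. The `Bid` fields, policy names, and tie-break on `agent_id` are assumptions for illustration, not the platform's wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    agent_id: str
    score: float        # higher is better
    cost_usd: float     # lower is better
    eta_seconds: float  # lower is better


def award(bids: list[Bid], policy: str = "best_score") -> Bid:
    """Select the winning bid; agent_id breaks ties so the outcome is reproducible."""
    if not bids:
        raise ValueError("no bids received")
    keys = {
        "best_score": lambda b: (-b.score, b.agent_id),
        "lowest_cost": lambda b: (b.cost_usd, b.agent_id),
        "fastest": lambda b: (b.eta_seconds, b.agent_id),
    }
    return min(bids, key=keys[policy])
```

An explicit tie-break matters for replay: without it, two equal bids could be awarded differently on a second run.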

k-of-n

Consensus

K-of-N validator approvals for high-risk decisions. Signed votes, immutable audit trails, deterministic tie-breaking. Configurable by risk tier.

Use - regulated decisions
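The K-of-N rule itself is a small counting check. The sketch below omits signature verification, which the pattern above requires, and the risk-tier quorum table is a hypothetical configuration rather than the platform's defaults.

```python
# Hypothetical mapping from risk tier to the number of approvals required.
TIER_QUORUM = {"low": 1, "medium": 2, "high": 3}


def k_of_n_approved(votes: dict[str, bool], validators: set[str], k: int) -> bool:
    """True when at least k distinct registered validators approve."""
    if not 1 <= k <= len(validators):
        raise ValueError("k must be between 1 and the number of validators")
    # Votes from unregistered identities are ignored rather than counted.
    approvals = {v for v, approve in votes.items() if approve and v in validators}
    return len(approvals) >= k
```

Because votes are keyed by validator identity, a validator cannot raise the count by voting twice.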

hitl

Escalate-Approve

Human-in-the-loop integration for governance gates. Confidence below threshold or risk above ceiling routes to a human approver before progression.

Use - compliance gates
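The escalation gate reduces to two comparisons. The threshold values and the route labels below are placeholders chosen for illustration; the source does not state the actual settings.

```python
def route(confidence: float, risk: float,
          min_confidence: float = 0.8, max_risk: float = 0.5) -> str:
    """Send the decision to a human when either gate trips; otherwise proceed."""
    if confidence < min_confidence or risk > max_risk:
        return "human_approver"
    return "proceed"
```

Using `or` rather than `and` means a confident but high-risk action still reaches a human, which matches the "confidence below threshold or risk above ceiling" wording.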

proof

Verification Game

Proof-challenge mechanism for integrity checks. Independent agents replay computations and challenge results; disagreements trigger arbitration.

Use - adversarial validation

Event-Sourced Memory

Memory as an immutable log, not a mutable state.

The "memory" half of MAN+ESM is not a vector store with overwrites. It's an append-only event log: every agent action, observation, and decision is a signed event. State is derived by replaying events through projections - which means state is reproducible, debuggable, and auditable, by construction.

Two memory tiers cooperate: episodic for long-term experience and learned preferences; scratchpad for task-local working context. Both are projections over the same event log.

events.append-log - agent_42

Append-only

t+0.000s PERCEIVED user_query: "analyze Q3 sales by region" #a14f

t+0.012s DECOMPOSED 3 sub-tasks - parallel #b22c

t+0.018s ROUTED policy -> executor.csv-query #c98e

t+0.041s VALIDATED critic - confidence 0.94 - pass #d44a

t+0.062s EXECUTED duckdb.query - 1.2k rows - receipt signed #e71b

t+0.084s LEDGERED sealed - sha256 #b9f2...0a4e - 6 signers #f0e3

Episodic projection

Long-term experience - preference store - vector-indexed for semantic recall

Scratchpad projection

Task-local working context - TTL-bounded - evicted on completion
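A projection is a fold over the log. The sketch below derives a scratchpad-style state from a few events modelled on the sample log above; `LogEvent`, the payload keys, and the projection logic are assumptions for illustration, not the platform's schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LogEvent:
    offset: int
    event_type: str
    payload: dict[str, Any]


def scratchpad_projection(events: list[LogEvent]) -> dict[str, Any]:
    """Derive task-local state by folding events in offset order; the log is never mutated."""
    state: dict[str, Any] = {"stages": []}
    for event in sorted(events, key=lambda e: e.offset):
        state["stages"].append(event.event_type)
        state.update(event.payload)
    return state


LOG = [
    LogEvent(0, "PERCEIVED", {"query": "analyze Q3 sales by region"}),
    LogEvent(1, "DECOMPOSED", {"sub_tasks": 3}),
    LogEvent(2, "VALIDATED", {"confidence": 0.94}),
]

state = scratchpad_projection(LOG)
state_at_offset_1 = scratchpad_projection(LOG[:2])  # replay a prefix to inspect the past
```

Replaying a prefix of the log reconstructs the state as it stood at that point, which is the mechanism behind deterministic replay.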

Deterministic replay

Bug? Replay the event log up to the failure. Every state, every decision, exactly reproducible.

Audit by construction

The log is the audit trail. Compliance teams query it directly; nothing is reconstructed after the fact.

Tamper-evident

Hash-chained events; each entry signed. Modifying history requires forging a chain - detectable on the next read.
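Hash chaining can be sketched with SHA-256 from the standard library. Per-entry signatures are left out, and the entry layout (`payload`, `prev`, `hash`) and the all-zero genesis value are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the hash before the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edited payload or broken link fails the check."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Changing any historical payload invalidates that entry's hash and every link after it, so a reader detects the edit on the next verification pass.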

Bounded memory

Projections snapshot. Old events compact. Memory growth stays linear in active work, not lifetime work.
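Snapshot-plus-compaction can be sketched for the simplest case, a last-write-wins key-value projection. That projection type, the event shape (`key`, `value`), and the `keep_last` parameter are assumptions; richer projections would need their own fold.

```python
def current_state(snapshot: dict, events: list[dict]) -> dict:
    """State is the snapshot with the retained events applied on top."""
    state = dict(snapshot)
    for event in events:
        state[event["key"]] = event["value"]
    return state


def compact(events: list[dict], snapshot: dict, keep_last: int) -> tuple[dict, list[dict]]:
    """Fold all but the newest keep_last events into the snapshot and drop them from the log."""
    cut = max(len(events) - keep_last, 0)
    new_snapshot = current_state(snapshot, events[:cut])
    return new_snapshot, events[cut:]
```

Compaction leaves the derived state unchanged while shrinking what must be stored, which is the property that keeps growth tied to active work.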

Coming Q3 2026 - early access waitlist

A LangGraph-grade developer experience, on a different substrate.

Familiar Python ergonomics - declarative agents, declarative workflows. Same SDK shape researchers already know; the decentralized event-driven runtime underneath is the only thing that's actually new.

my_agent.py

Agent definition

from agent_sdk import Agent, Event, EventType

class CSVPerceptor(Agent):
    capability = "csv.perceive"

    async def handle_event(self, event: Event) -> list[Event]:
        # parse user query -> structured intent
        intent = await self.parse(event.payload)

        # emit downstream - routing handled by the bus
        return [Event(
            event_type=EventType.PERCEIVED,
            task_id=event.task_id,
            payload={"intent": intent},
        )]

workflow.py

Workflow DSL

import asyncio

from agent_sdk import Workflow

# declarative - no graph wiring, just steps
workflow = Workflow(
    name="csv_analysis",
    steps=[
        "perceptor", "decomposer",
        "proposer", "safety",
        "executor", "critic",
    ],
)

async def main() -> None:
    # single call - routing & fan-out automatic
    result = await workflow.execute(
        task_id="task-123",
        payload={"query": "Analyze Q3 sales by region"},
    )
    print(result)

# await is only valid inside a coroutine, so run it on an event loop
asyncio.run(main())

In-memory test bus

Mock agents and a synchronous event bus for unit and integration tests - no Kafka required for local dev.

Zero-config topics

Topic creation, ACL setup, and discovery announcement happen automatically when an agent boots.

CLI for everything

Inspect topics, replay events, scaffold agents, run evals - all from man-cli.

The SDK, ESLL evaluation lab, and CLI are scheduled for public release in Q3 2026. Researchers and partners can request early access via the notify form.

Roadmap

Ten phases. Three shipped.

What's already in production, what's in flight, and what comes next - laid out honestly, with quarterly granularity.

Shipped - operational foundation

Phase 1

Core substrate

- Kafka-based event bus
- Discovery & routing
- Dead-letter queue (DLQ)
- Basic observability

Phase 2

LLM-first reasoning

- Policy & Critic LLM
- Self-consistency K-sampling
- Confidence thresholds
- Episodic + scratchpad memory

Phase 3

Capability expansion

- 9 core agent roles
- 6 micro-protocols
- Receipt generation & verify
- Safety rails & governance

In flight - Q1-Q2 2026

Phase 4 - Q1

Governance & trust

- mTLS everywhere
- Signed envelopes
- OPA / ABAC policy
- PII redaction

Phase 5 - Q1

Observability & HITL

- OpenTelemetry traces
- Eval harness (ESLL)
- HITL console
- Metrics dashboards

Phase 6 - Q1

Federation & scale

- Multi-tenant isolation
- Quotas & fairness
- Cross-cluster federation
- Chargeback & billing

Phase 7 - Q1-Q2

Event-Sourced Memory

- ESM substrate
- State reconstruction
- Temporal indexes
- Deterministic replay

Future - Q2-Q3 2026

Phase 8 - Q2

Learning & adaptation

- Preference store & reward heads
- Prompt tuning & calibration
- Cascade optimizer
- Canary deployments

Phase 9 - Q2-Q3

Scale & resilience

- Exactly-once semantics
- Checkpointing & recovery
- Chaos engineering & DLQ replay
- Disaster-recovery drills

Phase 10 - Q3

System evolution

- Schema compatibility checker
- Protocol negotiation
- Dual-write migrations
- Zero-downtime deploys

Resources & citations

The work this builds on top of.

MAN+ESM stands on a long tradition: distributed systems literature, multi-agent protocols, event-sourcing patterns. The novelty is the synthesis, not the parts.

Preprint - available PDF - 9 pages

Multi-Agent Network and Event-Sourced Memory: A Decentralized Architecture for Production-Scale Agent Coordination

Full architectural treatment - three-layer model, gossip protocol, Flow ID observability, ESM, micro-protocols, reliability mathematics, comparative analysis. The benchmark suite and reproducible notebooks publish alongside the public release.

GitHub

Private - Q2 2026

adya-ai / man-esm

SDK, ESLL eval lab, CLI, reference agents, and benchmark notebooks. Public repo opens alongside the paper release.

Benchmark suite

Reproducible comparisons

Hardware specs, prompt corpus, framework reference implementations, latency & memory measurement scripts. Same task suite for every framework.

Selected prior work this builds on

[1] Smith. The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver. IEEE Trans. Computers, 1980.

[2] Demers et al. Epidemic Algorithms for Replicated Database Maintenance. PODC, 1987.

[3] Fowler. Event Sourcing. martinfowler.com, 2005.

[4] Kreps. The Log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn Engineering, 2013.

[5] Wu et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155, 2023.

[6] Chase et al. LangGraph: Stateful, Multi-Actor Applications with LLMs. LangChain Inc., 2024.

Push the field forward

Read the work. Run the benchmarks. Tear it apart.

We are publishing the architecture, the methodology, and the measurements precisely because we want them stress-tested. If you find a flaw in the design or a fragility in the numbers, that is the most useful conversation we can have right now.

Get notified when MAN+ESM publishes