Research - Multi-Agent Network + Event-Sourced Memory

Multi-agent reasoning, without the orchestrator.

MAN+ESM is a decentralized event-driven platform that breaks the scalability-reliability-coordination triad limiting today's agentic frameworks. No central planner. No O(n²) message paths. Linear horizontal scaling to thousands of agents, with deterministic event replay and bounded memory by construction.

Phases 1-3 shipped - Phases 4-7 in flight (Q1-Q2 2026) - 9 agent roles - 6 micro-protocols - Kafka backbone
GOSSIP-BASED PROPAGATION - O(log n) HOPS
Event-driven, no orchestrator - Gossip-based discovery - Linear scaling - 1k+ agents - Sub-second coordination - Deterministic replay - Bounded memory - Schema-validated boundaries
The problem

Three pressures, no winner.

Every multi-agent architecture in production today optimizes for one or two of scalability, reliability, and coordination - and pays for it on the third. There is no current paradigm that holds all three at once. This is not a tooling gap; it's an architectural gap.

The numbers are conservative; they come from production teams attempting to deploy LangGraph, AutoGen, and CrewAI at enterprise scale.

[Figure: the scalability-reliability-coordination triangle. Vertices: Scalability (information noise), Reliability (hallucination cascades), Coordination (consensus paradox). Edges: scale at the cost of reliability, speed at the cost of scale, reliability at the cost of speed. Center: no paradigm sits here.]
Vertex 1 - Information noise

Scalability bottlenecks

Centralized orchestrators carry an O(n²) message burden. Context windows fill with irrelevant chatter. Compute scales with noise, not signal - and at 100 agents a single coordinator handles 10,000 message paths, pushing latency from seconds to minutes.

Vertex 2 - Reliability crisis

Hallucination cascades

Without validation gates between stages, a single misclassified output multiplies downstream. Errors propagate before detection; trust degrades as cascades stack. Recovery becomes manual and post-hoc - and in regulated workflows, sometimes irreversible.

Vertex 3 - Consensus paradox

Coordination latency

Synchronous consensus demands serial execution. Distributed agents wait on each other; network round-trips compound. Even ten agents can take 30+ seconds to converge - too slow for any sub-second use case, and getting worse with cluster size, not better.

The architectural question follows: can a single design satisfy all three pressures simultaneously, without trading one off against the others?

The architectural gap

Why the field's three dominant paradigms each fail

We are not arguing the existing frameworks are bad - they each get something right. We are arguing that none of them is positioned to deliver the triad simultaneously, and the reasons are structural, not implementation details.

LangGraph
Centralized orchestration
Strengths
  • Simple coordination model
  • Predictable execution flow
  • Easy debugging on small graphs
Weaknesses
  • Single point of failure (orchestrator)
  • O(n²) message paths through coordinator
  • No horizontal scaling beyond one runtime
Why it fails

Latency grows faster than linearly with agent count past ~50 (180 ms at 10 agents, 8.2 s at 100 in the measurements below); the orchestrator becomes the bottleneck before reasoning ever starts.

AutoGen
Isolated agents
Strengths
  • True horizontal scaling at the agent level
  • High individual fault tolerance
  • Embarrassingly parallel for independent tasks
Weaknesses
  • No shared goals or canonical context
  • Inconsistent memory across instances
  • Duplicated work, no coordination protocol
Why it fails

Complex workflows that require shared context and validated handoffs collapse - each agent operates in its own bubble, with no consensus surface.

CrewAI
Disconnected graphs
Strengths
  • Defined workflows, role-based clarity
  • Some parallelism within a crew
  • Distributed processing for known graphs
Weaknesses
  • Static, rigid graph structure
  • No real-time consensus mechanism
  • Breaks on a single agent failure
Why it fails

Cannot adapt to changing requirements or unpredictable agent behaviour - the graph is fixed at design time, the world is not.

The hypothesis

Event-driven messaging, LLM-first routing, gossip-based peer discovery, and a small set of well-defined micro-protocols can satisfy all three pressures simultaneously - without trading any against the others.

This is the bet. The rest of this page is the evidence.
The reliability math

Why distributed redundancy compounds rather than averages.

The reliability advantage of decentralization is not anecdotal - it is mathematically demonstrable. A centralized system is bounded by its coordinator's reliability. A distributed system with N redundant paths converts moderately reliable components into highly reliable systems, by construction.

A 90%-reliable component, replicated three ways, yields a 99.9%-reliable system, provided the replicas fail independently - without requiring perfect parts. This matters for LLM-based components, where output reliability is fundamentally probabilistic.

Centralized - capped
$\text{System Reliability} = \text{Coordinator Reliability} = p$
at p = 0.90: 90%
Distributed - compounds
$\text{System Reliability} = 1 - (1 - p)^{N}$
at p = 0.90, N = 3: 99.9%
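The two formulas above can be checked numerically. This is a minimal sketch, assuming the N redundant paths fail independently of one another (replicas that share one model and one prompt may not); the function names are illustrative and not part of any published SDK.

```python
def centralized_reliability(p: float) -> float:
    """A centralized system is capped at its coordinator's reliability."""
    return p


def distributed_reliability(p: float, n: int) -> float:
    """Probability that at least one of n independent paths succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The system fails only if every path fails: (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


for n in (1, 2, 3, 5):
    print(f"N={n}: {distributed_reliability(0.90, n):.5f}")
```

Each added path multiplies the residual failure probability by (1 - p), which is why the gain compounds rather than averages.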

A symmetric argument applies to memory: centralized state grows super-linearly with cluster complexity; distributed state grows linearly with agent count.

Benchmarks

The numbers, side by side.

Two evaluations against the three dominant frameworks. First, task-level resolution rate on SWE-bench and T-bench. Second, internal infrastructure measurements - latency, memory, discovery - as the substrate scales from 10 to 1,000 agents.

01 / Task-level evaluation

Where the memory layer earns its name.

Two task suites, ten tasks each. SWE-bench (Astropy & Django bug fixes) and T-bench (Fix Git, COBOL Modernization, plus medium-complexity tasks). Same model, same prompts, same problems. We measure task resolution rate.

The ESM uplift - T-bench
40% -> 100%
MAN alone resolves 4 of 10 tasks. With ESM as the memory substrate, MAN+ESM resolves all 10.

Event-Sourced Memory isn't an optimization - it's a capability prerequisite. Without persistent, replayable working state, multi-step T-bench tasks fall over at the boundary between agent invocations. With ESM, working context survives those boundaries and tasks complete deterministically. The same architectural property that makes the system auditable makes it capable.

SWE-bench
Astropy - Django bug fixes - 10 tasks

Framework        Resolved   Rate
MAN+ESM (Adya)   5 / 10     50%
MAN (Adya)       5 / 10     50%
AutoGen          2 / 10     20%
LangGraph        1 / 10     10%

MAN frameworks resolve half the suite; the closest competitor lands one in five.
T-bench
Fix Git - COBOL - mixed complexity - 10 tasks

Framework        Resolved   Rate
MAN+ESM (Adya)   10 / 10    100%
MAN (Adya)       4 / 10     40%
AutoGen          1 / 10     10%
LangGraph        0 / 10     0%

MAN+ESM clears the suite; LangGraph resolves nothing.
Mean execution time
T-bench - per-task wall clock

MAN+ESM     220.0 s
MAN         200.0 s
LangGraph   180.0 s
AutoGen     151.3 s

AutoGen is fastest but resolves 1 in 10. MAN+ESM is slowest, ~70 seconds behind AutoGen per task, and resolves 10 in 10. The trade-off favours reliability at the margin we care about.

Scope

Each suite runs 10 tasks - small N by design, since the goal was a same-model, same-prompt comparison across four frameworks rather than a full bench sweep. Larger sweeps with the full SWE-bench Lite and T-bench task universes will publish alongside the public release. The relative ordering of frameworks is consistent across both suites.

02 / Infrastructure & scaling - preliminary internal measurements

~85x faster - p50 coordination latency at 100 agents vs. LangGraph
~18x leaner - memory per agent vs. graph-based frameworks
~100x larger - maximum stable cluster size, agents under SLA

[Figure: p50 coordination latency vs. cluster size. Lower is better. X-axis: agent count (10 to 1,000). Y-axis: p50 latency, log scale (10 ms to 100 s). Series: MAN+ESM, LangGraph, AutoGen, CrewAI. MAN+ESM reaches 240 ms at 1,000 agents; the other series end in TIMEOUT or OOM.]

MAN+ESM stays in the 40-250 ms band across the full sweep; LangGraph and AutoGen slow by roughly 45x between 10 and 100 agents and time out by 1,000, and CrewAI runs out of memory at 100.

Metric                              MAN+ESM (Adya)   LangGraph (centralized)   AutoGen (isolated)   CrewAI (graph)

Latency & Coordination
p50 latency - 10 agents             42 ms            180 ms                    320 ms               410 ms
p50 latency - 100 agents            95 ms            8.2 s                     14 s                 OOM
p50 latency - 1,000 agents          240 ms           timeout                   timeout              -
Coordination overhead               3.8%             41%                       28%                  36%

Memory Footprint
Memory per agent                    18 MB            340 MB                    220 MB               280 MB
Context bloat factor                O(1)             O(n²)                     O(n)                 O(n²)

Discovery & Adaptation
Capability discovery - 100 nodes    280 ms gossip    4.2 s                     no native            redeploy
Add capability without restart

Resilience
Single-node failure recoverable partial
Deterministic event replay

Scale Ceiling
Max stable cluster size             5,000+           ~50                       ~80                  ~30

Preliminary numbers - methodology forthcoming

Latency, memory, discovery, and cluster-size figures above are order-of-magnitude estimates from internal load tests against framework reference implementations on equivalent hardware (same model, same prompts, same task suite). Full methodology, hardware spec, prompt corpus, and reproducible notebooks publish alongside the paper. We will update this table with the audited values when the paper drops; the relative shape of the comparison is stable across our internal runs.

Architecture

The intelligence loop, end to end.

A single request enters the bus, gets routed by a Policy LLM, validated by a Critic, executed against tools, and emits signed receipts - all without a central orchestrator holding state. Every stage is a Kafka topic; every agent is a subscriber.

[Figure: the intelligence loop. Producer (client app) publishes to the Requests Kafka topic; Policy (routing LLM), Critic (validation LLM), and Executor (tool use) subscribe in turn, with a needs-fix path that iterates; results are emitted to the Receipts output topic and read by the Consumer (final result). Bands: Event Bus (Kafka), Intelligent Agents, Receipts & Artifacts.]
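The loop can be sketched with a toy in-memory bus standing in for Kafka. Everything here is an assumption made for illustration: the topic names, the confidence numbers, the retry budget, and the choice to send needs-fix events back to the requests topic. The Policy and Critic are plain stub functions rather than LLM calls.

```python
from collections import defaultdict, deque
from typing import Callable

Event = dict
Handler = Callable[[Event], None]


class InMemoryBus:
    """Toy stand-in for the Kafka backbone: named topics, FIFO delivery."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self._queue: deque[tuple[str, Event]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        self._queue.append((topic, event))

    def run(self) -> None:
        # Drain the queue; handlers may publish follow-up events.
        while self._queue:
            topic, event = self._queue.popleft()
            for handler in list(self._subscribers[topic]):
                handler(event)


CONFIDENCE_THRESHOLD = 0.9  # assumed gate value
MAX_ATTEMPTS = 3            # assumed retry budget

bus = InMemoryBus()
receipts: list[Event] = []


def policy(event: Event) -> None:
    # Stub for the routing LLM: confidence improves with each retry.
    attempt = event.get("attempt", 0)
    bus.publish("routed", {**event, "attempt": attempt,
                           "route": "executor.echo",
                           "confidence": 0.6 + 0.2 * attempt})


def critic(event: Event) -> None:
    # Stub for the validation LLM: pass, ask for a fix, or give up.
    if event["confidence"] >= CONFIDENCE_THRESHOLD:
        bus.publish("validated", event)
    elif event["attempt"] + 1 < MAX_ATTEMPTS:
        bus.publish("requests", {**event, "attempt": event["attempt"] + 1})
    else:
        bus.publish("dead-letter", event)


def executor(event: Event) -> None:
    # Stub for tool use: uppercase the payload and emit a receipt.
    bus.publish("receipts", {"task_id": event["task_id"],
                             "result": event["payload"].upper(),
                             "attempt": event["attempt"]})


bus.subscribe("requests", policy)
bus.subscribe("routed", critic)
bus.subscribe("validated", executor)
bus.subscribe("receipts", receipts.append)

bus.publish("requests", {"task_id": "task-1", "payload": "analyze q3 sales"})
bus.run()
print(receipts)
```

No component holds the workflow state: each handler reads one event and publishes the next, so the only shared structure is the bus itself.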
Principle 1
Decentralization
Independent stateless agents, capability-announced peer discovery, no single point of failure.
Principle 2
Contracts first
Avro schema registry as canonical source. Versioned, type-safe inter-agent messages.
Principle 3
Fail-closed rails
Schema validation at every boundary. ACL enforcement. SLA & budget gates.
Principle 4
Decentralized reasoning
Policy LLM for routing, Critic LLM for validation, confidence-thresholded escalation.
Principle 5
Production-grade lifecycle
OpenTelemetry, eval harness, HITL, governance, chargeback, fault tolerance.
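Principles 2 and 3 combine into one rule: a message that does not match its contract is rejected at the boundary. A minimal sketch of that fail-closed behaviour follows, assuming a hand-written field-to-type mapping as a stand-in for an Avro schema fetched from the registry. The schema contents and all names here are hypothetical.

```python
from typing import Any

# Hypothetical contract; a real deployment would load it from the schema registry.
TASK_SCHEMA_V1: dict[str, type] = {
    "task_id": str,
    "event_type": str,
    "payload": dict,
}


class SchemaViolation(Exception):
    """Raised when a message does not match its contract."""


def validate(message: dict[str, Any], schema: dict[str, type]) -> None:
    missing = schema.keys() - message.keys()
    unknown = message.keys() - schema.keys()
    if missing or unknown:
        raise SchemaViolation(
            f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for field, expected in schema.items():
        if not isinstance(message[field], expected):
            raise SchemaViolation(f"{field}: expected {expected.__name__}")


def accept_at_boundary(message: dict[str, Any], schema: dict[str, type],
                       dead_letters: list[dict]) -> bool:
    """Fail closed: anything that does not validate is parked, not processed."""
    try:
        validate(message, schema)
    except SchemaViolation as exc:
        dead_letters.append({"message": message, "reason": str(exc)})
        return False
    return True


dlq: list[dict] = []
ok = accept_at_boundary(
    {"task_id": "task-1", "event_type": "PERCEIVED", "payload": {}},
    TASK_SCHEMA_V1, dlq)
bad = accept_at_boundary({"task_id": 42, "event_type": "PERCEIVED"},
                         TASK_SCHEMA_V1, dlq)
print(ok, bad, dlq)
```

The default outcome is rejection: a message is processed only when validation completes without raising.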
The coordination model

Three layers, no hierarchy.

Agents organize across three specialized layers - but the layering is a separation of concerns, not an upward escalation of authority. Any agent at any layer can make any of three decisions at runtime: decompose, execute, or forward.

Layer 1
Decision Layer
Primary agents

Task decomposition, workflow orchestration, high-level coordination strategy. Stateless - every invocation operates on the current event context.

Owns
decomposition - routing strategy - inter-layer comms - global optimization
Layer 2
Execution Layer
Specialist agents

Domain-specific work - code, data analysis, document parsing, validation. Subscribes to capability topics. Parallel by default, serial only when explicitly required.

Owns
domain processing - parallel execution - result synthesis - validation
Layer 3
Interface Layer
Gateway agents

External system integration, format transformation between event schemas and external APIs, security perimeter. Legacy systems integrate as Gateway Agents wrapping their interfaces.

Owns
auth boundaries - format translation - perimeter security - legacy adaptation
The runtime decision
Roles are not assigned. Agents choose at runtime.

Every agent in MAN+ESM exposes the same three primitive capabilities. The agent inspects the incoming event, consults its capability registry and the live peer mesh, and selects which capability to invoke - dynamically, every time.

decompose - break into sub-tasks
execute - handle directly
forward - route to a more capable peer

This is the structural difference from frameworks where roles are fixed at initialization (CrewAI) or bound to graph nodes (LangGraph): MAN+ESM agents adapt their behaviour to the task in front of them.
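One way to picture the runtime choice is a small decision function. The selection rule below is an assumption made for illustration (execute if the agent covers every required capability, forward if a known peer does, otherwise decompose); the page does not specify the actual policy, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    DECOMPOSE = "decompose"
    EXECUTE = "execute"
    FORWARD = "forward"


@dataclass
class TaskEvent:
    required: set[str]  # capabilities the task needs


@dataclass
class MeshAgent:
    capabilities: set[str]
    # Peer id -> capabilities that peer has announced on the mesh.
    peers: dict[str, set[str]] = field(default_factory=dict)

    def decide(self, event: TaskEvent) -> Decision:
        if event.required <= self.capabilities:
            return Decision.EXECUTE      # handle directly
        if any(event.required <= caps for caps in self.peers.values()):
            return Decision.FORWARD      # a single peer can do all of it
        return Decision.DECOMPOSE        # nobody can alone: split the task


agent = MeshAgent(
    capabilities={"csv.perceive"},
    peers={"peer-7": {"sql.query", "chart.render"}},
)
print(agent.decide(TaskEvent({"csv.perceive"})))
print(agent.decide(TaskEvent({"sql.query"})))
print(agent.decide(TaskEvent({"csv.perceive", "sql.query"})))
```

The decision is recomputed for every event, so the same agent can execute one task and forward the next as the peer mesh changes.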

Discovery - the gossip protocol

How agents find each other in O(log n) hops.

Centralized registries put discovery latency on a single machine's critical path; the registry becomes the bottleneck before any reasoning starts. Gossip puts it on the network - and the network is faster than the registry, by design.

The competition
Centralized registry lookup
~4.2 s @ 100
REGISTRY SINGLE BOTTLENECK - O(n)
Every discovery query funnels through one machine
Registry crash = entire fleet loses peer awareness
Latency grows linearly with cluster size
Our approach
Gossip-based propagation
280 ms @ 100
PEER-TO-PEER - O(log n)
Capability announcements propagate peer-to-peer in log time
No single failure point - kill any node, the mesh reroutes
TTL-based expiry - stale capabilities self-evict
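TTL-based expiry can be sketched as a per-agent table in which every announcement carries a deadline and stale entries are dropped on read. This is an illustrative sketch rather than the platform's implementation; the class name, the 30-second TTL, and the injectable clock are assumptions.

```python
import time
from typing import Callable


class CapabilityTable:
    """Local view of peer capabilities; entries expire unless re-announced."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # (agent_id, capability) -> expiry time
        self._expires: dict[tuple[str, str], float] = {}

    def announce(self, agent_id: str, capability: str) -> None:
        """Record a new announcement or refresh an existing one."""
        self._expires[(agent_id, capability)] = self._clock() + self._ttl

    def providers(self, capability: str) -> list[str]:
        """Return live providers, evicting anything past its deadline."""
        now = self._clock()
        self._expires = {k: t for k, t in self._expires.items() if t > now}
        return sorted(a for (a, c) in self._expires if c == capability)


# A fake clock makes the expiry behaviour easy to follow.
now = [0.0]
table = CapabilityTable(ttl_seconds=30, clock=lambda: now[0])
table.announce("agent-1", "csv.perceive")
now[0] = 20.0
table.announce("agent-2", "csv.perceive")
now[0] = 35.0  # agent-1's entry expired at t=30
print(table.providers("csv.perceive"))
```

A crashed agent simply stops re-announcing, so its capabilities disappear from every peer's table without any deregistration step.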
The math
Why O(log n) wins at scale

A gossip wave reaches the entire cluster in roughly $\log_k(n)$ rounds, where k is the fanout and each round is bounded by network round-trip time, not by a single coordinator's compute. At 100 agents with a fanout of 4, that's ~3.3 hops. At 1,000 agents, ~5 hops. At 10,000, ~6.6 hops. The registry-based alternative scales linearly with cluster size and with query volume, which compounds.
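The hop counts above follow directly from the logarithm, and a small simulation shows the same logarithmic shape. This is a sketch under simplifying assumptions: uniform random push gossip, every informed node pushing to `fanout` random nodes per round, no message loss. The simulated count tends to come out somewhat above the idealized estimate because pushes overlap.

```python
import math
import random


def estimated_rounds(n: int, fanout: int) -> float:
    """Idealized estimate: log base `fanout` of the cluster size."""
    return math.log(n) / math.log(fanout)


def simulate_push_gossip(n: int, fanout: int, seed: int = 0) -> int:
    """Count rounds until all n nodes have heard an announcement."""
    rng = random.Random(seed)
    informed = {0}  # node 0 makes the announcement
    rounds = 0
    while len(informed) < n:
        newly_informed: set[int] = set()
        for _ in informed:
            # Each informed node pushes to `fanout` random nodes.
            newly_informed.update(rng.sample(range(n), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


for n in (100, 1_000, 10_000):
    print(f"n={n}: estimate {estimated_rounds(n, 4):.2f}, "
          f"simulated {simulate_push_gossip(n, 4)} rounds")
```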

The capability palette

Nine specialized agent roles, collaborating autonomously.

Each role is a stateless service subscribing to its capability topic. Roles compose via the bus; complex workflows are expressed by which roles publish and which subscribe - not by hand-wired graphs.

Perceptor
Ingestion & parsing
  • Natural-language understanding
  • Multimodal content extraction
  • Data structure discovery
Decomposer
Task planning
  • Complex task breakdown
  • Parallel sub-task generation
  • Self-consistency (K-sampling)
Proposer
Content generation
  • Code generation & content creation
  • Solution proposal
  • Memory-enhanced generation
Critic
Validation & quality
  • Output validation & assessment
  • Needs-fix loops with feedback
  • Confidence scoring
Safety
Governance & compliance
  • PII detection & redaction
  • Tool allowlist enforcement
  • Regional routing compliance
Executor
Tool execution
  • Secure code execution (sandboxed)
  • API calls with signed receipts
  • Side-effect verification
Scheduler
Resource management
  • Budget & quota enforcement
  • Priority calculation
  • Rate limiting & fairness
Observer
Telemetry & monitoring
  • Timeline reconstruction
  • Anomaly detection
  • Performance & cost tracking
Ledger
Audit & provenance
  • Immutable append-only log
  • Lineage queries
  • Receipt verification
Role categories: Reasoning - Governance - Execution - Operations
Micro-protocols

Six interaction patterns, composable everywhere.

Workflows do not need bespoke choreography. They compose from a small, well-defined set of standardized interaction patterns - each with formal guarantees, replay semantics, and consensus rules.

iterative
Propose-Critique

Iterative refinement loop. Proposer drafts; Critic evaluates against quality criteria; needs-fix feedback loops back. Quality gates enforced at each iteration.

Use - code & content generation
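As a rough illustration of the Propose-Critique loop, the sketch below runs a proposer and a critic until the critic returns no feedback or a round budget is exhausted. The function signature, the `max_rounds` budget, and the toy docstring check are assumptions; in the platform both roles would be LLM-backed agents communicating over the bus.

```python
from typing import Callable, Optional

Proposer = Callable[[str, Optional[str]], str]   # (task, feedback) -> draft
Critic = Callable[[str], Optional[str]]          # draft -> feedback, or None to pass


def propose_critique(propose: Proposer, critique: Critic,
                     task: str, max_rounds: int = 3) -> str:
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        draft = propose(task, feedback)
        feedback = critique(draft)      # quality gate on every iteration
        if feedback is None:
            return draft
    raise RuntimeError("quality gate not met within max_rounds")


def toy_proposer(task: str, feedback: Optional[str]) -> str:
    # The first draft omits the docstring; the retry acts on the feedback.
    if feedback == "missing docstring":
        return f'def solve():\n    """Solve {task}."""\n    return {task!r}\n'
    return f"def solve():\n    return {task!r}\n"


def toy_critic(draft: str) -> Optional[str]:
    return None if '"""' in draft else "missing docstring"


print(propose_critique(toy_proposer, toy_critic, "demo"))
```

Bounding the number of rounds keeps a stubborn disagreement from looping forever; what happens after the budget runs out (escalation, dead-lettering) is left open here.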
parallel
Decompose-Merge

Parallel sub-task execution with dependency-aware aggregation. Tasks fan out to capability topics; results merge back through deterministic reducers.

Use - multi-step reasoning
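A deterministic reducer means the merged result does not depend on which sub-task finishes first. A minimal sketch using `asyncio.gather`, with sub-tasks that have no dependencies on each other (dependency-aware scheduling is out of scope here); the function names and the sales data are made up for the example.

```python
import asyncio


async def run_subtask(subtask_id: int, region: str,
                      sales: dict[str, list[int]]) -> tuple[int, str, int]:
    await asyncio.sleep(0)  # stand-in for real work on a specialist agent
    return subtask_id, region, sum(sales[region])


def merge(results: list[tuple[int, str, int]]) -> dict[str, int]:
    # Deterministic reducer: order by sub-task id, never by arrival time.
    return {region: total for _, region, total in sorted(results)}


async def decompose_merge(sales: dict[str, list[int]]) -> dict[str, int]:
    # Decompose: one sub-task per region, numbered in a stable order.
    subtasks = [run_subtask(i, region, sales)
                for i, region in enumerate(sorted(sales))]
    results = await asyncio.gather(*subtasks)  # fan out in parallel
    return merge(list(results))


totals = asyncio.run(decompose_merge({"emea": [1, 2], "apac": [3]}))
print(totals)
```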
bidding
Contract-Net

Decentralized bidding for dynamic resource allocation. Available agents bid; selection by best score, lowest cost, or fastest expected completion.

Use - model selection & routing
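The award step of a Contract-Net round can be sketched as a selection over collected bids. The `Bid` fields mirror the three criteria named above; the strategy names and the tie-break on agent id are assumptions added to make the selection deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    agent_id: str
    score: float        # higher is better
    cost: float         # lower is better
    eta_seconds: float  # lower is better


def award(bids: list[Bid], strategy: str = "best_score") -> Bid:
    """Pick the winning bid; ties are broken by agent id (an assumption)."""
    if not bids:
        raise ValueError("no bids received")
    sort_keys = {
        "best_score": lambda b: (-b.score, b.agent_id),
        "lowest_cost": lambda b: (b.cost, b.agent_id),
        "fastest": lambda b: (b.eta_seconds, b.agent_id),
    }
    if strategy not in sort_keys:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(bids, key=sort_keys[strategy])


bids = [
    Bid("model-large", score=0.95, cost=4.0, eta_seconds=9.0),
    Bid("model-small", score=0.80, cost=0.5, eta_seconds=2.0),
    Bid("model-mid", score=0.90, cost=1.5, eta_seconds=4.0),
]
for strategy in ("best_score", "lowest_cost", "fastest"):
    print(strategy, "->", award(bids, strategy).agent_id)
```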
k-of-n
Consensus

K-of-N validator approvals for high-risk decisions. Signed votes, immutable audit trails, deterministic tie-breaking. Configurable by risk tier.

Use - regulated decisions
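A K-of-N check counts only approvals whose signatures verify. In this sketch, HMAC with per-validator shared secrets stands in for whatever signature scheme the platform actually uses (an assumption; a production system would more likely use asymmetric signatures), and the vote format is hypothetical.

```python
import hashlib
import hmac


def sign_vote(secret: bytes, decision_id: str, approve: bool) -> str:
    message = f"{decision_id}:{int(approve)}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def k_of_n_approved(decision_id: str,
                    votes: dict[str, tuple[bool, str]],
                    secrets: dict[str, bytes],
                    k: int) -> bool:
    """votes maps validator id -> (approve, signature)."""
    approvals = 0
    for validator_id, (approve, signature) in votes.items():
        secret = secrets.get(validator_id)
        if secret is None:
            continue  # unknown validator: ignore the vote
        expected = sign_vote(secret, decision_id, approve)
        # Constant-time comparison; forged signatures never count.
        if approve and hmac.compare_digest(expected, signature):
            approvals += 1
    return approvals >= k


secrets = {"v1": b"key-1", "v2": b"key-2", "v3": b"key-3"}
votes = {
    "v1": (True, sign_vote(secrets["v1"], "decision-1", True)),
    "v2": (True, sign_vote(secrets["v2"], "decision-1", True)),
    "v3": (True, "forged"),
}
print(k_of_n_approved("decision-1", votes, secrets, k=2))
print(k_of_n_approved("decision-1", votes, secrets, k=3))
```

Raising k for higher risk tiers tightens the gate without changing the protocol.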
hitl
Escalate-Approve

Human-in-the-loop integration for governance gates. Confidence below threshold or risk above ceiling routes to a human approver before progression.

Use - compliance gates
proof
Verification Game

Proof-challenge mechanism for integrity checks. Independent agents replay computations and challenge results; disagreements trigger arbitration.

Use - adversarial validation
Event-Sourced Memory

Memory as an immutable log, not a mutable state.

The "memory" half of MAN+ESM is not a vector store with overwrites. It's an append-only event log: every agent action, observation, and decision is a signed event. State is derived by replaying events through projections - which means state is reproducible, debuggable, and auditable, by construction.

Two memory tiers cooperate: episodic for long-term experience and learned preferences; scratchpad for task-local working context. Both are projections over the same event log.

events.append-log - agent_42
Append-only
t+0.000s PERCEIVED user_query: "analyze Q3 sales by region" #a14f
t+0.012s DECOMPOSED 3 sub-tasks - parallel #b22c
t+0.018s ROUTED policy -> executor.csv-query #c98e
t+0.041s VALIDATED critic - confidence 0.94 - pass #d44a
t+0.062s EXECUTED duckdb.query - 1.2k rows - receipt signed #e71b
t+0.084s LEDGERED sealed - sha256 #b9f2...0a4e - 6 signers #f0e3
Episodic projection
Long-term experience - preference store - vector-indexed for semantic recall
Scratchpad projection
Task-local working context - TTL-bounded - evicted on completion
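The relationship between the log and its two projections can be sketched in a few dozen lines. This is an illustration under stated assumptions: the `COMPLETED` event kind is invented here to trigger scratchpad eviction, and the episodic projection is reduced to a route tally instead of a vector index. Event signing is omitted.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class LogEvent:
    seq: int
    task_id: str
    kind: str
    data: dict


class EventLog:
    """Append-only: events are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: list[LogEvent] = []

    def append(self, task_id: str, kind: str, data: dict) -> LogEvent:
        event = LogEvent(len(self._events) + 1, task_id, kind, data)
        self._events.append(event)
        return event

    def replay(self, upto: Optional[int] = None) -> Iterator[LogEvent]:
        """Yield events in order, optionally stopping at sequence `upto`."""
        for event in self._events:
            if upto is not None and event.seq > upto:
                break
            yield event


def scratchpad(log: EventLog, task_id: str,
               upto: Optional[int] = None) -> dict:
    """Task-local working context; cleared once the task completes."""
    state: dict = {}
    for event in log.replay(upto):
        if event.task_id != task_id:
            continue
        if event.kind == "COMPLETED":
            state = {}
        else:
            state[event.kind] = event.data
    return state


def episodic(log: EventLog, upto: Optional[int] = None) -> dict[str, int]:
    """Long-term tally of which routes were used, across all tasks."""
    routes: dict[str, int] = {}
    for event in log.replay(upto):
        if event.kind == "ROUTED":
            route = event.data["route"]
            routes[route] = routes.get(route, 0) + 1
    return routes


log = EventLog()
log.append("t1", "PERCEIVED", {"query": "analyze Q3 sales by region"})
log.append("t1", "ROUTED", {"route": "executor.csv-query"})
log.append("t1", "COMPLETED", {})
log.append("t2", "ROUTED", {"route": "executor.csv-query"})

print(scratchpad(log, "t1", upto=2))  # state as of the second event
print(episodic(log))
```

Because both projections are pure functions of the log, replaying to any sequence number reproduces the state the system had at that point.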
Deterministic replay

Bug? Replay the event log up to the failure. Every state, every decision, exactly reproducible.

Audit by construction

The log is the audit trail. Compliance teams query it directly; nothing is reconstructed after the fact.

Tamper-evident

Hash-chained events; each entry signed. Modifying history requires forging a chain - detectable on the next read.
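Hash chaining can be shown with the standard library alone. In this sketch each entry's hash covers the previous entry's hash plus a canonical JSON encoding of the payload, so editing any past entry breaks verification from that point on. Per-entry signatures are left out, and the entry layout is an assumption.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical encoding so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain: list[dict] = []
append(chain, {"kind": "PERCEIVED", "task": "t1"})
append(chain, {"kind": "EXECUTED", "task": "t1", "rows": 1200})
print(verify(chain))

chain[0]["payload"]["task"] = "t2"  # tamper with history
print(verify(chain))
```

Recomputing one entry's hash is not enough to hide a change: every later entry would also need rewriting, which is what signed entries are meant to prevent.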

Bounded memory

Projections snapshot. Old events compact. Memory growth stays linear in active work, not lifetime work.
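Snapshotting and compaction can be sketched as folding old events into a saved state and keeping only the recent tail in memory. The key-value event shape and the `snapshot_every` threshold are assumptions; a real system would presumably archive compacted events instead of discarding them, since the log is also the audit trail.

```python
def apply(state: dict, event: dict) -> dict:
    """Pure state transition: returns a new state, leaves the old one intact."""
    new_state = dict(state)
    new_state[event["key"]] = event["value"]
    return new_state


class CompactingLog:
    def __init__(self, snapshot_every: int) -> None:
        self.snapshot: dict = {}   # state as of snapshot_seq
        self.snapshot_seq = 0      # number of events folded into the snapshot
        self.tail: list[dict] = []  # events newer than the snapshot
        self._every = snapshot_every

    def append(self, event: dict) -> None:
        self.tail.append(event)
        if len(self.tail) >= self._every:
            self.compact()

    def compact(self) -> None:
        # Fold the tail into the snapshot, then drop it from memory.
        for event in self.tail:
            self.snapshot = apply(self.snapshot, event)
        self.snapshot_seq += len(self.tail)
        self.tail.clear()

    def state(self) -> dict:
        # Current state = snapshot + replay of the (short) tail.
        state = self.snapshot
        for event in self.tail:
            state = apply(state, event)
        return state


clog = CompactingLog(snapshot_every=4)
for i in range(10):
    clog.append({"key": f"k{i % 3}", "value": i})
print(clog.snapshot_seq, len(clog.tail), clog.state())
```

The in-memory tail never grows past `snapshot_every` events, regardless of how many events the system has processed in total.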

Coming Q3 2026 - early access waitlist

A LangGraph-grade developer experience, on a different substrate.

Familiar Python ergonomics - declarative agents, declarative workflows. Same SDK shape researchers already know; the decentralized event-driven runtime underneath is the only thing that's actually new.

my_agent.py
Agent definition
from agent_sdk import Agent, Event, EventType

class CSVPerceptor(Agent):
    capability = "csv.perceive"

    async def handle_event(self, event: Event) -> list[Event]:
        # parse user query -> structured intent
        intent = await self.parse(event.payload)

        # emit downstream - routing handled by the bus
        return [Event(
            event_type=EventType.PERCEIVED,
            task_id=event.task_id,
            payload={"intent": intent},
        )]
workflow.py
Workflow DSL
import asyncio

from agent_sdk import Workflow

# declarative - no graph wiring, just steps
workflow = Workflow(
    name="csv_analysis",
    steps=[
        "perceptor", "decomposer",
        "proposer", "safety",
        "executor", "critic",
    ],
)


async def main() -> None:
    # single call - routing & fan-out automatic
    result = await workflow.execute(
        task_id="task-123",
        payload={"query": "Analyze Q3 sales by region"},
    )
    print(result)


# `await` is only valid inside a coroutine, so drive the call with asyncio
asyncio.run(main())
In-memory test bus

Mock agents and a synchronous event bus for unit and integration tests - no Kafka required for local dev.

Zero-config topics

Topic creation, ACL setup, and discovery announcement happen automatically when an agent boots.

CLI for everything

Inspect topics, replay events, scaffold agents, run evals - all from man-cli.

The SDK, ESLL evaluation lab, and CLI are scheduled for public release in Q3 2026. Researchers and partners can request early access via the notify form.

Roadmap

Ten phases. Three shipped.

What's already in production, what's in flight, and what comes next - laid out honestly, with quarterly granularity.

Shipped - operational foundation
Phase 1
Core substrate
  • Kafka-based event bus
  • Discovery & routing
  • Dead-letter queue (DLQ)
  • Basic observability
Phase 2
LLM-first reasoning
  • Policy & Critic LLM
  • Self-consistency K-sampling
  • Confidence thresholds
  • Episodic + scratchpad memory
Phase 3
Capability expansion
  • 9 core agent roles
  • 6 micro-protocols
  • Receipt generation & verify
  • Safety rails & governance
In flight - Q1-Q2 2026
Phase 4 - Q1
Governance & trust
  • mTLS everywhere
  • Signed envelopes
  • OPA / ABAC policy
  • PII redaction
Phase 5 - Q1
Observability & HITL
  • OpenTelemetry traces
  • Eval harness (ESLL)
  • HITL console
  • Metrics dashboards
Phase 6 - Q1
Federation & scale
  • Multi-tenant isolation
  • Quotas & fairness
  • Cross-cluster federation
  • Chargeback & billing
Phase 7 - Q1-Q2
Event-Sourced Memory
  • ESM substrate
  • State reconstruction
  • Temporal indexes
  • Deterministic replay
Future - Q2-Q3 2026
Phase 8 - Q2
Learning & adaptation
  • Preference store & reward heads
  • Prompt tuning & calibration
  • Cascade optimizer
  • Canary deployments
Phase 9 - Q2-Q3
Scale & resilience
  • Exactly-once semantics
  • Checkpointing & recovery
  • Chaos engineering & DLQ replay
  • Disaster-recovery drills
Phase 10 - Q3
System evolution
  • Schema compatibility checker
  • Protocol negotiation
  • Dual-write migrations
  • Zero-downtime deploys
Resources & citations

The work this builds on top of.

MAN+ESM stands on a long tradition: distributed systems literature, multi-agent protocols, event-sourcing patterns. The novelty is the synthesis, not the parts.

Preprint - available PDF - 9 pages
Multi-Agent Network and Event-Sourced Memory: A Decentralized Architecture for Production-Scale Agent Coordination

Full architectural treatment - three-layer model, gossip protocol, Flow ID observability, ESM, micro-protocols, reliability mathematics, comparative analysis. The benchmark suite and reproducible notebooks publish alongside the public release.

Read the preprint
GitHub
Private - Q2 2026
adya-ai / man-esm

SDK, ESLL eval lab, CLI, reference agents, and benchmark notebooks. Public repo opens alongside the paper release.

Benchmark suite
Reproducible comparisons

Hardware specs, prompt corpus, framework reference implementations, latency & memory measurement scripts. Same task suite for every framework.

Selected prior work this builds on
[1] Smith. The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver. IEEE Trans. Computers, 1980.
[2] Demers et al. Epidemic Algorithms for Replicated Database Maintenance. PODC, 1987.
[3] Fowler. Event Sourcing. martinfowler.com, 2005.
[4] Kreps. The Log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn Engineering, 2013.
[5] Wu et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155, 2023.
[6] Chase et al. LangGraph: Stateful, Multi-Actor Applications with LLMs. LangChain Inc., 2024.
Push the field forward

Read the work. Run the benchmarks. Tear it apart.

We are publishing the architecture, the methodology, and the measurements precisely because we want them stress-tested. If you find a flaw in the design or a fragility in the numbers, that is the most useful conversation we can have right now.

Read the preprint