November 21, 2025

Universal LLM Memory Does Not Exist

I benchmarked Mem0 and Zep on MemBench to understand why production agents were failing. The memory systems cost 14-77x more and were 31-33% less accurate than a naive long-context baseline.

Over the last few weeks, I've been digging deep into LLM memory systems. My last post described the techniques they use.

Whenever I mentioned tools like Mem0 to engineers running agents in production, I got the same reaction: a collective sigh.

"It's heavy." "The latency kills us." "Great in theory."

I wanted to understand why.

Systems like Mem0 and Zep are sold on the promise of cutting costs by 90% and drastically reducing latency. Why is there such a disconnect between how these systems are sold and how they behave in production?

I took two of the most hyped systems, Zep (knowledge graph) and Mem0 ("Universal Memory"), and ran them against MemBench, a 2025 benchmark designed to test reflective memory and reasoning.

I expected to find some trade-offs. What I found instead was a huge "WTF" moment.

The Experiment

I didn't want to run a standard "retrieval" benchmark. Marketing pages are full of those. I wanted to see the total cost of operation and latency impact.

I set up a harness running gpt-5-nano against 4,000 conversational cases from MemBench. The goal was simple: have a conversation, extract facts, and recall them later. This is exactly the kind of workload these systems were built to handle and should excel at.

Here is what happened:

| Memory System | Precision | Avg. Input Tokens | Avg. Latency | Total Cost (4k cases) |
| --- | --- | --- | --- | --- |
| long-context (baseline) | 84.6% | 4,232 | 7.8s | $1.98 |
| Mem0 (vector) | 49.3% | 7,319 | 154.5s | $24.88 |
| Zep (graphiti) | 51.6% | 1.17M* | 224s | ~$152.60 |

*Partial run (1,730/4,000 cases). Averaged 1,028 LLM calls per case and roughly 2B input tokens in total. The run was aborted after 9 hours due to cost (OpenAI usage alerts fired).

You are reading that correctly. Zep burned 1.17 million tokens per test case.

At first, I thought my harness was broken. How can a simple conversation generate a million tokens of traffic? I dug into the logs, and what I found wasn't a bug. It was the architecture working exactly as designed.

You can run this experiment yourself with agentbench, the open-source benchmarking tool I built; see examples/membench_qa_test for details on how this run was set up.
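For concreteness, the core of the evaluation loop looks roughly like the sketch below. This is a hedged illustration of the workload, not agentbench's actual API; the `backend`, `llm`, and `judge_answer` names are placeholders I made up for this post.

```python
# Sketch of one benchmark case: ingest the conversation, then answer the
# probe question from whatever the memory system retrieves. Illustrative
# only -- "backend", "llm" and "judge_answer" are not agentbench's real API.
import time

def judge_answer(answer: str, expected: str) -> bool:
    # Crude correctness check, good enough for the sketch.
    return expected.lower() in answer.lower()

def run_case(backend, llm, case: dict) -> dict:
    t0 = time.monotonic()

    # Write phase: feed every conversational turn into the memory system.
    # For Mem0/Zep this is where the background "LLM-on-Write" work happens.
    for turn in case["conversation"]:
        backend.add(turn, user_id=case["id"])

    # Read phase: retrieve what the memory system deems relevant and answer
    # the benchmark question against that retrieved context only.
    context = backend.search(case["question"], user_id=case["id"])
    answer = llm.answer(question=case["question"], context=context)

    return {
        "correct": judge_answer(answer, case["expected"]),
        "latency_s": time.monotonic() - t0,
    }
```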

Architecture: "LLM-on-Write"

To understand why the latency & cost exploded, we have to look at how these systems work. They don't just "save" data. They employ a pattern I call LLM-on-Write.

They intercept every message and spin up background LLM processes to "extract" meaning.

1. Mem0: N+1 summarization & extraction

Mem0 is based on two main components: facts and summaries. It also supports graphs but we'll focus on vectors for now, as the graph functionality is disabled by default.

It runs three parallel background LLM processes on every interaction:

  1. Update a conversation timeline to keep a narrative summary (with an LLM)
  2. Identify facts and save them to a vector store (with an LLM)
  3. Check for contradictions, then update or remove stale facts (with an LLM)
Diagram showing the N+1 pattern where one user message triggers multiple background LLM calls

For every message your agent sends, Mem0 spins up three separate inference jobs.
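In pseudocode, the write path looks something like this. It is my simplification of the pattern described above, not Mem0's actual implementation; every name here is illustrative.

```python
# Simplified sketch of the "LLM-on-Write" pattern, not Mem0's actual code.
def on_message(message: str, store, summary, llm) -> None:
    # 1. Rewrite the running narrative summary (one LLM call).
    summary.text = llm.summarize(previous=summary.text, new_message=message)

    # 2. Extract candidate facts from the new message (one LLM call).
    facts = llm.extract_facts(message)

    # 3. Reconcile each fact against what is already stored: add, update,
    #    or delete existing entries (at least one more LLM call).
    existing = store.search(message)
    operations = llm.reconcile(new_facts=facts, existing_facts=existing)
    store.apply(operations)
```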

2. Zep: recursive explosion of LLM calls

Zep's Graphiti is a knowledge graph. When you say "I work on Project X", it doesn't just save the string. It wakes up an extractor LLM to:

  1. Identify the entity "User".
  2. Identify the entity "Project X".
  3. Create an edge "Works On".

But it doesn't stop there. It then traverses the graph to see if this new fact contradicts old facts (e.g., "User is already working on Project Y"). If it finds a conflict or a connection, it triggers another LLM call to resolve it.

Diagram showing the recursive LLM explosion in Zep's Graphiti

In my experiment, a single complex reasoning chain triggered a cascade of graph updates. The agent said one thing, which updated a node, which triggered a neighbor update, which triggered a re-summarization of the edge.
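A rough sketch of how one write fans out is below. Again, this is illustrative pseudocode of the pattern I observed in the logs, not Graphiti's actual code.

```python
# Illustrative sketch of the cascade, not Graphiti's implementation.
def ingest_episode(text: str, graph, llm) -> None:
    # Extract entities and relations from the raw message (LLM calls).
    entities = llm.extract_entities(text)          # e.g. "User", "Project X"
    edges = llm.extract_relations(text, entities)  # e.g. User -[works_on]-> Project X

    for edge in edges:
        # Look for existing edges that might contradict the new one,
        # e.g. "User works_on Project Y".
        for old_edge in graph.find_conflicting_edges(edge):
            # Resolving each conflict is another LLM call.
            resolution = llm.resolve(old_edge, edge)
            graph.apply(resolution)

            # Resolutions touch neighbouring nodes, whose summaries are then
            # regenerated (yet more LLM calls). This is the cascade.
            for node in graph.neighbors(old_edge):
                node.summary = llm.summarize_node(node)

        graph.add(edge)
```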

The Common Flaw: Fact Extraction

Despite their differences (Graph vs. Vector), both systems share a fatal architectural flaw when applied to agent execution: Fact Extraction. They both rely on LLMs to "interpret" raw data into "facts".

Diagram showing the hallucination funnel where raw data is compressed into fuzzy facts

This works for personalization ("User likes blue" is a safe extraction). If you need a CRM on top of your chat assistant, it's awesome. If you need to reduce the cost and latency of an autonomous agent, it's not.

Compounding hallucinations by LLMs

The extractor LLM is non-deterministic. It might rewrite "I was ill last year" as current_status: ill. The error happens at write time, meaning the data is corrupted before it even hits the database. No amount of retrieval optimization can fix a database filled with hallucinations.

Your primary LLM is at the mercy of the accuracy of the extractor LLM.
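Here is the failure mode in miniature. The example is made up, but it captures what a lossy extractor does to a time-qualified statement.

```python
# A made-up example of write-time corruption by a lossy extractor.
raw_message = "I was ill last year, but I'm fine now."

# What a lossless store keeps: the full, time-qualified statement.
lossless_record = {"role": "user", "content": raw_message, "timestamp": "2024-11-02"}

# What a non-deterministic extractor might write instead:
extracted_fact = {"subject": "user", "attribute": "current_status", "value": "ill"}

# Every future retrieval now serves the corrupted fact. Better ranking or
# re-retrieval cannot recover the "last year" qualifier, because it was
# destroyed before it ever reached the database.
```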

N+1 Latency & Cost Tax

Latency and cost compound as you add more LLM calls to your pipeline. For every message, you are triggering a chain of background inferences—extraction, summarization, graph updates.

We are slapping LLMs on top of LLMs, introducing noise and latency at every layer, and paying a premium for the privilege.

Marketing hides the real costs

So why is everyone buying this?

Because the marketing focuses on the wrong unit of measurement. Memory vendors advertise "Cost per Retrieval." They show you how cheap it is to fetch a small context window instead of reading the whole history.

But as an engineer or founder, you pay "Cost per Conversation".

You pay for the N+1 extraction tax. You pay for the recursive graph updates. You pay for the debugging time when the system throws away your error logs.
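A back-of-the-envelope calculation shows why the framing matters. Every number below is a placeholder assumption (turn counts, token counts, and the per-token price are not benchmark measurements), but the shape of the result is the point: the write-side tax dwarfs the retrieval savings.

```python
# Back-of-the-envelope: "cost per retrieval" vs "cost per conversation".
# All numbers are placeholder assumptions, not benchmark measurements.
turns_per_conversation = 40
retrieval_tokens = 1_500             # the small context window vendors advertise
full_history_tokens = 4_200          # what a naive long-context read ingests
background_calls_per_turn = 3        # extraction + summary + reconciliation
tokens_per_background_call = 2_000
price_per_1k_input_tokens = 0.00005  # placeholder price

def cost(tokens: int) -> float:
    return tokens / 1_000 * price_per_1k_input_tokens

# What the marketing compares: one cheap retrieval vs one long-context read.
cost_per_retrieval = cost(retrieval_tokens)
cost_per_long_read = cost(full_history_tokens)

# What you actually pay: the write-side tax on every single turn.
write_tax = cost(turns_per_conversation
                 * background_calls_per_turn
                 * tokens_per_background_call)

print(f"retrieval read:         ${cost_per_retrieval:.5f}")
print(f"long-context read:      ${cost_per_long_read:.5f}")
print(f"write tax, whole convo: ${write_tax:.5f}")  # dwarfs the retrieval savings
```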

To be fair: Zep is more honest than most. They claim temporal knowledge graphs for personalization and business context, and that's exactly what they build. The problem is that even for their stated use case (semantic memory), the cost is prohibitive at production scale. And despite this relative clarity, teams still use it for working-memory tasks (agent execution state) that it was never designed to handle.

The marketing hype feeds into itself. We want "Universal Memory" to be real because it sounds amazing. We want an infinite context window that costs nothing. But the physics of the architecture don't support it.

Working Memory & Semantic Memory

Diagram comparing working memory and semantic memory requirements

The conclusion from my experiment is clear: Universal Memory does not exist.

We are trying to solve two fundamentally different problems with one tool.

  • Semantic Memory, which is for the User. It tracks preferences, long-term history, and rapport. It should be fuzzy, extracted, and graph-based.
  • Working Memory, which is for the Agent. It tracks file paths, variable names, and immediate error logs. It must be lossless, temporal, and exact.

When you use a Semantic Memory tool (Zep, Mem0) for Working Memory tasks, you aren't just making a tradeoff. You are using the wrong architecture. You are trying to run a database on a lossy compression algorithm.

State is the application. You cannot compress your way to reliability.

Semantic memory is brilliant for personalization across sessions. It's catastrophic for execution state within a task. Treat them as separate systems with separate requirements.
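In code, that separation can be as simple as refusing to share an interface. The sketch below uses hypothetical class names; the point is the contract, not the implementation: the working-memory store never paraphrases, the semantic store is allowed to.

```python
# Sketch: two memories, two contracts. Class names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """For the agent: lossless, ordered, exact execution state."""
    events: list[dict] = field(default_factory=list)

    def append(self, event: dict) -> None:
        # Never summarize, never rewrite: file paths, error logs and tool
        # outputs go in verbatim, in order.
        self.events.append(event)

    def window(self, max_events: int = 200) -> list[dict]:
        # Trim by recency if you must, never by paraphrase.
        return self.events[-max_events:]

@dataclass
class SemanticMemory:
    """For the user: fuzzy, extracted, cross-session preferences and history."""
    facts: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        # Extraction and deduplication (possibly LLM-assisted) are fine here;
        # a slightly lossy fact like "user prefers dark mode" is harmless.
        self.facts.append(fact)
```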
