November 7, 2025

LLM Memory Systems Explained

An introductory guide to how LLMs handle 'memory', from context windows to retrieval systems and everything in between.

LLMs don't have memory. They're stateless: each response requires resending the entire conversation history.

Yet they reference earlier messages and maintain context across long interactions. How?

LLMs do not remember anything

Illustration of an LLM processing the entire prior chat transcript to answer

LLMs are stateless. Each inference is independent. Generating a single output token requires processing all preceding tokens as input.

To answer coherently, the LLM needs every previous message. This input is the "context window".
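
To make that concrete, here is a minimal sketch of what "stateless" means in practice: the client, not the model, holds the conversation, and every turn resends the whole thing. The call_llm function below is a stand-in for whichever chat API you use, not a specific library.

```python
# Minimal sketch: the client holds the state, the model holds nothing.
# call_llm is a placeholder for whatever chat-completion API you call.

def call_llm(messages: list[dict]) -> str:
    """Stand-in for an API call that returns the assistant's reply."""
    return "<assistant reply>"

conversation = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    # Every turn resends the ENTIRE history; the model keeps no state between calls.
    reply = call_llm(conversation)
    conversation.append({"role": "assistant", "content": reply})
    return reply
```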

This creates an optimization problem with three competing constraints:

  1. Latency (most critical): More tokens = slower inference. Users feel every millisecond.
  2. Cost: More input tokens = higher API costs. Scales linearly with conversation length.
  3. Accuracy: Finding the right amount of relevant context. Too little causes hallucinations; too much buries the relevant details in noise, which users perceive as hallucination.
Latency, cost, and accuracy triangle illustrating competing pressures on memory systems

You can't optimize all three. Sending the full history not only reduces accuracy past a certain point, it also hurts latency and cost. Aggressive compression helps latency and cost but degrades accuracy. Every memory system is picking a point on this trade-off triangle.
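
A rough back-of-the-envelope calculation shows how fast the cost side of the triangle grows when full history is resent every turn. The tokens-per-turn and price figures below are made-up assumptions purely for illustration.

```python
# Why resending full history gets expensive: turn n carries all earlier turns,
# so cumulative input tokens grow quadratically with conversation length.
# TOKENS_PER_TURN and PRICE_PER_MTOK are illustrative assumptions, not real prices.

TOKENS_PER_TURN = 200          # assumed average size of one message
PRICE_PER_MTOK = 3.00          # assumed dollars per million input tokens

def cumulative_input_tokens(turns: int) -> int:
    # Turn 1 sends 1 message, turn 2 sends 2, ... turn n sends n: sum = n*(n+1)/2.
    return TOKENS_PER_TURN * turns * (turns + 1) // 2

for turns in (10, 100, 1000):
    tokens = cumulative_input_tokens(turns)
    print(f"{turns:>5} turns -> {tokens:>12,} input tokens "
          f"(~${tokens / 1_000_000 * PRICE_PER_MTOK:,.2f})")
```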

Note: Internal caching (KV cache, KV pinning) reduces computation within the LLM but doesn't change the external interface. These techniques reduce latency and avoid reprocessing, but they are not memory systems.

How LLMs Remember

We build systems around this limitation. Three key techniques:

  1. Summarization: Compress old messages to reduce token count
  2. Fact tracking: Extract and store durable information across conversations
  3. Retrieval systems: Store and fetch relevant context on demand

These techniques combine to make LLMs appear to remember.

Summarization

Token-level compression within the context window. Compress old messages to keep only relevant information, preventing the LLM from getting distracted by noise while preserving key events.

Diagram showing how older chat turns are compressed into a smaller summary

A typical implementation uses a secondary LLM to maintain a running summary as new messages arrive.

Approaches:

  • Periodic (every N messages)
  • Sliding window (recent messages + summary of older ones)
  • Hierarchical (summaries of summaries)
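
Below is a minimal sketch of the sliding-window variant. The call_llm helper is a stand-in for a cheaper secondary model, and the window size and prompt wording are arbitrary illustrative choices, not a specific implementation.

```python
# Sliding-window summarization sketch: keep the last WINDOW messages verbatim
# and fold anything older into a running summary maintained by a secondary model.

WINDOW = 10

def call_llm(prompt: str) -> str:
    """Placeholder for a cheaper secondary model used only for summarization."""
    return "<updated summary>"

class ConversationMemory:
    def __init__(self) -> None:
        self.summary = ""             # running summary of everything outside the window
        self.recent: list[str] = []   # most recent messages, kept verbatim

    def add(self, message: str) -> None:
        self.recent.append(message)
        if len(self.recent) > WINDOW:
            evicted = self.recent.pop(0)
            # Fold the message that just slid out of the window into the summary.
            self.summary = call_llm(
                "Update the summary with this message, preserving key facts "
                "and decisions.\n\n"
                f"Summary:\n{self.summary}\n\nMessage:\n{evicted}"
            )

    def context(self) -> str:
        # What actually gets sent to the main model: summary + recent verbatim turns.
        return f"Summary of earlier conversation:\n{self.summary}\n\n" + "\n".join(self.recent)
```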

Trade-offs:

  • Reduces token cost significantly
  • Lossy: important details can be dropped
  • Adds summarization latency and cost
  • Quality depends on the summarization model
  • Summaries themselves can still exceed context window limits in very long conversations

Fact Tracking Across Conversations

Extends beyond the current context window. Extract durable facts about users that persist beyond individual conversations. If a user mentions they hate cinnamon in one conversation, that fact is available in all future conversations.

Illustration of durable fact storage persisting details between conversations

A typical implementation uses a secondary LLM to monitor conversations and extract facts worth persisting. Unlike summarization, these facts live beyond the conversation scope.

Common patterns:

  • Continuous extraction (after each message or periodically)
  • Structured storage (key-value pairs or entity graphs)
  • Conflict resolution (when facts contradict)
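
A minimal sketch of the key-value pattern follows. The extract_facts helper (backed by a secondary model in practice) and the last-write-wins conflict rule are illustrative assumptions, not any particular product's behaviour.

```python
# Fact tracking sketch: extract durable facts as key-value pairs and persist them
# beyond the current conversation; newest statement wins on conflict.

import json

def extract_facts(message: str) -> dict[str, str]:
    """Placeholder: ask a secondary model for durable user facts as JSON pairs,
    e.g. {"dislikes": "cinnamon"}."""
    return {}

class FactStore:
    def __init__(self) -> None:
        self.facts: dict[str, str] = {}   # persisted to real storage in practice

    def observe(self, message: str) -> None:
        for key, value in extract_facts(message).items():
            # Naive conflict resolution: the most recent statement overwrites the old one.
            self.facts[key] = value

    def to_prompt(self) -> str:
        # Injected into the system prompt of future conversations.
        return "Known user facts:\n" + json.dumps(self.facts, indent=2)
```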

Trade-offs:

  • Personalized experience across conversations
  • Requires fact extraction inference cost
  • Needs storage and retrieval infrastructure
  • Must handle contradictions and updates
  • Privacy considerations for storing user data

Retrieval Systems

Retrieval means selective inclusion, not persistence. With summarization and fact tracking we have stored memory, but we can't send all of it to the LLM without adding noise. When the user asks about "cinnamon", including "user does not like cinnamon" is useful; including "user has a dog named Oscar" is not.

Diagram of a retrieval system ranking stored memories for the next prompt

Retrieval systems solve this by fetching semantically relevant pieces of memory based on the current context.

How it works:

  1. Generate embeddings (vector representations) of stored information
  2. Store in a vector database
  3. On query: retrieve semantically similar pieces
  4. Inject retrieved context into the prompt
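
A minimal sketch of those four steps, using plain cosine similarity over an in-memory list instead of a real vector database. The embed function is a placeholder that returns pseudo-random vectors, so the ranking only becomes meaningful once a real embedding model is swapped in.

```python
# Retrieval sketch: embed, store, rank by cosine similarity, return top-k memories.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model; returns a pseudo-random vector keyed on the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

memories: list[tuple[str, np.ndarray]] = []

def store(memory: str) -> None:
    memories.append((memory, embed(memory)))

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scored = [
        (float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec))), text)
        for text, vec in memories
    ]
    # Highest similarity first; the top-k snippets get injected into the prompt.
    return [text for _, text in sorted(scored, reverse=True)[:k]]

store("user does not like cinnamon")
store("user has a dog named Oscar")
print(retrieve("which spices should I avoid?", k=1))
```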

Trade-offs:

  • Adds retrieval latency before inference
  • Quality depends on embedding model and retrieval strategy
  • Can miss connections between non-adjacent information
  • Scales beyond context window limits

Other retrieval methods exist (graph databases, keyword search), but vector-based retrieval is common and generalises well across use cases.

Note: Retrieval systems are also widely used for accessing large corpora (RAG - Retrieval-Augmented Generation). That's a related but different use case. Here we're focused on retrieving stored memories, not general document knowledge bases.

What We Have Today

Combining these techniques lets us build applications with long conversations, personalization, and large knowledge bases. We can deliver reasonable improvements to user experience.

Systems like Mem0 apply these techniques using vector stores for long-term memory. They demonstrate that production implementations are viable, though trade-offs remain.

Every technique forces serious trade-offs. The latency-cost-accuracy triangle is real. There's no perfect solution out there, and this is very much an unsolved problem.

Open Problems

The techniques above are starting points, not long-term solutions. Real questions remain:

Lossy compression: When do critical details get dropped? How do you know what to keep?

Retrieval precision: Semantic search doesn't always fetch the right data. How do we measure relevance accurately?

Determinism vs. adaptivity: Deterministic systems are debuggable but inflexible. Adaptive systems are powerful but unpredictable. Which matters more?

Evaluation: How do you test memory systems when conversations span thousands of turns and context is deeply nested and subjectively driven by user intent?

Active research areas:

  • Semantic compaction (compress by importance, not recency)
  • Learned compression (models that compress their own context)
  • Multi-modal memory (unified handling of images, audio, text)
  • Adaptive budgets (dynamic token allocation based on conversation needs)

LLMs process fixed-size inputs. How we manage that constraint (balancing latency, cost, and accuracy) is still being figured out.


We're exploring these problems with context-store, infrastructure for experimenting with memory techniques. Read about why we're building it.
