<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
      <title>fastpaca</title>
      <link>https://fastpaca.com</link>
      <description>Exploring how complex systems can work better</description>
      <generator>Zola</generator>
      <language>en</language>
      <atom:link href="https://fastpaca.com/feed.xml" rel="self" type="application/rss+xml"/>
      <lastBuildDate>Fri, 13 Feb 2026 00:00:00 +0000</lastBuildDate>
      <item>
          <title>Let&#x27;s Build an AI Assistant That Remembers</title>
          <pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate>
          <author>Sebastian Lund</author>
          <link>https://fastpaca.com/blog/build-ai-assistant-that-remembers/</link>
          <guid>https://fastpaca.com/blog/build-ai-assistant-that-remembers/</guid>
          <description xml:base="https://fastpaca.com/blog/build-ai-assistant-that-remembers/">&lt;p&gt;A founder friend messaged me recently:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;When do we trigger compaction? Context is finite, so at some point we have to compress. Priority-based, task-specific, time-based… what have you tried?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;These are the questions most people start with. How do I compress? When do I trigger that? How do I retrieve what’s relevant? They’re the right questions, but going from concepts to a working implementation isn’t straightforward.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s build one together: an assistant that remembers everything. It won’t be production-grade. It’s a foundation, enough to understand how the pieces fit together and a starting point for whatever your product needs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-starting-point&quot;&gt;The starting point&lt;&#x2F;h2&gt;
&lt;p&gt;First, we need to get a basic assistant going. We’ll use &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;nextjs.org&#x2F;&quot;&gt;Next.js&lt;&#x2F;a&gt; for the app, Vercel’s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ai-sdk.dev&#x2F;&quot;&gt;AI SDK&lt;&#x2F;a&gt; for LLM integration, and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;cria&quot;&gt;Cria&lt;&#x2F;a&gt; for prompt composition. AI SDK handles the plumbing and state management. Cria handles the prompt and all future memory components. Together, they keep us from reinventing the wheel while staying easy to customize.&lt;&#x2F;p&gt;
&lt;p&gt;A bare minimum chat route looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; app&#x2F;api&#x2F;chat&#x2F;route.ts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; { streamText }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; from&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;ai&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; { createOpenAI }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; from&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;@ai-sdk&#x2F;openai&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; openai&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; createOpenAI&lt;&#x2F;span&gt;&lt;span&gt;({ apiKey: process.env.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;OPENAI_API_KEY&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;export async function&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; POST&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Request&lt;&#x2F;span&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; req.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;json&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; streamText&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    model:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; openai&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;gpt-4o-mini&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    system:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;You are a helpful assistant.&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    messages,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;toDataStreamResponse&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And a client to use it:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;tsx&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; app&#x2F;page.tsx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;use client&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; { useChat }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; from&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;@ai-sdk&#x2F;react&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;export default function&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Chat&lt;&#x2F;span&gt;&lt;span&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; input&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; handleInputChange&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; handleSubmit&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; useChat&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;div&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      {messages.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;map&lt;&#x2F;span&gt;&lt;span&gt;((&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;div&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; key&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;{m.id}&amp;gt;{m.role&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ===&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;user&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ?&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;You: &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; :&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;AI: &amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;}{m.content}&amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;div&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ))}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;form&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; onSubmit&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;{handleSubmit}&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;input&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; value&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;{input}&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; onChange&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt;{handleInputChange} &#x2F;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;form&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &amp;lt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #85E89D;&quot;&gt;div&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That’s a working chat. User sends a message, model streams a response.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Here’s what a stylized version looks like. See the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;memory-assistant&quot;&gt;full example on GitHub&lt;&#x2F;a&gt; for the complete code.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;img src=&quot;&#x2F;images&#x2F;memory-assistant-simple.png&quot; alt=&quot;A simple assistant, no memory yet&quot; style=&quot;max-width: 560px;&quot; &#x2F;&gt;
&lt;p&gt;You now have a basic assistant wired together in minimal code. It won’t remember anything between sessions, but you can talk to it like you would ChatGPT.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;designing-the-memory-lifecycle&quot;&gt;Designing the memory lifecycle&lt;&#x2F;h2&gt;
&lt;p&gt;Before we wire memory up, we need to think about how it fits into our product. LLMs can’t keep going forever. They are constrained in several ways:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;LLMs have context limits that eventually prevent you from sending more input in the same session.&lt;&#x2F;li&gt;
&lt;li&gt;LLMs pay more attention to the &lt;em&gt;start&lt;&#x2F;em&gt; and &lt;em&gt;end&lt;&#x2F;em&gt; of their context window than to the middle. For an assistant, the initial messages and the latest ones always carry the most weight.&lt;&#x2F;li&gt;
&lt;li&gt;LLMs have diminishing returns past a certain context size (e.g., ~200,000 tokens for OpenAI models), and going beyond that increases the risk of hallucinations.&lt;&#x2F;li&gt;
&lt;li&gt;LLMs are hardwired through reinforcement learning to reach an “end state” where they solve a problem. If you’re a Claude user, you may have noticed how it gets short once it thinks it’s answered your question, and keeping the conversation going past that point almost becomes painful.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These constraints mean we need a natural end point for a session. For an assistant, that’s inactivity: when the user stops messaging, we know they’re done, so we end the session and save what happened. Most teams call this compaction.&lt;&#x2F;p&gt;
&lt;p&gt;For users who never go idle, we need to force sessions to end. Teams call this forced compaction: write memories, carry them into a new session, keep going. We want to avoid it when possible, and reinforcement learning helps. Models like Claude Sonnet and Opus naturally steer conversations toward resolution, nudging the user toward a stopping point.&lt;&#x2F;p&gt;
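These two triggers, the natural idle boundary and forced compaction, can be sketched as a single check. The thresholds and the `estimateTokens` helper below are illustrative placeholders, not values from the demo:

```typescript
// Decide whether to end the session and write memories.
// Both thresholds are illustrative, not tuned recommendations.
const IDLE_MS = 10 * 60 * 1000; // natural boundary: 10 minutes of silence
const TOKEN_BUDGET = 150_000;   // forced boundary: stay under the model's sweet spot

type Session = { lastActivityAt: number; messages: { content: string }[] };

// Rough estimate (~4 chars per token); swap in a real tokenizer in production.
function estimateTokens(session: Session): number {
  return Math.ceil(
    session.messages.reduce((sum, m) => sum + m.content.length, 0) / 4
  );
}

function shouldCompact(session: Session, now: number): "idle" | "forced" | null {
  if (now - session.lastActivityAt >= IDLE_MS) return "idle";
  if (estimateTokens(session) >= TOKEN_BUDGET) return "forced";
  return null;
}
```

Run a check like this on each request (or on a timer) and kick off memory extraction when it returns a reason.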
&lt;p&gt;So we know &lt;em&gt;when&lt;&#x2F;em&gt; to write memories. The next question is &lt;em&gt;what&lt;&#x2F;em&gt; to write. We need episodic memory to summarize past conversations, semantic memory to remember durable facts about the user, and a recall mechanism for finding relevant past sessions once there are too many to include directly.&lt;&#x2F;p&gt;
&lt;p&gt;These three memory types let the assistant:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Summarize past sessions, i.e., “what happened before the user went idle”&lt;&#x2F;li&gt;
&lt;li&gt;Maintain a fact sheet about the user, e.g., “what is their name?”&lt;&#x2F;li&gt;
&lt;li&gt;Recall past sessions by topic, i.e., “what did we talk about last week?”&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;&#x2F;th&gt;&lt;th&gt;Memory type&lt;&#x2F;th&gt;&lt;th&gt;Primitive&lt;&#x2F;th&gt;&lt;th&gt;What it does&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Working&lt;&#x2F;td&gt;&lt;td&gt;Current context&lt;&#x2F;td&gt;&lt;td&gt;Recent messages&lt;&#x2F;td&gt;&lt;td&gt;Last N messages kept verbatim&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Episodic&lt;&#x2F;td&gt;&lt;td&gt;What happened&lt;&#x2F;td&gt;&lt;td&gt;Summarizer&lt;&#x2F;td&gt;&lt;td&gt;Distills each session into a structured log&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Semantic&lt;&#x2F;td&gt;&lt;td&gt;Who this person is&lt;&#x2F;td&gt;&lt;td&gt;Summarizer&lt;&#x2F;td&gt;&lt;td&gt;Extracts durable facts into a structured user profile&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Vector recall&lt;&#x2F;td&gt;&lt;td&gt;Specific past conversations&lt;&#x2F;td&gt;&lt;td&gt;Vector DB&lt;&#x2F;td&gt;&lt;td&gt;Indexes session summaries for similarity search&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
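To make the table concrete, here is one possible shape for what each layer persists. These types are illustrative only; the demo's actual schema is whatever the stores and prompts in the linked repo produce:

```typescript
// Illustrative shapes for what each memory layer stores.
type WorkingMemory = {
  messages: { role: "user" | "assistant"; content: string }[]; // last N, verbatim
};

type EpisodicSummary = {
  date: string;       // ISO date of the session
  objective: string;  // what the user was trying to do
  timeline: string[]; // key events, decisions, and discoveries
  outcome: string;    // what got done, what's unresolved
};

type SemanticProfile = {
  facts: Record<string, string>; // durable facts, e.g. { name: "Ada" }
};

type VectorRecord = {
  sessionId: string;
  embedding: number[]; // embedding of the session summary
  summary: string;     // text returned on similarity hits
};
```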
&lt;p&gt;&lt;em&gt;I covered &lt;a href=&quot;&#x2F;blog&#x2F;ultimate-guide-to-llm-memory&quot;&gt;LLM memory types&lt;&#x2F;a&gt; in a previous post if you want to know more about how to design this.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Notice the summarizer appears twice. Episodic and semantic memory both use summarization, but for different purposes: episodic distills &lt;em&gt;one session&lt;&#x2F;em&gt; into a log of events. Semantic extracts &lt;em&gt;durable facts across many sessions&lt;&#x2F;em&gt; into a structured profile. Same primitive, different keys, different prompts.&lt;&#x2F;p&gt;
&lt;p&gt;The flow:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;User chats → conversation accumulates (working memory)&lt;&#x2F;li&gt;
&lt;li&gt;User goes idle → session boundary fires&lt;&#x2F;li&gt;
&lt;li&gt;On boundary → extract episodic, semantic, and vector memories&lt;&#x2F;li&gt;
&lt;li&gt;User returns → fresh session, bootstrapped with stored memories&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
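The four steps above, sketched end to end. `extractMemories` and `bootstrapContext` are hypothetical stand-ins for the layers we build in the next sections:

```typescript
// Sketch of the session lifecycle. The helpers are placeholders for the
// episodic/semantic/vector extraction built later.
type Message = { role: "user" | "assistant"; content: string };
type MemoryStore = { episodic: string[]; facts: Record<string, string> };

function extractMemories(store: MemoryStore, session: Message[]): void {
  // Placeholder: a real implementation calls the summarizer and vector index.
  store.episodic.push(`session with ${session.length} messages`);
}

function bootstrapContext(store: MemoryStore): Message[] {
  // A fresh session starts from stored memories instead of raw history.
  return store.episodic.map((s) => ({ role: "assistant" as const, content: s }));
}

// 1. Chat accumulates working memory...
const store: MemoryStore = { episodic: [], facts: {} };
const session: Message[] = [{ role: "user", content: "hi" }];
// 2-3. The idle boundary fires and memories are extracted...
extractMemories(store, session);
// 4. ...then the next session is bootstrapped from them.
const next = bootstrapContext(store);
```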
&lt;p&gt;Now we know what to build, so let’s get to it!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;structuring-memory&quot;&gt;Structuring memory&lt;&#x2F;h2&gt;
&lt;p&gt;Multiple parts of the app use memory: the chat route reads it, the save endpoint writes it. It’s best to wrap it in an interface we can iterate on internally and plug in wherever it’s needed:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; src&#x2F;lib&#x2F;memory.ts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; SessionMessage&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; role&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;user&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; |&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;assistant&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; content&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; string&lt;&#x2F;span&gt;&lt;span&gt; };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; MemoryManager&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; prompt&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; string&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; SessionMessage&lt;&#x2F;span&gt;&lt;span&gt;[])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Promise&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;PromptPlugin&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;    &#x2F;&#x2F; Build a memory-aware prompt for the chat route&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; save&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; string&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; SessionMessage&lt;&#x2F;span&gt;&lt;span&gt;[])&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Promise&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;    &#x2F;&#x2F; Extract and store memories when a session ends&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;export const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; Memory&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; MemoryManager&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Plugging the module into our assistant is straightforward:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; app&#x2F;api&#x2F;chat&#x2F;route.ts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; memory&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; Memory.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(userId, messages);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; rendered&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(provider)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;SYSTEM_PROMPT&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;use&lt;&#x2F;span&gt;&lt;span&gt;(memory)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;render&lt;&#x2F;span&gt;&lt;span&gt;({ budget:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 40_000&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; streamText&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span&gt;, messages: rendered }).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;toDataStreamResponse&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;save&lt;&#x2F;code&gt; method wires up to its own route. In the demo, we added a “Save Session” button that forces the session to end. Much easier for testing than waiting minutes for the inactivity trigger.&lt;&#x2F;p&gt;
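A minimal save route might look like the sketch below. The `Memory` object here is a local stub so the snippet stands alone; the real route would import the `MemoryManager` from `src/lib/memory.ts`:

```typescript
// app/api/save/route.ts (sketch; Memory is stubbed for self-containment --
// the real app imports the MemoryManager singleton instead)
const Memory = {
  saved: [] as { userId: string; count: number }[],
  async save(userId: string, messages: { content: string }[]) {
    Memory.saved.push({ userId, count: messages.length });
  },
};

export async function POST(req: Request) {
  const { userId, messages } = await req.json();
  // Extract and store memories for the session that just ended.
  await Memory.save(userId, messages);
  return new Response(null, { status: 204 });
}
```

The client calls this route from the idle timer or the “Save Session” button.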
&lt;h2 id=&quot;layer-1-episodic-what-happened&quot;&gt;Layer 1: episodic memory (what happened)&lt;&#x2F;h2&gt;
&lt;p&gt;When a session ends, we have a full conversation sitting in memory. Could be five messages, could be fifty. Before we throw it away, we want to save the important pieces: what was the user trying to do, what happened, and what’s left unresolved. That’s episodic memory: a structured log of the session.&lt;&#x2F;p&gt;
&lt;p&gt;We need two things for this: somewhere to store these summaries, and something that can distill a conversation into one. We’ll use SQLite for storage since we’re prototyping, and Cria gives us a summarizer primitive that does the extraction:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; summaryStore&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; SqliteStore&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;StoredSummary&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  filename:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;.&#x2F;data&#x2F;cria.sqlite&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  tableName:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;cria_summaries&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; Factory: same store, different IDs and prompts produce different summaries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;id&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; string&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt; prompt&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;?:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; string&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;summarizer&lt;&#x2F;span&gt;&lt;span&gt;({ id, provider, store: summaryStore, priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span&gt;, prompt });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A summarizer takes a conversation in, extracts the pieces you care about according to your prompt, and stores the result. Call it again with new input, and it rolls the new content into the existing summary. Our &lt;code&gt;save&lt;&#x2F;code&gt; method is one call:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; save&lt;&#x2F;span&gt;&lt;span&gt;(userId: string, messages: SessionMessage[]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Date&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;toISOString&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;split&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;T&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; summary&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `episodic:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `Summarize this conversation into a structured session log.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Date: ${&lt;&#x2F;span&gt;&lt;span&gt;date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Objective: What the user was trying to do.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Timeline: Bullet points of what happened, key decisions and discoveries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Outcome: What got done, what&amp;#39;s unresolved.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Keep it concise. No markdown formatting. One line per bullet.`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;writeNow&lt;&#x2F;span&gt;&lt;span&gt;({ history:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; toScope&lt;&#x2F;span&gt;&lt;span&gt;(messages) });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We customize the prompt so the output is structured and scannable rather than a wall of prose. For example, if the user talks about deploying to Vercel, we may end up with a summary such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Date: 2026-02-13&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Objective: Deploy a Next.js app to Vercel.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Timeline:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- User confirmed app is running locally and hosted on GitHub.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Discussed managing environment variables for database connections and API keys.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Discovered Vercel supports separate env vars for production and preview deployments.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- User noted they could point preview deployments to a staging database.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Explored secrets management for CI pipeline; Vercel integrates with HashiCorp Vault and AWS Secrets Manager.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- User flagged potential compliance requirements for health data.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Outcome: Deployment approach settled. Pending: security team input on&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;compliance requirements for secrets management.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The model now knows &lt;em&gt;what the user was trying to do&lt;&#x2F;em&gt;, &lt;em&gt;what happened along the way&lt;&#x2F;em&gt;, and &lt;em&gt;where they left off&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now we wire the summary back into the prompt so the model sees it when the user returns:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; prompt&lt;&#x2F;span&gt;&lt;span&gt;(userId: string, messages: SessionMessage[]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; previousSession&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`episodic:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span&gt; cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(provider)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`## Previous Session&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;${&lt;&#x2F;span&gt;&lt;span&gt;previousSession&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;use&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;      &#x2F;&#x2F; Safeguard: if the conversation outgrows the token budget,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;      &#x2F;&#x2F; the summarizer kicks in and condenses older messages.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;      summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`history:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;plugin&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        history:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; toScope&lt;&#x2F;span&gt;&lt;span&gt;(messages),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We read the stored summary from last session and inject it as context. We wrap the current conversation in a summarizer plugin that acts as a compaction safeguard: Most of the time messages pass through unchanged, but if the conversation outgrows the token budget, the summarizer kicks in and condenses older turns automatically. The same primitive we use to save memories between sessions also protects us within a single session.&lt;&#x2F;p&gt;
&lt;p&gt;That single layer already lets the user leave and come back the next day with context intact. The problem is that each save overwrites the previous summary. Storing them separately doesn’t help much either: The model would read disconnected session logs. It knows &lt;em&gt;what happened&lt;&#x2F;em&gt; each time but has no coherent picture of &lt;em&gt;who this person is&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;layer-2-who-is-this-person&quot;&gt;Layer 2: who is this person&lt;&#x2F;h2&gt;
&lt;p&gt;After three sessions, the model might know:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Session 1: User asked about Python pandas.&lt;&#x2F;li&gt;
&lt;li&gt;Session 2: User debugged a data pipeline.&lt;&#x2F;li&gt;
&lt;li&gt;Session 3: User asked about deployment.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These are isolated events. What we want the model to know is: this is a data engineer who works primarily in Python and is currently building a pipeline they want to deploy. That’s semantic memory. Durable facts that persist regardless of which session they came from.&lt;&#x2F;p&gt;
&lt;p&gt;The approach is simple: take each session summary and feed it into a second summarizer that maintains a rolling user profile. Same tool, different purpose. We extend &lt;code&gt;save&lt;&#x2F;code&gt; to capture the summary and pipe it forward:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; save&lt;&#x2F;span&gt;&lt;span&gt;(userId: string, messages: SessionMessage[]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Date&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;toISOString&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;split&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;T&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; summary&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `episodic:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `Summarize this conversation into a structured session log.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Date: ${&lt;&#x2F;span&gt;&lt;span&gt;date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Objective: What the user was trying to do.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Timeline: Bullet points of what happened, key decisions and discoveries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Outcome: What got done, what&amp;#39;s unresolved.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Keep it concise. No markdown formatting. One line per bullet.`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;writeNow&lt;&#x2F;span&gt;&lt;span&gt;({ history:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; toScope&lt;&#x2F;span&gt;&lt;span&gt;(messages) });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `profile:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `Extract durable facts about the user. One fact per line as &amp;#39;category: value&amp;#39;.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Categories: name, role, company, tools, preferences, current projects.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Only include facts explicitly stated. Omit unknown categories.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;When new info contradicts old, keep only the latest.`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;writeNow&lt;&#x2F;span&gt;&lt;span&gt;({ history: cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;user&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`Session summary:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;${&lt;&#x2F;span&gt;&lt;span&gt;summary&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;) });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Two summarizers, two keys, two different prompts. The episodic summarizer produces a structured session log. The profile summarizer extracts durable facts and stores them as a flat list. After the first session (deploying to Vercel), the stored profile contains:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;tools: GitHub, Vercel, HashiCorp Vault, AWS Secrets Manager&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;preferences: Staging database for preview deployments&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;current projects: Deploying a Next.js app to Vercel&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The user comes back the next day and debugs a React performance issue. After that session ends, the profile has grown:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;tools: Next.js, Vercel, GitHub, React&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;preferences: Staging database for preview deployments&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;current projects: Deploying a Next.js app to Vercel, Fixing re-rendering in a React list component&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Same store, same key. The summarizer rolled the new session’s facts into the existing profile. &lt;code&gt;current projects&lt;&#x2F;code&gt; expanded. &lt;code&gt;tools&lt;&#x2F;code&gt; updated. When the user switches from one tool to another, the “contradicts old, keep only the latest” instruction means the profile stays current automatically.&lt;&#x2F;p&gt;
&lt;p&gt;And in &lt;code&gt;prompt&lt;&#x2F;code&gt;, we add the profile at the top:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; prompt&lt;&#x2F;span&gt;&lt;span&gt;(userId: string, messages: SessionMessage[]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; userProfile&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`profile:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; previousSession&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`episodic:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span&gt; cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(provider)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`## What You Know About The User&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;${&lt;&#x2F;span&gt;&lt;span&gt;userProfile&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`## Previous Session&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;${&lt;&#x2F;span&gt;&lt;span&gt;previousSession&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;use&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;      summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`history:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;plugin&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        history:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; toScope&lt;&#x2F;span&gt;&lt;span&gt;(messages),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The prompt is growing. First, the user profile (a structured fact list), then the last session summary (what happened last time), then the current conversation. The model hasn’t seen a single new message yet and already knows the user’s tools, preferences, and current projects.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;layer-3-we-talked-about-this&quot;&gt;Layer 3: we talked about this&lt;&#x2F;h2&gt;
&lt;p&gt;The profile tells the model who this person is. The episodic summary tells it what happened last time. But neither helps with “remember that restaurant you mentioned last week?” The profile doesn’t track individual recommendations, and old session summaries aren’t searchable by topic.&lt;&#x2F;p&gt;
&lt;p&gt;Vector search fills that gap. The idea: turn each session summary into an embedding (a list of numbers that captures its meaning) and store it. Later, when the user asks a question, embed that question too and find the past sessions whose embeddings are most similar. “What restaurant?” matches the session where restaurants were discussed, even if the word “restaurant” never appeared in the summary itself.&lt;&#x2F;p&gt;
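&lt;p&gt;The math underneath that lookup is plain cosine similarity. Here’s a minimal sketch of the idea; the &lt;code&gt;cosineSimilarity&lt;&#x2F;code&gt; helper and the toy three-dimensional vectors are illustrative assumptions, not part of Cria or the vector store we set up next:&lt;&#x2F;p&gt;

```typescript
// Cosine similarity: how aligned two embedding vectors are (1 = same direction).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((v, i) => {
    dot += v * b[i];
    normA += v * v;
    normB += b[i] * b[i];
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings"; real models produce ~1536 dimensions.
const sessions = [
  { id: "deploy-session", vector: [0.9, 0.1, 0.0] },     // Vercel deployment talk
  { id: "restaurant-session", vector: [0.1, 0.8, 0.2] }, // restaurant recommendations
];
const query = [0.2, 0.9, 0.1]; // embedding of "what restaurant did I mention?"

// Rank stored session summaries by similarity to the query embedding.
const ranked = [...sessions].sort(
  (x, y) => cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector),
);
```

&lt;p&gt;The restaurant session ranks first even though its summary and the query share no keywords. That semantic matching is what the vector store handles for us at scale.&lt;&#x2F;p&gt;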
&lt;p&gt;We set up a vector store backed by the same SQLite database, with OpenAI’s embedding model to convert text into vectors:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; vectorStore&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; SqliteVectorStore&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;string&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  filename:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;.&#x2F;data&#x2F;cria.sqlite&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  tableName:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;cria_vectors&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  dimensions:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1536&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;  embed&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; async&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;text&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    const&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; embedding&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; embed&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      model: openai.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;embedding&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;text-embedding-3-small&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      value: text,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; embedding;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  schema: z.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;string&lt;&#x2F;span&gt;&lt;span&gt;(),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; vectorDB&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;vectordb&lt;&#x2F;span&gt;&lt;span&gt;(vectorStore);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One more addition to &lt;code&gt;save&lt;&#x2F;code&gt;. After updating the profile, we index the session summary so it’s searchable by similarity:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; save&lt;&#x2F;span&gt;&lt;span&gt;(userId: string, messages: SessionMessage[]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Date&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;toISOString&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;split&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;T&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)[&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; summary&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `episodic:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `Summarize this conversation into a structured session log.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Date: ${&lt;&#x2F;span&gt;&lt;span&gt;date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Objective: What the user was trying to do.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Timeline: Bullet points of what happened, key decisions and discoveries.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Outcome: What got done, what&amp;#39;s unresolved.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Keep it concise. No markdown formatting. One line per bullet.`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;writeNow&lt;&#x2F;span&gt;&lt;span&gt;({ history:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; toScope&lt;&#x2F;span&gt;&lt;span&gt;(messages) });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `profile:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;    `Extract durable facts about the user. One fact per line as &amp;#39;category: value&amp;#39;.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Categories: name, role, company, tools, preferences, current projects.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;Only include facts explicitly stated. Omit unknown categories.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;When new info contradicts old, keep only the latest.`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;writeNow&lt;&#x2F;span&gt;&lt;span&gt;({ history: cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;user&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`Session summary:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;${&lt;&#x2F;span&gt;&lt;span&gt;summary&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;) });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  await&lt;&#x2F;span&gt;&lt;span&gt; vectorDB.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;index&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; `session:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}:${&lt;&#x2F;span&gt;&lt;span&gt;date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}:${&lt;&#x2F;span&gt;&lt;span&gt;Date&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;now&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;()}`&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    data: summary,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That’s the complete &lt;code&gt;save&lt;&#x2F;code&gt;. Three writes when the session ends: distill the conversation into a session log, update the user profile, index for search. The summary already carries its own date and timeline, so the vector index gets rich, structured content to embed. All three happen at the same moment because that’s when the context is complete: The user has signaled “we’re done” through inactivity or clicking save.&lt;&#x2F;p&gt;
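&lt;p&gt;The inactivity trigger can be as small as a per-user timer that resets on every message. A minimal sketch; the &lt;code&gt;createInactivitySaver&lt;&#x2F;code&gt; name and &lt;code&gt;SaveFn&lt;&#x2F;code&gt; shape are illustrative, not part of Cria or the AI SDK:&lt;&#x2F;p&gt;

```typescript
// Sketch: trigger a save after a period of inactivity. Names are
// illustrative, not part of Cria or the AI SDK.
type SaveFn = (userId: string) => Promise<void>;

function createInactivitySaver(save: SaveFn, idleMs: number) {
  const timers = new Map<string, ReturnType<typeof setTimeout>>();
  return {
    // Call on every user message: reset that user's idle timer.
    touch(userId: string) {
      clearTimeout(timers.get(userId));
      timers.set(userId, setTimeout(() => {
        timers.delete(userId);
        void save(userId); // fire-and-forget; log errors in practice
      }, idleMs));
    },
    // Call when the user clicks "Save session": save immediately.
    async flush(userId: string) {
      clearTimeout(timers.get(userId));
      timers.delete(userId);
      await save(userId);
    },
  };
}
```

&lt;p&gt;The explicit save button and the idle timeout both funnel into the same &lt;code&gt;save&lt;&#x2F;code&gt; path, so the memory writes stay in one place.&lt;&#x2F;p&gt;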
&lt;p&gt;Now the final version of &lt;code&gt;prompt&lt;&#x2F;code&gt;. We add vector search using the user’s latest message as the query. And since vector search can surface any relevant past session by similarity, we no longer need the explicit previous session summary. If the last session is relevant to what the user is asking about, it’ll show up in the results.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;async&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; prompt&lt;&#x2F;span&gt;&lt;span&gt;(userId: string, messages: SessionMessage[]) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; userProfile&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`profile:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; latestQuestion&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; messages.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;findLast&lt;&#x2F;span&gt;&lt;span&gt;((&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;m&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; m.role&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ===&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;user&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)?.content&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; ??&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span&gt; cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(provider)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`## What You Know About The User&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;${&lt;&#x2F;span&gt;&lt;span&gt;userProfile&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;use&lt;&#x2F;span&gt;&lt;span&gt;(vectorDB.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;plugin&lt;&#x2F;span&gt;&lt;span&gt;({ query: latestQuestion, limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 4&lt;&#x2F;span&gt;&lt;span&gt; }))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;use&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;      summarizer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;`history:${&lt;&#x2F;span&gt;&lt;span&gt;userId&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;}`&lt;&#x2F;span&gt;&lt;span&gt;).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;plugin&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        history:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; toScope&lt;&#x2F;span&gt;&lt;span&gt;(messages),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    );&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That’s the complete prompt function we’ve been building toward. Three blocks: who this person is, what past sessions are relevant, and the current conversation. When the user asks “what restaurant did you recommend?”, the vector plugin finds the session where food came up, even if it was weeks ago. No explicit tagging, no categories to maintain.&lt;&#x2F;p&gt;
&lt;p&gt;One caveat: Embedding full session summaries works well when sessions are focused on a single topic. It gets weaker when a session covers multiple unrelated topics because the embedding becomes a “semantic average” that matches none of them precisely. Production systems handle this by indexing at topic-level granularity rather than session-level, or by combining semantic search with keyword search to catch what embeddings miss. For a starting point, per-session indexing gets you surprisingly far.&lt;&#x2F;p&gt;
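&lt;p&gt;If you do want topic-level granularity, one low-tech option is to have the summarizer emit markdown headings and split on them before indexing. A sketch, assuming a &lt;code&gt;## Topic&lt;&#x2F;code&gt; convention in the summary (the convention and the function name are assumptions, not something Cria provides):&lt;&#x2F;p&gt;

```typescript
// Sketch: split a session summary into topic chunks on "## Topic" headings,
// so each chunk can be embedded and indexed separately.
function splitByTopic(summary: string): { topic: string; text: string }[] {
  const chunks: { topic: string; text: string }[] = [];
  let current: { topic: string; text: string } | null = null;
  for (const line of summary.split("\n")) {
    const heading = line.match(/^##\s+(.*)/);
    if (heading) {
      if (current) chunks.push(current); // close the previous topic
      current = { topic: heading[1], text: "" };
    } else if (current) {
      current.text += line + "\n";
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

&lt;p&gt;Each chunk then gets its own vector index entry, so a session that covered both food and work matches food queries and work queries separately instead of averaging them.&lt;&#x2F;p&gt;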
&lt;p&gt;And when the token budget fills up, Cria drops layers in a predictable order:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;What drops first&lt;&#x2F;th&gt;&lt;th&gt;What happens&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Vector recall&lt;&#x2F;td&gt;&lt;td&gt;“We discussed this last week” stops working&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Conversation history&lt;&#x2F;td&gt;&lt;td&gt;Older turns get progressively summarized&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;User profile&lt;&#x2F;td&gt;&lt;td&gt;Minor friction, user re-explains preferences&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Everything degrades gracefully. For more on &lt;a href=&quot;&#x2F;blog&#x2F;ultimate-guide-to-llm-memory&#x2F;#prompt-layout&quot;&gt;prompt layout&lt;&#x2F;a&gt; and priority-based degradation, see the &lt;a href=&quot;&#x2F;blog&#x2F;ultimate-guide-to-llm-memory&#x2F;&quot;&gt;LLM memory guide&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
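&lt;p&gt;The underlying mechanic is simple enough to sketch in a few lines. This illustrates priority-based degradation in general, not Cria’s actual internals:&lt;&#x2F;p&gt;

```typescript
// Sketch of priority-based degradation (illustrative, not Cria's internals):
// the lowest-priority blocks drop first until the token estimate fits.
interface Block { name: string; priority: number; tokens: number }

function fitToBudget(blocks: Block[], budget: number): Block[] {
  // Sort a copy by ascending priority so we drop the least important first.
  const dropOrder = [...blocks].sort((a, b) => a.priority - b.priority);
  let total = dropOrder.reduce((sum, b) => sum + b.tokens, 0);
  while (total > budget && dropOrder.length > 0) {
    total -= dropOrder.shift()!.tokens; // drop the lowest-priority block
  }
  // Restore the original prompt order for the blocks that survived.
  return blocks.filter((b) => dropOrder.includes(b));
}
```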
&lt;h2 id=&quot;wiring-it-up&quot;&gt;Wiring it up&lt;&#x2F;h2&gt;
&lt;p&gt;Here’s the full chat route. Compare it to our starting point: The only addition is composing &lt;code&gt;Memory.prompt()&lt;&#x2F;code&gt; into the request.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; app&#x2F;api&#x2F;chat&#x2F;route.ts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;import&lt;&#x2F;span&gt;&lt;span&gt; { Memory, model, provider }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; from&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;@&#x2F;lib&#x2F;memory&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;export async function&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; POST&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Request&lt;&#x2F;span&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; userId&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; req.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;json&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; memoryPrompt&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; Memory.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(userId, messages);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; rendered&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;(provider)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt;SYSTEM_PROMPT&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;use&lt;&#x2F;span&gt;&lt;span&gt;(memoryPrompt)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;render&lt;&#x2F;span&gt;&lt;span&gt;({ budget:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 40_000&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; streamText&lt;&#x2F;span&gt;&lt;span&gt;({ model, messages: rendered }).&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;toDataStreamResponse&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The app owns the system prompt. Memory is a plugin that slots in via &lt;code&gt;.use()&lt;&#x2F;code&gt;. &lt;code&gt;render({ budget: 40_000 })&lt;&#x2F;code&gt; is where Cria figures out what fits and what gets dropped. The save endpoint is equally simple:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;&#x2F;&#x2F; app&#x2F;api&#x2F;save&#x2F;route.ts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;export async function&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; POST&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FFAB70;&quot;&gt;req&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; Request&lt;&#x2F;span&gt;&lt;span&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  const&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; userId&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; req.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;json&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  await&lt;&#x2F;span&gt;&lt;span&gt; Memory.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;save&lt;&#x2F;span&gt;&lt;span&gt;(userId, messages);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;  return&lt;&#x2F;span&gt;&lt;span&gt; Response.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;json&lt;&#x2F;span&gt;&lt;span&gt;({ ok:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; true&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;the-result&quot;&gt;The result&lt;&#x2F;h2&gt;
&lt;p&gt;The user introduces themselves. They click “Save session.” The system writes episodic, semantic, and vector memories. New session, empty chat. “What do you remember about me?”&lt;&#x2F;p&gt;
&lt;img src=&quot;&#x2F;images&#x2F;memory-assistant-demo.gif&quot; alt=&quot;User chats, saves the session, then asks the assistant what it remembers&quot; style=&quot;max-width: 640px;&quot; &#x2F;&gt;
&lt;p&gt;The assistant knows their name, location, what they’re working on, and what they wanted help with. Three memory layers, composed into a single prompt builder. The user never re-introduced themselves.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;where-to-go-from-here&quot;&gt;Where to go from here&lt;&#x2F;h2&gt;
&lt;p&gt;What we built is a working memory system, not a finished one. Three directions to explore:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Iterate on the prompts with real conversations.&lt;&#x2F;em&gt; The prompts we used are a starting point. Your product’s conversations will be different. Run a handful of real sessions through the summarizer, read the output, adjust. The shape of what gets stored is entirely determined by the prompt.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Consider atomic facts and knowledge graphs.&lt;&#x2F;em&gt; Our profile summarizer rolls new facts into an existing summary, which works but is lossy. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mem0ai&#x2F;mem0&quot;&gt;Mem0&lt;&#x2F;a&gt; extracts individual facts and diffs them (add, update, delete) so nothing gets silently dropped. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;getzep&#x2F;zep&quot;&gt;Zep&lt;&#x2F;a&gt; goes further and builds a knowledge graph from conversations, capturing entities and their relationships over time. That precision matters when your product relies on remembering specific details.&lt;&#x2F;p&gt;
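&lt;p&gt;The general shape of fact-level updates can be sketched without either library. In practice an LLM proposes the operations, since a fact being absent from a new session doesn’t mean it’s gone; the &lt;code&gt;Fact&lt;&#x2F;code&gt; and &lt;code&gt;FactOp&lt;&#x2F;code&gt; shapes below are hypothetical, not Mem0’s or Zep’s API:&lt;&#x2F;p&gt;

```typescript
// Sketch: applying add/update/delete operations to a keyed fact store.
// An extraction LLM decides the ops; applying them is plain code.
interface Fact { key: string; value: string }
type FactOp =
  | { op: "add"; fact: Fact }
  | { op: "update"; fact: Fact }
  | { op: "delete"; key: string };

function applyFactOps(facts: Fact[], ops: FactOp[]): Fact[] {
  const store = new Map(facts.map((f) => [f.key, f.value]));
  for (const o of ops) {
    if (o.op === "delete") store.delete(o.key);
    else store.set(o.fact.key, o.fact.value); // add and update are both a set
  }
  return [...store.entries()].map(([key, value]) => ({ key, value }));
}
```

&lt;p&gt;Unlike a rolled-up summary, nothing here gets compressed away: each fact survives until an explicit delete removes it.&lt;&#x2F;p&gt;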
&lt;p&gt;&lt;em&gt;Let your product decide what to remember.&lt;&#x2F;em&gt; A coding assistant needs to remember file paths and error patterns. A health app needs to remember medications and allergies. A tutoring app needs to remember what the student already understands. The memory layers are the same. The prompts and what you index are completely different. Start with what your users complain about forgetting.&lt;&#x2F;p&gt;
&lt;p&gt;For the theory behind these layers, see our &lt;a href=&quot;&#x2F;blog&#x2F;ultimate-guide-to-llm-memory&#x2F;&quot;&gt;guide to LLM memory&lt;&#x2F;a&gt; and &lt;a href=&quot;&#x2F;blog&#x2F;llm-memory-systems-explained&#x2F;&quot;&gt;memory systems explained&lt;&#x2F;a&gt;. And if you want to see where this composable approach goes at scale, read OpenAI’s writeup on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;openai.com&#x2F;index&#x2F;inside-our-in-house-data-agent&#x2F;&quot;&gt;their in-house data agent&lt;&#x2F;a&gt;. They use six context layers with the same fundamental pattern: different prompts producing different memory shapes, all composed into a single context window. The fact that this architecture works for an internal tool processing millions of queries is a strong signal that the building blocks we used here are the right ones.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Full code on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;memory-assistant&quot;&gt;GitHub&lt;&#x2F;a&gt;. Built with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;cria&quot;&gt;Cria&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Ultimate Guide to LLM Memory</title>
          <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
          <author>Sebastian Lund</author>
          <link>https://fastpaca.com/blog/ultimate-guide-to-llm-memory/</link>
          <guid>https://fastpaca.com/blog/ultimate-guide-to-llm-memory/</guid>
          <description xml:base="https://fastpaca.com/blog/ultimate-guide-to-llm-memory/">&lt;p&gt;Most LLM memory systems make your product worse.&lt;&#x2F;p&gt;
&lt;p&gt;Engineers add them expecting a database. Instead they get something slow, expensive, and unreliable. Mention memory tools to anyone running agents in production and you get the same reaction: &lt;em&gt;“It’s heavy.” “The latency kills us.” “Great in theory.”&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The problem isn’t the tools. It’s that &lt;a href=&quot;&#x2F;blog&#x2F;memory-isnt-one-thing&#x2F;&quot;&gt;there is no universal LLM memory&lt;&#x2F;a&gt;. The industry &lt;a href=&quot;&#x2F;blog&#x2F;llm-memory-systems-explained&#x2F;&quot;&gt;uses several methods&lt;&#x2F;a&gt; to fake it, and each solves a different problem. Pick the wrong one and you pay for capability you don’t need while missing what you do.&lt;&#x2F;p&gt;
&lt;p&gt;This guide covers the patterns, how to match them to your use case, and how to swap them out when something better drops.&lt;&#x2F;p&gt;
&lt;img src=&quot;&#x2F;images&#x2F;alpaca-plugging-memory-into-llm.png&quot; alt=&quot;Alpaca plugging memory into an LLM&quot; style=&quot;max-width: 440px;&quot; &#x2F;&gt;
&lt;h2 id=&quot;memory-database&quot;&gt;Memory ≠ Database&lt;&#x2F;h2&gt;
&lt;p&gt;Before we get into the details, we need to change your mental model about what “memory” means in the world of AI and LLMs.&lt;&#x2F;p&gt;
&lt;p&gt;Most of us adding memory to agents aren’t LLM experts. We build something with an LLM and want it to remember things between sessions. We search online, find some cool “memory system”, and slap it on top, only to discover it makes our product slow and confusing. And it barely remembers anything useful.&lt;&#x2F;p&gt;
&lt;p&gt;When we think of memory in software, we think about databases: we “store things somewhere” and “fetch them later”. Modern databases are robust and deterministic. Fifty years of software engineering has gone into making sure they don’t fail in production. Databases were built for data whose exact shape and constraints you knew up front, accessed by code you wrote yourself. They are deterministic by design.&lt;&#x2F;p&gt;
&lt;p&gt;LLMs operate under different constraints than the code you manually write to call a database:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Humans are unpredictable. LLMs are unpredictable. You’re interfacing between two unpredictable systems in a protocol that allows virtually anything (text).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;The type of data you deal with is “random text that contains useful stuff”, but you rarely know what’s useful up-front.&lt;&#x2F;li&gt;
&lt;li&gt;There’s an upper limit on how much data you can shove into an LLM before it becomes slow, costly, and inaccurate (confused).&lt;&#x2F;li&gt;
&lt;li&gt;LLMs are by their very nature non-deterministic. The same input can yield different results.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Every memory system is picking a point on this trade-off triangle:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-triangle.svg&quot; alt=&quot;Latency, cost, and accuracy triangle illustrating competing pressures on memory systems&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;These constraints mean you need to work &lt;em&gt;around&lt;&#x2F;em&gt; the non-determinism, the limitations of the LLM, and the nature of the input data.&lt;&#x2F;p&gt;
&lt;p&gt;Say you’re building a voice assistant for a hospital and want to store useful data for diagnostics. You cannot do so deterministically. Imagine saving data every time a user mentioned “feeling unwell”. Your database would quickly fill with useless entries from lonely patients discussing other people’s symptoms rather than their own.&lt;&#x2F;p&gt;
&lt;p&gt;Unpredictable systems are like async&#x2F;await. Once you introduce it, you can’t escape it. You can only build around it. Memory systems are inherently &lt;em&gt;unpredictable&lt;&#x2F;em&gt; because of the environment they’re deployed and built within. Databases are &lt;em&gt;predictable&lt;&#x2F;em&gt; by design. They are not the same.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-llm-memory&quot;&gt;What Is “LLM Memory”?&lt;&#x2F;h2&gt;
&lt;p&gt;Because LLMs and humans are both unpredictable, the industry is building memory in layers: working memory, episodic memory, semantic memory, and document memory. Each serves a different purpose and solves a different problem.&lt;&#x2F;p&gt;
&lt;p&gt;A useful mental model:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Working: what I’m &lt;em&gt;thinking&lt;&#x2F;em&gt; right now&lt;&#x2F;li&gt;
&lt;li&gt;Episodic: what I &lt;em&gt;experienced&lt;&#x2F;em&gt; (concrete events)&lt;&#x2F;li&gt;
&lt;li&gt;Semantic: what I &lt;em&gt;know&lt;&#x2F;em&gt; (facts from experience)&lt;&#x2F;li&gt;
&lt;li&gt;Document: what I can &lt;em&gt;look up&lt;&#x2F;em&gt; (external reference)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;working-memory&quot;&gt;Working Memory&lt;&#x2F;h3&gt;
&lt;p&gt;Working memory keeps the LLM &lt;em&gt;on track&lt;&#x2F;em&gt; with what it’s currently doing. Think of it like a scratch pad the LLM needs to update continuously to track what it’s doing, what it has tried, and what it’s trying to achieve.&lt;&#x2F;p&gt;
&lt;p&gt;Working memory is usually localized to the immediate context of the LLM and is never truncated or messed with. Simple versions include keeping the last 100 messages, storing data in a file and updating continuously, or maintaining a todo list.&lt;&#x2F;p&gt;
&lt;p&gt;LLMs usually take multiple turns to complete a task (making them agents), yet they’re stateless. Working memory helps them continue with a task they started in a previous turn.&lt;&#x2F;p&gt;
&lt;p&gt;LLMs have amnesia (much like Dory in Finding Nemo) and write things down in their working memory to remember next time. It’s the &lt;em&gt;log&lt;&#x2F;em&gt; of what they’re thinking about, what they’re planning to do, etc.&lt;&#x2F;p&gt;
&lt;p&gt;Working memory is a “structured context window” for your LLM.&lt;&#x2F;p&gt;
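&lt;p&gt;The “keep the last N messages” version mentioned above fits in a few lines. A minimal sketch with no library assumptions:&lt;&#x2F;p&gt;

```typescript
// Sketch: the simplest working memory is a sliding window over the
// message list, with system messages always preserved.
interface Message { role: "user" | "assistant" | "system"; content: string }

function slidingWindow(messages: Message[], max: number): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");
  // Keep only the most recent `max` conversation turns.
  return [...system, ...turns.slice(-max)];
}
```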
&lt;h3 id=&quot;episodic-memory&quot;&gt;Episodic Memory&lt;&#x2F;h3&gt;
&lt;p&gt;Working memory tracks what’s happening now. But what about yesterday? Last week? Episodic memory gives your LLM a timeline of past events it can reference.&lt;&#x2F;p&gt;
&lt;p&gt;Think of it as a structured history log. What happened, when, and in what order. “Last Tuesday you asked me to draft that email.” “Earlier in this conversation you mentioned your deadline.” Episodic memory makes these references possible.&lt;&#x2F;p&gt;
&lt;p&gt;Episodic memory works on selective inclusion: you decide what events to remember, how to log them, what details matter. An additional agent or LLM typically handles extraction, deciding what’s worth persisting for future reference.&lt;&#x2F;p&gt;
&lt;p&gt;Episodic memory is often called “summarization”, and you may see that term used instead. Summarization is the simplest technique for achieving episodic memory: you summarize past messages into a log of useful episodic events. &lt;a href=&quot;&#x2F;blog&#x2F;llm-memory-systems-explained&#x2F;#summarization&quot;&gt;Read more about it here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-summary.svg&quot; alt=&quot;Diagram showing how older chat turns are compressed into a smaller summary&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;semantic-memory&quot;&gt;Semantic Memory&lt;&#x2F;h3&gt;
&lt;p&gt;Users interacting across multiple sessions often feed the same data repeatedly into the LLM’s episodic and working memory. If I’m diabetic, I probably have to tell the LLM that every session. How else would it know?&lt;&#x2F;p&gt;
&lt;p&gt;Semantic memory extracts important facts from episodic and working memory and persists them for future sessions. It’s a way to remember user preferences and personalize experiences. &lt;a href=&quot;&#x2F;blog&#x2F;llm-memory-systems-explained&#x2F;#fact-tracking-across-conversations&quot;&gt;Read more about it here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Semantic memory, like episodic memory, works on selective inclusion. You use another LLM or agent to extract entities, facts, and relationships. Whatever is interesting to remember. These are then &lt;em&gt;selectively persisted&lt;&#x2F;em&gt; for future turns or sessions.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-summary-and-facts.svg&quot; alt=&quot;Illustration of durable fact storage persisting details between conversations&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;document-memory&quot;&gt;Document Memory&lt;&#x2F;h3&gt;
&lt;p&gt;Working, episodic, and semantic memory deal with information from the conversation or the user. Document memory is different: It’s reference material. Your knowledge base, your docs, your internal wikis.&lt;&#x2F;p&gt;
&lt;p&gt;The industry calls this RAG (Retrieval-Augmented Generation). You store documents in a searchable index, usually as vector embeddings. When the user sends a message, you search for relevant chunks and inject them into the prompt before calling the LLM. The LLM doesn’t know this information and doesn’t look it up. You do, and you enrich the prompt with what you find.&lt;&#x2F;p&gt;
&lt;p&gt;Say you’re building a customer support bot. The LLM has no idea what your pricing tiers are or how your refund policy works. You index your help docs. When a user asks “how do I get a refund?”, you retrieve the relevant policy and hand it to the LLM. Now it can answer accurately.&lt;&#x2F;p&gt;
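&lt;p&gt;The retrieve-then-inject flow can be sketched end to end. &lt;code&gt;search&lt;&#x2F;code&gt; and &lt;code&gt;callLLM&lt;&#x2F;code&gt; are placeholders for your vector store and model client, not a real API:&lt;&#x2F;p&gt;

```typescript
// Sketch of the retrieve-then-inject (RAG) flow. `search` and `callLLM`
// are placeholders for a vector store and a model client.
interface Chunk { text: string; score: number }

async function answerWithDocs(
  question: string,
  search: (q: string, k: number) => Promise<Chunk[]>,
  callLLM: (prompt: string) => Promise<string>,
): Promise<string> {
  const chunks = await search(question, 4);
  if (chunks.length === 0) {
    // Better to say so than to let the model guess.
    return "I could not find anything relevant in the docs.";
  }
  const context = chunks.map((c) => `- ${c.text}`).join("\n");
  return callLLM(`Answer using only this context:\n${context}\n\nQuestion: ${question}`);
}
```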
&lt;h2 id=&quot;when-to-use-what&quot;&gt;When to Use What&lt;&#x2F;h2&gt;
&lt;p&gt;Memory is additive. You start with one layer, need more capability, add another. Each layer solves a different problem. Each can be swapped independently as better solutions emerge.&lt;&#x2F;p&gt;
&lt;p&gt;Most apps start with working memory alone. You add layers as you need new capabilities:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;You want to…&lt;&#x2F;th&gt;&lt;th&gt;Add…&lt;&#x2F;th&gt;&lt;th&gt;Now you can…&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Have multi-turn conversations&lt;&#x2F;td&gt;&lt;td&gt;Working&lt;&#x2F;td&gt;&lt;td&gt;Track what’s happening right now&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Reference things from earlier&lt;&#x2F;td&gt;&lt;td&gt;+ Episodic&lt;&#x2F;td&gt;&lt;td&gt;“You mentioned this earlier” works&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Remember users across sessions&lt;&#x2F;td&gt;&lt;td&gt;+ Semantic&lt;&#x2F;td&gt;&lt;td&gt;Personalize without re-explaining&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Answer from your own docs&lt;&#x2F;td&gt;&lt;td&gt;+ Document&lt;&#x2F;td&gt;&lt;td&gt;Cite internal knowledge accurately&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This is a progression, not a menu. You stack layers. A production chat assistant might use all four: working memory for the current turn, episodic summaries for older context, semantic facts for personalization, and document retrieval for domain knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;You can also stack multiples of the same type. Two document stores: one for user uploads, one for your knowledge base. Two episodic memories: one for the current session, one for cross-session history. The layers compose however your use case demands.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Autonomous agents typically need all four&lt;&#x2F;strong&gt;, each tuned differently. Working memory tracks execution state: what step am I on, what have I tried, what’s next. Episodic memory logs what the agent learned across tasks. Semantic memory stores extracted facts about the environment and user. Document memory provides reference material for decision-making. Agents are more demanding than chat assistants because they execute multi-step plans where each layer plays a distinct role.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;latency-and-cost&quot;&gt;Latency and Cost&lt;&#x2F;h3&gt;
&lt;p&gt;Every layer adds latency and cost. Working memory scales with context length. Episodic and semantic memory add LLM calls for summarization and extraction. Document memory adds vector search plus the tokens you inject.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;blog&#x2F;memory-isnt-one-thing&#x2F;&quot;&gt;We benchmarked this elsewhere&lt;&#x2F;a&gt;. The tradeoffs are real. A voice assistant can’t wait two seconds for semantic memory extraction. A batch agent processing documents overnight doesn’t care.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;failure-semantics&quot;&gt;Failure Semantics&lt;&#x2F;h3&gt;
&lt;p&gt;Remember: LLMs and humans are both unpredictable. Your memory system will fail. The question isn’t &lt;em&gt;if&lt;&#x2F;em&gt;, it’s &lt;em&gt;how&lt;&#x2F;em&gt;. &lt;a href=&quot;&#x2F;blog&#x2F;failure-case-memory-layout&#x2F;&quot;&gt;We’ve written about this elsewhere&lt;&#x2F;a&gt;, but here’s the short version:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;&#x2F;th&gt;&lt;th&gt;When it fails…&lt;&#x2F;th&gt;&lt;th&gt;Product impact&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Working&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;The LLM forgets what it’s doing mid-task.&lt;&#x2F;td&gt;&lt;td&gt;Task fails. User starts over.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Episodic&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;“We discussed this earlier” stops working.&lt;&#x2F;td&gt;&lt;td&gt;User notices, re-explains.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Semantic&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;The LLM forgets who you are between sessions.&lt;&#x2F;td&gt;&lt;td&gt;Minor friction. User re-explains.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Document&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;td&gt;The LLM hallucinates instead of citing your docs.&lt;&#x2F;td&gt;&lt;td&gt;User gets wrong answer. Trust erodes.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Not all failures are equal. Semantic and episodic failures degrade the experience. Users notice, get annoyed, re-explain themselves. Frustrating, but recoverable.&lt;&#x2F;p&gt;
&lt;p&gt;Working and document failures are different. A task that fails midway wastes user time. A hallucinated answer damages trust, or worse, causes real harm. For these layers, it’s better to error explicitly than to silently produce garbage. If your document retrieval returns nothing relevant, say so. If working memory is corrupted, stop the task. Crashing beats confident hallucination.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;common-mistakes&quot;&gt;Common Mistakes&lt;&#x2F;h2&gt;
&lt;p&gt;These patterns hurt more products than they help:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Using the wrong layer for execution state.&lt;&#x2F;strong&gt; Tools like Mem0 and Zep are built for personalization. They extract facts, build user profiles, remember preferences. Great for “remember I prefer dark mode”. &lt;a href=&quot;&#x2F;blog&#x2F;memory-isnt-one-thing&#x2F;&quot;&gt;Catastrophic for “remember what step I’m on in this 12-step deployment”.&lt;&#x2F;a&gt; If your agent loses track mid-task, you’re using the wrong memory type.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Adopting a generic “memory solution”.&lt;&#x2F;strong&gt; Off-the-shelf memory tools make decisions for you. What to remember, what to forget, how to retrieve. These decisions should be yours. Your chat assistant and your autonomous agent have different needs. A tool that works for one will &lt;a href=&quot;&#x2F;blog&#x2F;failure-case-memory-layout&#x2F;&quot;&gt;produce weird failures&lt;&#x2F;a&gt; in the other.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Adding layers before you need them.&lt;&#x2F;strong&gt; Working memory is powerful on its own. Modern context windows are large, and your LLM can do a lot with just a well-structured prompt. Start there. Add episodic when users need to reference past conversations. Add semantic when personalization matters. Add document when you have a knowledge base worth querying. Each layer adds complexity. Earn that complexity by needing the capability.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Forgetting that layers are swappable.&lt;&#x2F;strong&gt; The point of composability isn’t architectural elegance. It’s survival. The best summarization technique today won’t be the best in six months. The RAG approach you choose now will be obsolete. Build so you can rip out any layer and replace it without rewriting everything else.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-to-add-memory&quot;&gt;How to Add Memory&lt;&#x2F;h2&gt;
&lt;p&gt;Memory systems aren’t mutually exclusive. You compose them.&lt;&#x2F;p&gt;
&lt;p&gt;A chat assistant might need all four types working together:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;System prompt + last 100 messages (working)&lt;&#x2F;li&gt;
&lt;li&gt;Summary of older conversation (episodic)&lt;&#x2F;li&gt;
&lt;li&gt;Extracted user facts (semantic)&lt;&#x2F;li&gt;
&lt;li&gt;RAG results from your knowledge base (document)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The trap is reaching for an off-the-shelf solution that promises to handle “memory” generically. These &lt;a href=&quot;&#x2F;blog&#x2F;failure-case-memory-layout&#x2F;&quot;&gt;produce weird failures&lt;&#x2F;a&gt; because they make decisions for you that should be yours to make.&lt;&#x2F;p&gt;
&lt;p&gt;The better approach: treat your prompt like code. Each memory type is a component. Components can be swapped, removed, or upgraded independently. When a better summarization technique drops, you swap that piece. When your needs change, you rip out what you don’t need. Your prompt structure stays stable while the pieces evolve.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;The examples below use &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;cria&quot;&gt;Cria&lt;&#x2F;a&gt;, a prompt composition library I built for exactly this problem. The concepts apply regardless of what tools you use.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;prompt-layout&quot;&gt;Prompt Layout&lt;&#x2F;h3&gt;
&lt;p&gt;Every piece of memory you pull in competes for space. System instructions, retrieved documents, conversation summaries, user facts, recent messages, the current query. They all need to fit. When you query an LLM, you’re assembling a prompt from these pieces. The layout matters.&lt;&#x2F;p&gt;
&lt;p&gt;LLMs weigh information two ways, and understanding both shapes how you structure prompts.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;By position.&lt;&#x2F;strong&gt; LLMs focus on the beginning and end of a prompt. Middle sections fade into background noise. Place stable context (system instructions, user facts) early. Place dynamic context (recent messages, the current query) at the end where it captures attention.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;By proportion.&lt;&#x2F;strong&gt; Whichever memory type dominates your prompt gets more of the model’s focus. The needle-in-a-haystack problem works in reverse: a bloated haystack doesn’t just hide needles, it captures attention that should go elsewhere. Pull in 1000 RAG chunks and the model drowns in them, even if five matter. Keep each memory system constrained so none overwhelms the rest.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-layout-simple.svg&quot; alt=&quot;Diagram showing the context window as a layout with regions: sacred content at top, compressible content below, with annotations showing what happens if each region is lost&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Think of your prompt as a layout with a fixed ceiling. Each model has a context window, a hard upper limit on tokens. You know this number for whatever model you use. But filling to the edge is a mistake. Performance degrades well before you hit the limit, sometimes dramatically. The goal is maximizing useful content while staying comfortably below the threshold where quality drops.&lt;&#x2F;p&gt;
&lt;p&gt;Assign each memory system a region. RAG results might get 2000 tokens. Conversation summaries get 1000. User facts get 500. Recent messages get 1500. These numbers are illustrative. Your actual budgets depend on your model’s context window and your use case. The point is giving each memory type a ceiling. System instructions and the current query are sacred, always included. These constraints keep one memory type from eating the haystack.&lt;&#x2F;p&gt;
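&lt;p&gt;&lt;em&gt;If your tooling doesn’t enforce ceilings for you, a naive clamp per region is enough to start. This is a sketch: the 4-characters-per-token estimate is a rough heuristic, and a real system should count with the model’s actual tokenizer:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;

```typescript
// Sketch: clamp a memory region to its token ceiling.
// estimateTokens uses a crude ~4 characters-per-token heuristic;
// real systems should use the model's actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function clampToBudget(text: string, maxTokens: number): string {
  if (maxTokens * 4 >= text.length) return text;
  // Keeps the head of the region; for recent messages you would
  // keep the tail instead, so the newest content survives.
  return text.slice(0, maxTokens * 4);
}
```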
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; working memory: what are we doing?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(instructions)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; document memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;vectorSearch&lt;&#x2F;span&gt;&lt;span&gt;({ store, query, limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span&gt; })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; semantic memory in another store (the same technique serves multiple memory types)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;vectorSearch&lt;&#x2F;span&gt;&lt;span&gt;({ store: factStore, query, limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 15&lt;&#x2F;span&gt;&lt;span&gt; })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; episodic memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;summary&lt;&#x2F;span&gt;&lt;span&gt;(messages, { id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;conversation-summary&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, store: summaryStore })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; working memory: the last messages for relevance&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; which also captures tool calls&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;last&lt;&#x2F;span&gt;&lt;span&gt;(messages, { n:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 100&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;composing-for-different-use-cases&quot;&gt;Composing for Different Use Cases&lt;&#x2F;h3&gt;
&lt;p&gt;Different applications need different memory compositions. A chat assistant, an autonomous agent, and a support bot all use the same building blocks but weight them differently.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Chat assistant&lt;&#x2F;strong&gt;: Personalization matters. Semantic memory (user facts) and episodic memory (conversation summaries) take up most of the layout. Document memory is secondary. You might not need RAG at all if the assistant is general-purpose.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(assistantInstructions)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; semantic memory dominates&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(userFacts)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;summary&lt;&#x2F;span&gt;&lt;span&gt;(messages, { id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;assistant-summary&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, store })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; working memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;last&lt;&#x2F;span&gt;&lt;span&gt;(messages, { n:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 20&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Autonomous agent&lt;&#x2F;strong&gt;: Execution state is critical. Working memory dominates. The agent needs to track what step it’s on, what it’s tried, what failed. Episodic memory logs task history for learning. Document memory provides reference material for decisions.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(agentInstructions)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; working memory dominates&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(currentPlan)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(executionLog)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; document memory for reference&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;vectorSearch&lt;&#x2F;span&gt;&lt;span&gt;({ store: knowledgeBase, query: currentTask, limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span&gt; })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; keep more messages in working memory so the agent&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; can sustain longer-running work&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;last&lt;&#x2F;span&gt;&lt;span&gt;(messages, { n:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 200&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Support bot&lt;&#x2F;strong&gt;: Document memory dominates. The bot needs to cite your help docs, knowledge base, product information. Semantic memory (customer info) helps personalize. Episodic memory matters less since support conversations are usually single-session.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(supportInstructions)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; document memory dominates&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;vectorSearch&lt;&#x2F;span&gt;&lt;span&gt;({ store: helpDocs, query: customerQuestion, limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 10&lt;&#x2F;span&gt;&lt;span&gt; })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; semantic memory for personalization&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(customerRecord)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; working memory: fewer messages needed, since support&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; conversations are usually short, single-session exchanges&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;last&lt;&#x2F;span&gt;&lt;span&gt;(messages, { n:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 25&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The composition pattern stays the same. What changes is which layers you include and how much of the layout you give them.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;handling-conflicts&quot;&gt;Handling Conflicts&lt;&#x2F;h3&gt;
&lt;p&gt;Memory layers contradict each other. Your semantic memory says “user prefers dark mode”. Your episodic summary says “user switched to light mode yesterday”. What wins?&lt;&#x2F;p&gt;
&lt;p&gt;LLMs handle this well if you capture &lt;em&gt;when&lt;&#x2F;em&gt; something was true. “Preferred dark mode (January 2024)” vs “switched to light mode (December 2025)” gives the model enough to reason about recency. Make sure your memory system records timestamps.&lt;&#x2F;p&gt;
&lt;p&gt;Off-the-shelf tools like &lt;a href=&quot;&#x2F;blog&#x2F;memory-isnt-one-thing&#x2F;&quot;&gt;Zep&lt;&#x2F;a&gt; and Mem0 handle reconciliation automatically, using an LLM to merge or update facts as they come in. If your use case depends on facts staying accurate over time, pick a memory system that supports this.&lt;&#x2F;p&gt;
&lt;p&gt;Vector search is trickier. RAG can fill your budget with contradictory facts, and you can’t tell which are current. Episodic memory helps because it preserves temporal order. A summary beats a bag of conflicting chunks.&lt;&#x2F;p&gt;
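&lt;p&gt;&lt;em&gt;Here’s a sketch of what “record timestamps” can look like in practice: render each stored fact with its observation date before it goes into the prompt, so the model can reason about recency. The fact shape and wording are illustrative, not any particular tool’s schema:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;

```typescript
// Sketch: attach observation dates to facts so the model can
// reason about recency. The Fact shape and wording are illustrative.
interface Fact {
  statement: string;
  observedAt: Date;
}

function renderFacts(facts: Fact[]): string {
  return facts
    .slice() // avoid mutating the caller's array
    .sort((a, b) => a.observedAt.getTime() - b.observedAt.getTime())
    .map((f) => f.statement + " (as of " + f.observedAt.toISOString().slice(0, 10) + ")")
    .join("\n");
}
```

&lt;p&gt;&lt;em&gt;Sorted oldest-to-newest, the conflicting dark-mode facts above become two dated lines, and the model can see that the light-mode preference is the current one.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;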
&lt;h3 id=&quot;a-complete-example&quot;&gt;A Complete Example&lt;&#x2F;h3&gt;
&lt;p&gt;Here’s a chat assistant with all four memory types, priority-based fitting, and swappable stores:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; summaryStore&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; PostgresStore&lt;&#x2F;span&gt;&lt;span&gt;({ tableName:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;summaries&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; vectorStore&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = new&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; QdrantStore&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  client: qdrantClient,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  collectionName:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;knowledge&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  embed: embedFunction,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; prompt&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;quot;You are a helpful assistant.&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Document memory: RAG results&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;vectorSearch&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    store: vectorStore,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    query: userMessage,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 5&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt; &#x2F;* we can live without them *&#x2F;&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Episodic memory: auto-updating summary&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;summary&lt;&#x2F;span&gt;&lt;span&gt;(conversation, { id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;conversation-summary&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, store: summaryStore, priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt; })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Semantic memory: user facts&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;omit&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;await&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt; fetchUserFacts&lt;&#x2F;span&gt;&lt;span&gt;(userId)),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    { priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt; &#x2F;* we can live without them *&#x2F;&lt;&#x2F;span&gt;&lt;span&gt;, id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;user-facts&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Working memory: recent messages&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;last&lt;&#x2F;span&gt;&lt;span&gt;(conversation, { n:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 200&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; prompt.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;render&lt;&#x2F;span&gt;&lt;span&gt;({&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  provider: openai,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; cap at 100k because we see accuracy decrease beyond that&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  budget:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 100_000&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This example uses &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;cria&quot;&gt;Cria&lt;&#x2F;a&gt;, a prompt composition library we’ve been building. If we hit the budget, Cria shrinks what we can live without (semantic and document memory) while keeping the rest intact.&lt;&#x2F;p&gt;
&lt;p&gt;The pattern matters more than the tool. Structure your prompts so you can adapt as the space evolves.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-about-research-on-memory-for-llms&quot;&gt;What About Research on Memory for LLMs?&lt;&#x2F;h2&gt;
&lt;p&gt;The future looks promising. Real breakthroughs are happening. At some point “LLM + memory” will become a non-issue. We won’t get there without stepwise improvements in LLM infrastructure, but we’re moving in the right direction.&lt;&#x2F;p&gt;
&lt;p&gt;The pattern worth watching: systems we build &lt;em&gt;around&lt;&#x2F;em&gt; the LLM are slowly getting absorbed &lt;em&gt;into&lt;&#x2F;em&gt; the LLM. This is good news. It means we can pick winning patterns now, knowing the best ones will eventually become native capabilities.&lt;&#x2F;p&gt;
&lt;p&gt;As of early 2026, here’s how recent developments map to the four layers:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;html&#x2F;2601.07372v1&quot;&gt;DeepSeek Engram&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;: a conditional memory module that adds constant-time (O(1)) lookup into massive N-gram embedding tables, fused into the model via gating. Deterministic addressing enables prefetch and host-memory offload with minimal inference overhead. &lt;strong&gt;This is document memory baked into the model.&lt;&#x2F;strong&gt; Similar in spirit to vector search, but without the retrieval latency.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2512.24601&quot;&gt;Recursive Language Models (RLMs)&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;: an inference harness that treats long prompts as external environment state (e.g., a Python REPL variable), letting the model inspect it in code and recursively call itself over snippets. &lt;strong&gt;This extends working memory.&lt;&#x2F;strong&gt; The agent harnesses we build today, formalized as an inference pattern.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.anthropic.com&#x2F;news&#x2F;memory&quot;&gt;Claude chat search and memory&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;: Claude can search past conversations across sessions and maintain memory summaries across chats and projects. &lt;strong&gt;This is episodic and semantic memory as a product feature.&lt;&#x2F;strong&gt; Simple, but often wins in real workflows because it works.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These aren’t true memory yet. They don’t give LLMs the kind of episodic recall or semantic persistence we’ve described as fully general capabilities. But they’re absorbing specific layers: Engram eliminates document retrieval latency. RLMs formalize working memory patterns. Claude’s memory handles episodic and semantic for consumer use cases. The layers we build today are scaffolding. Some will become unnecessary as models absorb them. Others will evolve into something better.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;The goal isn’t finding the perfect memory system. It’s building something you can evolve.&lt;&#x2F;p&gt;
&lt;p&gt;This space moves fast. The state of the art today won’t be state of the art in three months. Whatever you build needs to adapt, or you’ll be ripping it out and starting over.&lt;&#x2F;p&gt;
&lt;p&gt;That’s the real argument for composability. Don’t settle for whatever’s kinda working today. Build a structure where you can swap out any layer when something better drops. Add capabilities when you actually need them. Evolve with the space instead of getting locked into last month’s best practice.&lt;&#x2F;p&gt;
&lt;p&gt;We’re building &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;cria&quot;&gt;Cria&lt;&#x2F;a&gt; to ensure you can swap out whatever you need in the future. If you’re solving similar problems, check it out.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Design Your LLM Memory Around How It Fails</title>
          <pubDate>Fri, 05 Dec 2025 00:00:00 +0000</pubDate>
          <author>Sebastian Lund</author>
          <link>https://fastpaca.com/blog/failure-case-memory-layout/</link>
          <guid>https://fastpaca.com/blog/failure-case-memory-layout/</guid>
          <description xml:base="https://fastpaca.com/blog/failure-case-memory-layout/">&lt;p&gt;Thursday afternoon. Your security team pings you:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Jon @blue-team: The Next.js RCE vulnerability just dropped. What version were we running on Monday? Were we exposed? Do we need to check logs for exploitation attempts?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;You: Wait what? What versions are affected?&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Jon @blue-team: 15.0.5, 15.1.9, 15.2.6, 15.3.6, 15.4.8, 15.5.7, 15.6.0-canary.58, 16.0.7 have been patched and are safe.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;You ask your dependency auditing agent. It doesn’t know.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-nextjs-failure.svg&quot; alt=&quot;Next.js RCE vulnerability scenario&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On Monday, it made several tool calls and analyzed 2,400 packages with version numbers (~60k tokens). The memory system summarized the results and discarded most version numbers.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Dependency audit complete: 2,400 packages scanned across 12 workspaces.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Key frameworks: Next.js, React, Prisma. 2 moderate vulnerabilities found in dev dependencies.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Versions: {&amp;quot;prisma&amp;quot;: &amp;quot;6.1.1&amp;quot;, &amp;quot;react&amp;quot;: &amp;quot;19.2.1&amp;quot;, ... }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Next.js was not in the versions list. The one fact that matters (what version were we running on Monday?) is gone. The agent trusts its own memory. It doesn’t check the logs.&lt;&#x2F;p&gt;
&lt;p&gt;The agent can fabricate a version, admit ignorance, or tell you “Next.js wasn’t flagged, you should be good”. Option three is the disaster. The agent was never supposed to be the source of truth. But it answered confidently, and confidence is contagious. You don’t check the logs. Meanwhile, attackers may have had a window you’ll never know about.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;nextjs.org&#x2F;blog&#x2F;cve-2025-29927&quot;&gt;CVE&lt;&#x2F;a&gt; is real. The scenario is hypothetical. But &lt;strong&gt;silent false negatives from lost context&lt;&#x2F;strong&gt; happen constantly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-problem-all-context-gets-treated-equally&quot;&gt;The Problem: All Context Gets Treated Equally&lt;&#x2F;h2&gt;
&lt;p&gt;Most LLM memory systems work the same way: append everything to a chronological buffer, run some RAG (and pray), and hope the right tokens survive when you hit the limit.&lt;&#x2F;p&gt;
&lt;p&gt;Those version numbers needed to be &lt;strong&gt;sacred&lt;&#x2F;strong&gt; or discarded completely. Instead, they got the same treatment as idle chit-chat: summarized into oblivion.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;New to LLM memory? Check out &lt;a href=&quot;&#x2F;blog&#x2F;llm-memory-systems-explained&#x2F;&quot;&gt;LLM Memory Systems Explained&lt;&#x2F;a&gt; for background, or &lt;a href=&quot;&#x2F;blog&#x2F;memory-isnt-one-thing&#x2F;&quot;&gt;Memory Isn’t One Thing&lt;&#x2F;a&gt; for why generic solutions fail.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;failure-first-design&quot;&gt;Failure-First Design&lt;&#x2F;h2&gt;
&lt;p&gt;When people talk about LLM memory, they start with architectures: vector vs. graph, RAG vs. long context, summarization vs. extraction.&lt;&#x2F;p&gt;
&lt;p&gt;That’s the wrong starting point.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How does your system fail when context is missing?&lt;&#x2F;strong&gt; That’s the first question.&lt;&#x2F;p&gt;
&lt;p&gt;Different agents fail in different ways:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;support chatbot&lt;&#x2F;strong&gt; tolerates fuzzy recall. Miss a minor detail, give a decent answer. No one notices.&lt;&#x2F;li&gt;
&lt;li&gt;A &lt;strong&gt;vulnerability agent&lt;&#x2F;strong&gt; cannot miss the one critical version number in a pile of false positives. One false negative and the customer walks.&lt;&#x2F;li&gt;
&lt;li&gt;An &lt;strong&gt;SRE agent&lt;&#x2F;strong&gt; cannot quietly ignore the only log line that explains why prod is on fire. Waste the on-call’s time once, and they rip it out of the incident channel.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Different failure modes require different memory shapes.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-layout-simple.svg&quot; alt=&quot;Diagram showing the context window as a layout with regions: sacred content at top, compressible content below, with annotations showing what happens if each region is lost&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;threat-modeling-for-context&quot;&gt;Threat Modeling for Context&lt;&#x2F;h2&gt;
&lt;p&gt;This is effectively threat modeling, but for attention spans instead of security boundaries.&lt;&#x2F;p&gt;
&lt;p&gt;For every piece of data you want to put in the prompt, ask:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Question&lt;&#x2F;th&gt;&lt;th&gt;Policy&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Missing context produces &lt;strong&gt;dangerous&lt;&#x2F;strong&gt; output?&lt;&#x2F;td&gt;&lt;td&gt;Sacred. Never drop.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Summarization causes &lt;strong&gt;false confidence&lt;&#x2F;strong&gt;?&lt;&#x2F;td&gt;&lt;td&gt;Critical. Never compress.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Truncation causes &lt;strong&gt;noticeable&lt;&#x2F;strong&gt; quality loss?&lt;&#x2F;td&gt;&lt;td&gt;Important. Compress last.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Removal still leaves &lt;strong&gt;useful&lt;&#x2F;strong&gt; output?&lt;&#x2F;td&gt;&lt;td&gt;Expendable. Drop first.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
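&lt;p&gt;One way to make the outcome of this exercise concrete is to encode the policy as data, so every piece of context carries its drop rule explicitly. A minimal sketch in TypeScript (the names here are illustrative, not from any particular library):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&#x2F;&#x2F; Eviction policies, ordered from most to least protected.
type Policy = &amp;quot;sacred&amp;quot; | &amp;quot;critical&amp;quot; | &amp;quot;important&amp;quot; | &amp;quot;expendable&amp;quot;;

interface ContextItem {
  id: string;
  content: string;
  policy: Policy;
}

&#x2F;&#x2F; Shed tokens starting from the least protected tier. Sacred is never dropped;
&#x2F;&#x2F; if sacred alone exceeds the budget, crashing beats silently losing it.
const DROP_ORDER: Policy[] = [&amp;quot;expendable&amp;quot;, &amp;quot;important&amp;quot;, &amp;quot;critical&amp;quot;];

function shed(
  items: ContextItem[],
  tokensOver: number,
  sizeOf: (item: ContextItem) =&amp;gt; number
): ContextItem[] {
  let over = tokensOver;
  const kept = [...items];
  for (const tier of DROP_ORDER) {
    for (let i = kept.length - 1; i &amp;gt;= 0; i -= 1) {
      if (over &amp;lt;= 0) return kept;
      if (kept[i].policy === tier) {
        over -= sizeOf(kept[i]);
        kept.splice(i, 1);
      }
    }
  }
  if (over &amp;gt; 0) throw new Error(&amp;quot;sacred content alone exceeds the budget&amp;quot;);
  return kept;
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;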
&lt;p&gt;Back to the dependency auditor:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Context Type&lt;&#x2F;th&gt;&lt;th&gt;Failure Mode&lt;&#x2F;th&gt;&lt;th&gt;Policy&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Version numbers from tool output&lt;&#x2F;td&gt;&lt;td&gt;Silent false negatives&lt;&#x2F;td&gt;&lt;td&gt;Sacred. Never drop.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;User’s prioritization decisions&lt;&#x2F;td&gt;&lt;td&gt;Mild confusion on follow-up&lt;&#x2F;td&gt;&lt;td&gt;Important. Compress last.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Reasoning steps&lt;&#x2F;td&gt;&lt;td&gt;None (can be reconstructed)&lt;&#x2F;td&gt;&lt;td&gt;Expendable. Drop first.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This exercise doesn’t tell you &lt;em&gt;what&lt;&#x2F;em&gt; to mark sacred. You can’t know a version number matters until a CVE drops. But it forces you to decide what happens when you’re wrong, before you’re wrong.&lt;&#x2F;p&gt;
&lt;p&gt;Once you’ve done this, the architecture follows naturally. Sacred content stays in fixed context, not semantic retrieval. Expendable content can live in RAG and get evicted freely. The hard part isn’t choosing between vector, graph, or long-context. It’s knowing what you can’t afford to lose.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;making-trade-offs-explicit&quot;&gt;Making Trade-offs Explicit&lt;&#x2F;h2&gt;
&lt;p&gt;Once you know what’s sacred and what’s expendable, enforce it. Several companies building serious LLM applications have already solved this, each in their own way.&lt;&#x2F;p&gt;
&lt;p&gt;One useful mental model comes from the Cursor team, who frame context management as a &lt;strong&gt;layout problem&lt;&#x2F;strong&gt;, not a storage problem (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;lexfridman.com&#x2F;cursor-team-transcript#chapter8_prompt_engineering&quot;&gt;Lex Fridman interview&lt;&#x2F;a&gt;). Instead of a flat buffer where everything competes equally, imagine regions with different eviction policies:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;cria&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;()&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Sacred: never drop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;system&lt;&#x2F;span&gt;&lt;span&gt;(systemPrompt)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(toolOutputs)&lt;&#x2F;span&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Version numbers live here&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; High priority: compress last&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;last&lt;&#x2F;span&gt;&lt;span&gt;(messages, { n:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 50&lt;&#x2F;span&gt;&lt;span&gt;, priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span&gt; })&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Low priority: drop if space is tight&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;omit&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    cria.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;prompt&lt;&#x2F;span&gt;&lt;span&gt;().&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;assistant&lt;&#x2F;span&gt;&lt;span&gt;(retrievedFacts),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    { priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 2&lt;&#x2F;span&gt;&lt;span&gt;, id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;retrieved-facts&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  )&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #6A737D;&quot;&gt;  &#x2F;&#x2F; Expendable: drop first&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  .&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;summary&lt;&#x2F;span&gt;&lt;span&gt;(olderHistory, { id:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;quot;older-history&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, store, priority:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 3&lt;&#x2F;span&gt;&lt;span&gt; });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When a user pastes a 20k-token stack trace, the layout engine knows what to sacrifice. Sacred regions don’t shrink. Low-priority regions absorb the pressure. If sacred regions alone exceed your budget? Probably better to crash than to silently drop something critical.&lt;&#x2F;p&gt;
&lt;p&gt;The value isn’t the specific implementation. It’s that the trade-offs are explicit in the code. You can look at this and know exactly what gets dropped when space runs out. No implicit decisions buried in a summarization heuristic.&lt;&#x2F;p&gt;
&lt;p&gt;You don’t need a layout engine to apply this. Separate prompt sections. Different summarization policies per source. A simple check that critical data hasn’t been truncated. The point is deciding explicitly, before your agent decides implicitly in production.&lt;&#x2F;p&gt;
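&lt;p&gt;The “simple check” can be as small as a guard that runs right before the model call and refuses to send a prompt that has lost critical data. A sketch (adjust the failure behavior to your agent; crashing loudly beats a silent false negative):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&#x2F;&#x2F; Verify every sacred fact still appears verbatim in the final prompt.
function assertSacredPresent(prompt: string, sacredFacts: string[]): void {
  const missing = sacredFacts.filter((fact) =&amp;gt; !prompt.includes(fact));
  if (missing.length &amp;gt; 0) {
    throw new Error(&amp;quot;sacred context missing from prompt: &amp;quot; + missing.join(&amp;quot;, &amp;quot;));
  }
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;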
&lt;hr &#x2F;&gt;
&lt;p&gt;Most memory systems are built around a question of capacity: &lt;em&gt;How much can I fit?&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The better question is: &lt;strong&gt;What happens when I can’t fit everything?&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If you don’t have an answer, your agent will improvise one. In production, under load, when it matters most.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Universal LLM Memory Does Not Exist</title>
          <pubDate>Fri, 21 Nov 2025 00:00:00 +0000</pubDate>
          <author>Sebastian Lund</author>
          <link>https://fastpaca.com/blog/memory-isnt-one-thing/</link>
          <guid>https://fastpaca.com/blog/memory-isnt-one-thing/</guid>
          <description xml:base="https://fastpaca.com/blog/memory-isnt-one-thing/">&lt;p&gt;Over the last few weeks, I’ve been digging deep into LLM memory systems. My &lt;a href=&quot;&#x2F;blog&#x2F;llm-memory-systems-explained&#x2F;&quot;&gt;last post&lt;&#x2F;a&gt; described the techniques they use.&lt;&#x2F;p&gt;
&lt;p&gt;Whenever I mentioned tools like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;mem0.ai&#x2F;&quot;&gt;Mem0&lt;&#x2F;a&gt; to engineers running agents in production, I got the same reaction: a collective &lt;em&gt;sigh&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;“It’s heavy.” “The latency kills us.” “Great in theory.”&lt;&#x2F;p&gt;
&lt;p&gt;I wanted to understand why.&lt;&#x2F;p&gt;
&lt;p&gt;Systems like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;mem0.ai&#x2F;&quot;&gt;Mem0&lt;&#x2F;a&gt; &amp;amp; &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.getzep.com&#x2F;&quot;&gt;Zep&lt;&#x2F;a&gt; are sold on the promise of cutting costs by 90% and drastically reducing latency. Why is there such a disconnect between how these systems are sold and how they behave in reality?&lt;&#x2F;p&gt;
&lt;p&gt;I took two of the most hyped systems: &lt;strong&gt;Zep&lt;&#x2F;strong&gt; (Knowledge Graph) &amp;amp; &lt;strong&gt;Mem0&lt;&#x2F;strong&gt; (“Universal Memory”) and ran them against &lt;strong&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2506.21605&quot;&gt;MemBench&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;, a 2025 benchmark designed to test reflective memory and reasoning.&lt;&#x2F;p&gt;
&lt;p&gt;I expected to find some trade-offs. What I found instead was a huge &lt;strong&gt;“WTF” moment&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-experiment&quot;&gt;The Experiment&lt;&#x2F;h2&gt;
&lt;p&gt;I didn’t want to run a standard “retrieval” benchmark. Marketing pages are full of those. I wanted to see the total &lt;strong&gt;cost of operation&lt;&#x2F;strong&gt; and latency impact.&lt;&#x2F;p&gt;
&lt;p&gt;I set up a harness running &lt;code&gt;gpt-5-nano&lt;&#x2F;code&gt; against 4,000 conversational cases from &lt;strong&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2506.21605&quot;&gt;MemBench&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;. The goal was simple: have a conversation, extract facts, and recall them later. It’s exactly the kind of workload these systems were built to manage, and should excel at.&lt;&#x2F;p&gt;
&lt;p&gt;Here is what happened:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Memory System&lt;&#x2F;th&gt;&lt;th&gt;Precision&lt;&#x2F;th&gt;&lt;th&gt;Avg. Input Tokens&lt;&#x2F;th&gt;&lt;th&gt;Avg. Latency&lt;&#x2F;th&gt;&lt;th&gt;Total Cost (4k cases)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;long-context (baseline)&lt;&#x2F;td&gt;&lt;td&gt;84.6%&lt;&#x2F;td&gt;&lt;td&gt;4,232&lt;&#x2F;td&gt;&lt;td&gt;7.8s&lt;&#x2F;td&gt;&lt;td&gt;$1.98&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Mem0 (vector)&lt;&#x2F;td&gt;&lt;td&gt;49.3%&lt;&#x2F;td&gt;&lt;td&gt;7,319&lt;&#x2F;td&gt;&lt;td&gt;154.5s&lt;&#x2F;td&gt;&lt;td&gt;$24.88&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Zep (graphiti)&lt;&#x2F;td&gt;&lt;td&gt;51.6%&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;1.17M&lt;&#x2F;strong&gt;*&lt;&#x2F;td&gt;&lt;td&gt;224s&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;~$152.6&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;&lt;em&gt;*Partial run (1,730&#x2F;4,000 cases). Averaged 1,028 LLM calls per case, 2B input tokens total. Run aborted after 9 hours due to cost (OpenAI usage alerts fired).&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You are reading that correctly. &lt;strong&gt;Zep burned 1.17 million tokens per test case.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;At first, I thought my harness was broken. How can a simple conversation generate a million tokens of traffic? I dug into the logs, and what I found wasn’t a bug. It was the architecture working exactly as designed.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can run this experiment yourself with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;pacabench&quot;&gt;pacabench&lt;&#x2F;a&gt;, the open-source benchmarking tool I built; see &lt;code&gt;examples&#x2F;membench_qa_test&lt;&#x2F;code&gt; for details on how this run was configured.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;architecture-llm-on-write&quot;&gt;Architecture: “LLM-on-Write”&lt;&#x2F;h2&gt;
&lt;p&gt;To understand why the latency &amp;amp; cost exploded, we have to look at how these systems work. They don’t just “save” data. They employ a pattern I call &lt;strong&gt;LLM-on-Write&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;They intercept every message and spin up background LLM processes to “extract” meaning.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-mem0-n-1-summarization-extraction&quot;&gt;1. Mem0: N+1 summarization &amp;amp; extraction&lt;&#x2F;h3&gt;
&lt;p&gt;Mem0 is built around two main components: facts and summaries. It also supports graphs, but we’ll focus on vectors here, as the graph functionality is disabled by default.&lt;&#x2F;p&gt;
&lt;p&gt;It runs &lt;strong&gt;three parallel background LLM processes&lt;&#x2F;strong&gt; on every interaction:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Update a conversation timeline to keep a narrative summary &lt;em&gt;(with an LLM)&lt;&#x2F;em&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Identify facts and save them to a vector store &lt;em&gt;(with an LLM)&lt;&#x2F;em&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Check for contradictions and update or remove stale facts &lt;em&gt;(with an LLM)&lt;&#x2F;em&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;semantic-memory-n-plus-1.svg&quot; alt=&quot;Diagram showing the N+1 pattern where one user message triggers multiple background LLM calls&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For every message your agent sends, Mem0 spins up three separate inference jobs.&lt;&#x2F;p&gt;
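&lt;p&gt;You can see the tax by counting inference calls yourself. A deliberately simplified sketch of the LLM-on-Write pattern (&lt;code&gt;llm()&lt;&#x2F;code&gt; stands in for any chat-completion call; this is not Mem0’s actual code):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&#x2F;&#x2F; LLM-on-Write: every stored message fans out into extra inference calls.
let llmCalls = 0;

async function llm(task: string, input: string): Promise&amp;lt;string&amp;gt; {
  llmCalls += 1; &#x2F;&#x2F; each call costs real latency and tokens
  return task + &amp;quot;-result&amp;quot;;
}

async function onWrite(message: string): Promise&amp;lt;void&amp;gt; {
  &#x2F;&#x2F; Three background jobs per message, mirroring the pipeline above.
  await Promise.all([
    llm(&amp;quot;summarize&amp;quot;, message),        &#x2F;&#x2F; update the running narrative
    llm(&amp;quot;extract-facts&amp;quot;, message),    &#x2F;&#x2F; mine facts for the vector store
    llm(&amp;quot;resolve-conflicts&amp;quot;, message) &#x2F;&#x2F; reconcile with previously stored facts
  ]);
}

&#x2F;&#x2F; A 50-message conversation triggers 150 background calls before a single
&#x2F;&#x2F; user-facing answer is produced.&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;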
&lt;h3 id=&quot;2-zep-recursive-explosions-of-llm-calls&quot;&gt;2. Zep: Recursive Explosions of LLM calls&lt;&#x2F;h3&gt;
&lt;p&gt;Zep’s Graphiti is a knowledge graph. When you say &lt;em&gt;“I work on Project X”&lt;&#x2F;em&gt; it doesn’t just save the string. It wakes up an extractor LLM to:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Identify the entity “User”.&lt;&#x2F;li&gt;
&lt;li&gt;Identify the entity “Project X”.&lt;&#x2F;li&gt;
&lt;li&gt;Create an edge “Works On”.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;But it doesn’t stop there. It then traverses the graph to see if this new fact contradicts old facts (e.g., “User is already working on Project Y”). If it finds a conflict or a connection, it triggers &lt;strong&gt;another&lt;&#x2F;strong&gt; LLM call to resolve it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;zep-graph-llm-explosion.svg&quot; alt=&quot;Diagram showing the recursive LLM explosion in Zep’s Graphiti&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In my experiment, a single complex reasoning chain triggered a cascade of graph updates. The agent said one thing, which updated a node, which triggered a neighbor update, which triggered a re-summarization of the edge.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-common-flaw-fact-extraction&quot;&gt;The Common Flaw: Fact Extraction&lt;&#x2F;h2&gt;
&lt;p&gt;Despite their differences (Graph vs. Vector), both systems share a fatal architectural flaw when applied to agent execution: &lt;strong&gt;Fact Extraction.&lt;&#x2F;strong&gt; They both rely on LLMs to “interpret” raw data into “facts”.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;fact-extraction-fallacy.svg&quot; alt=&quot;Diagram showing the hallucination funnel where raw data is compressed into fuzzy facts&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This works for personalization (&lt;em&gt;“User likes blue”&lt;&#x2F;em&gt; is a safe extraction). If you need a CRM on top of your chat assistant, it’s awesome. If you need to reduce cost and latency of an autonomous agent, it’s not.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;compounding-hallucinations-by-llms&quot;&gt;Compounding hallucinations by LLMs&lt;&#x2F;h3&gt;
&lt;p&gt;The extractor LLM is non-deterministic. It might rewrite &lt;em&gt;“I was ill last year”&lt;&#x2F;em&gt; as &lt;code&gt;current_status: ill&lt;&#x2F;code&gt;. The error happens at &lt;strong&gt;write time&lt;&#x2F;strong&gt;, meaning the data is corrupted before it even hits the database. No amount of retrieval optimization can fix a database filled with hallucinations.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Your primary LLM is at the mercy of the accuracy of the extractor LLM.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;n-1-latency-cost-tax&quot;&gt;N+1 Latency &amp;amp; Cost Tax&lt;&#x2F;h3&gt;
&lt;p&gt;Latency and cost compound as you add more LLM calls to your pipeline. For every message, you are triggering a chain of background inferences—extraction, summarization, graph updates.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are slapping LLMs on top of LLMs, introducing noise and latency at every layer, and paying a premium for the privilege.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;you-can-fit-so-many-llms-in-here.jpg&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;marketing-hides-the-real-costs&quot;&gt;Marketing hides the real costs&lt;&#x2F;h2&gt;
&lt;p&gt;So why is everyone buying this?&lt;&#x2F;p&gt;
&lt;p&gt;Because the marketing focuses on the wrong unit of measurement. Memory vendors advertise &lt;strong&gt;“Cost per Retrieval”&lt;&#x2F;strong&gt;. They show you how cheap it is to fetch a small context window instead of reading the whole history.&lt;&#x2F;p&gt;
&lt;p&gt;But as an engineer or founder, you pay &lt;strong&gt;“Cost per Conversation”&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;You pay for the &lt;strong&gt;N+1 extraction tax&lt;&#x2F;strong&gt;. You pay for the recursive graph updates. You pay for the debugging time when the system throws away your error logs.&lt;&#x2F;p&gt;
&lt;p&gt;To be fair: Zep is more honest than most. They claim temporal knowledge graphs for personalization and business context, and that’s exactly what they build. The problem is that even for their stated use case (semantic memory), the cost is prohibitive at production scale. And despite this relative clarity, teams still use it for working memory tasks (agent execution state) it was never designed to handle.&lt;&#x2F;p&gt;
&lt;p&gt;The marketing hype feeds into itself. We want “Universal Memory” to be real because it sounds amazing. We want an infinite context window that costs nothing. But the physics of the architecture don’t support it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;working-memory-semantic-memory&quot;&gt;Working Memory &amp;amp; Semantic Memory&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;two-types-of-memory.svg&quot; alt=&quot;Diagram comparing working memory and semantic memory requirements&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The conclusion from my experiment is clear: &lt;strong&gt;Universal Memory does not exist.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We are trying to solve two fundamentally different problems with one tool.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic Memory&lt;&#x2F;strong&gt;, which is for the &lt;em&gt;User&lt;&#x2F;em&gt;. It tracks preferences, long-term history, and rapport. It &lt;em&gt;should&lt;&#x2F;em&gt; be fuzzy, extracted, and graph-based.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Working Memory&lt;&#x2F;strong&gt;, which is for the &lt;em&gt;Agent&lt;&#x2F;em&gt;. It tracks file paths, variable names, and immediate error logs. It must be &lt;strong&gt;lossless, temporal, and exact&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;When you use a Semantic Memory tool (Zep, Mem0) for Working Memory tasks, you aren’t just making a tradeoff. You are using the wrong architecture. You are trying to run a database on a lossy compression algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;State is the application. You cannot compress your way to reliability.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Semantic memory is brilliant for personalization across sessions. It’s catastrophic for execution state within a task. Treat them as separate systems with separate requirements.&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>LLM Memory Systems Explained</title>
          <pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate>
          <author>Sebastian Lund</author>
          <link>https://fastpaca.com/blog/llm-memory-systems-explained/</link>
          <guid>https://fastpaca.com/blog/llm-memory-systems-explained/</guid>
          <description xml:base="https://fastpaca.com/blog/llm-memory-systems-explained/">&lt;p&gt;LLMs don’t have memory. They’re stateless: each response requires resending the entire conversation history.&lt;&#x2F;p&gt;
&lt;p&gt;Yet they reference earlier messages and maintain context across long interactions. How?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;llms-do-not-remember-anything&quot;&gt;LLMs do not remember anything&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-base.svg&quot; alt=&quot;Illustration of an LLM processing the entire prior chat transcript to answer&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;LLMs are stateless. Each inference is independent. Generating a single output token requires processing all preceding tokens as input.&lt;&#x2F;p&gt;
&lt;p&gt;To answer coherently, the LLM needs every previous message. This input is the “context window”.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;This creates an optimization problem with three competing constraints:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;&#x2F;strong&gt; (most critical): More tokens = slower inference. Users feel every millisecond.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;&#x2F;strong&gt;: More input tokens = higher API costs. Scales linearly with conversation length.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy&lt;&#x2F;strong&gt;: Finding the right amount of relevant context. Too little causes hallucinations, too much causes perceived hallucinations from noise.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-triangle.svg&quot; alt=&quot;Latency, cost, and accuracy triangle illustrating competing pressures on memory systems&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;You can’t optimize all three. Sending the full history not only reduces accuracy past a certain point, it also inflates latency and cost. Aggressive compression helps latency and cost but degrades accuracy. Every memory system picks a point on this trade-off triangle.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Note: Internal caching (KV cache, KV pinning) reduces computation within the LLM, but doesn’t change the external interface. These help reduce latency &amp;amp; avoid reprocessing, and are not memory systems.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-llms-remember&quot;&gt;How LLMs Remember&lt;&#x2F;h2&gt;
&lt;p&gt;We build systems around their current limitation. Three key techniques:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Summarization&lt;&#x2F;em&gt;: Compress old messages to reduce token count&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Fact tracking&lt;&#x2F;em&gt;: Extract and store durable information across conversations&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Retrieval systems&lt;&#x2F;em&gt;: Store and fetch relevant context on demand&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;These techniques combine to make LLMs appear to remember.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;summarization&quot;&gt;Summarization&lt;&#x2F;h3&gt;
&lt;p&gt;Token-level compression within the context window. Compress old messages to keep only relevant information, preventing the LLM from getting distracted by noise while preserving key events.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-summary.svg&quot; alt=&quot;Diagram showing how older chat turns are compressed into a smaller summary&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A typical implementation uses a secondary LLM to maintain a running summary as new messages arrive.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Approaches:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Periodic (every N messages)&lt;&#x2F;li&gt;
&lt;li&gt;Sliding window (recent messages + summary of older ones)&lt;&#x2F;li&gt;
&lt;li&gt;Hierarchical (summaries of summaries)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
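&lt;p&gt;The sliding-window approach, reduced to its core, keeps the last N messages verbatim and folds everything older into the running summary. A sketch, where &lt;code&gt;summarize&lt;&#x2F;code&gt; is whatever secondary-LLM call you use:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;interface Msg {
  role: &amp;quot;user&amp;quot; | &amp;quot;assistant&amp;quot;;
  content: string;
}

&#x2F;&#x2F; Keep the last `window` messages verbatim; fold older ones into the summary.
async function compact(
  history: Msg[],
  window: number,
  prevSummary: string,
  summarize: (prev: string, dropped: Msg[]) =&amp;gt; Promise&amp;lt;string&amp;gt;
): Promise&amp;lt;{ summary: string; recent: Msg[] }&amp;gt; {
  if (history.length &amp;lt;= window) {
    return { summary: prevSummary, recent: history };
  }
  const cut = history.length - window;
  const summary = await summarize(prevSummary, history.slice(0, cut));
  return { summary, recent: history.slice(cut) };
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;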
&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Reduces token cost significantly&lt;&#x2F;li&gt;
&lt;li&gt;Lossy: important details can be dropped&lt;&#x2F;li&gt;
&lt;li&gt;Adds summarization latency and cost&lt;&#x2F;li&gt;
&lt;li&gt;Quality depends on the summarization model&lt;&#x2F;li&gt;
&lt;li&gt;Can exceed context window limits for very long conversations&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;fact-tracking-across-conversations&quot;&gt;Fact Tracking Across Conversations&lt;&#x2F;h3&gt;
&lt;p&gt;Extends beyond the current context window. Extract durable facts about users that persist beyond individual conversations. If a user mentions they hate cinnamon in one conversation, that fact is available in all future conversations.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-summary-and-facts.svg&quot; alt=&quot;Illustration of durable fact storage persisting details between conversations&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A typical implementation uses a secondary LLM to monitor conversations and extract facts worth persisting. Unlike &lt;em&gt;summarization&lt;&#x2F;em&gt;, these facts live beyond the conversation scope.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Common patterns:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Continuous extraction (after each message or periodically)&lt;&#x2F;li&gt;
&lt;li&gt;Structured storage (key-value pairs or entity graphs)&lt;&#x2F;li&gt;
&lt;li&gt;Conflict resolution (when facts contradict)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
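&lt;p&gt;Structured storage with conflict resolution can start as something as small as a keyed map where a newer fact about the same key replaces the old one. A sketch (last-write-wins is a stand-in for whatever resolution policy you actually need):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;interface Fact {
  value: string;
  updatedAt: number;
}

class FactStore {
  private facts = new Map&amp;lt;string, Fact&amp;gt;();

  &#x2F;&#x2F; Last write wins: a newer fact about the same key replaces the old one.
  upsert(key: string, value: string, updatedAt: number): void {
    const existing = this.facts.get(key);
    if (!existing || updatedAt &amp;gt;= existing.updatedAt) {
      this.facts.set(key, { value, updatedAt });
    }
  }

  get(key: string): string | undefined {
    return this.facts.get(key)?.value;
  }
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;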
&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Personalized experience across conversations&lt;&#x2F;li&gt;
&lt;li&gt;Requires fact extraction inference cost&lt;&#x2F;li&gt;
&lt;li&gt;Needs storage and retrieval infrastructure&lt;&#x2F;li&gt;
&lt;li&gt;Must handle contradictions and updates&lt;&#x2F;li&gt;
&lt;li&gt;Privacy considerations for storing user data&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;retrieval-systems&quot;&gt;Retrieval Systems&lt;&#x2F;h3&gt;
&lt;p&gt;Retrieval means selective inclusion, not persistence. With &lt;em&gt;summarization&lt;&#x2F;em&gt; and &lt;em&gt;fact tracking&lt;&#x2F;em&gt;, we have stored memory, but we can’t send all of it to the LLM without adding noise. When the user asks about “cinnamon”, including “user does not like cinnamon” is useful; including “user has a dog named Oscar” is not.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;llm-memory-retrieval.svg&quot; alt=&quot;Diagram of a retrieval system ranking stored memories for the next prompt&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Retrieval systems solve this by fetching semantically relevant pieces of memory based on the current context.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;How it works:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Generate embeddings (vector representations) of stored information&lt;&#x2F;li&gt;
&lt;li&gt;Store in a vector database&lt;&#x2F;li&gt;
&lt;li&gt;On query: retrieve semantically similar pieces&lt;&#x2F;li&gt;
&lt;li&gt;Inject retrieved context into the prompt&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
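&lt;p&gt;The steps above reduce to ranking by similarity. Here’s a toy sketch using plain cosine similarity; in practice the vectors come from an embedding model and live in a vector database, both stubbed out here:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&#x2F;&#x2F; Toy vectors stand in for real embedding-model output.
type Memory = { text: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i &lt; a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot &#x2F; (Math.sqrt(na) * Math.sqrt(nb));
}

&#x2F;&#x2F; Rank stored memories by similarity to the query and keep the top k.
function retrieve(query: number[], memories: Memory[], k = 2): Memory[] {
  return [...memories]
    .map((m) =&gt; ({ m, score: cosine(query, m.vector) }))
    .sort((a, b) =&gt; b.score - a.score)
    .slice(0, k)
    .map((s) =&gt; s.m);
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;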
&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Adds retrieval latency before inference&lt;&#x2F;li&gt;
&lt;li&gt;Quality depends on embedding model and retrieval strategy&lt;&#x2F;li&gt;
&lt;li&gt;Can miss connections between non-adjacent information&lt;&#x2F;li&gt;
&lt;li&gt;Scales beyond context window limits&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Other retrieval methods exist (graph databases, keyword search), but vector-based retrieval is common and generalizes well across use cases.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Note: Retrieval systems are also widely used for accessing large corpora (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cloud.google.com&#x2F;use-cases&#x2F;retrieval-augmented-generation&quot;&gt;RAG - Retrieval-Augmented Generation&lt;&#x2F;a&gt;). That’s a related but different use case. Here we’re focused on retrieving stored memories, not general document knowledge bases.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-we-have-today&quot;&gt;What We Have Today&lt;&#x2F;h2&gt;
&lt;p&gt;Combining these techniques lets us build applications with long conversations, personalization, and large knowledge bases. We can deliver reasonable improvements to user experience.&lt;&#x2F;p&gt;
&lt;p&gt;Systems like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;mem0.ai&#x2F;&quot;&gt;Mem0&lt;&#x2F;a&gt; apply these techniques using vector stores for long-term memory. They demonstrate that production implementations are viable, though trade-offs remain.&lt;&#x2F;p&gt;
&lt;p&gt;Every technique forces serious trade-offs. The latency-cost-accuracy triangle is real. There’s no perfect solution out there, and this is very much an unsolved problem.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;open-problems&quot;&gt;Open Problems&lt;&#x2F;h2&gt;
&lt;p&gt;The techniques above are starting points, not long-term solutions. Real questions remain:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Lossy compression&lt;&#x2F;strong&gt;: When do critical details get dropped? How do you know what to keep?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Retrieval precision&lt;&#x2F;strong&gt;: Semantic search doesn’t always fetch the right data. How do we measure relevance accurately?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Determinism vs. adaptivity&lt;&#x2F;strong&gt;: Deterministic systems are debuggable but inflexible. Adaptive systems are powerful but unpredictable. Which matters more?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Evaluation&lt;&#x2F;strong&gt;: How do you test memory systems when conversations span thousands of turns and context is deeply nested and subjectively driven by user intent?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Active research areas:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Semantic compaction (compress by importance, not recency)&lt;&#x2F;li&gt;
&lt;li&gt;Learned compression (models that compress their own context)&lt;&#x2F;li&gt;
&lt;li&gt;Multi-modal memory (unified handling of images, audio, text)&lt;&#x2F;li&gt;
&lt;li&gt;Adaptive budgets (dynamic token allocation based on conversation needs)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;LLMs operate within a bounded context window. How we manage that constraint (balancing latency, cost, and accuracy) is still being figured out.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;We’re exploring these problems with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;context-store&quot;&gt;context-store&lt;&#x2F;a&gt;, infrastructure for experimenting with memory techniques. Read about &lt;a href=&quot;&#x2F;blog&#x2F;introducing-context-store&#x2F;&quot;&gt;why we’re building it&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
</description>
      </item>
      <item>
          <title>Introducing Context-Store</title>
          <pubDate>Fri, 31 Oct 2025 00:00:00 +0000</pubDate>
          <author>Sebastian Lund</author>
          <link>https://fastpaca.com/blog/introducing-context-store/</link>
          <guid>https://fastpaca.com/blog/introducing-context-store/</guid>
          <description xml:base="https://fastpaca.com/blog/introducing-context-store/">&lt;p&gt;Every team building LLM apps hits the same wall: users expect full message history, but LLMs have hard context limits.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;Users often need to review and trust the full conversation history.&lt;&#x2F;li&gt;
&lt;li&gt;LLMs need compaction to meet latency and token budgets.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Those two requirements pull in opposite directions. So we built context-store to improve LLM latency while preserving full history for your users. It’s simple, predictable, and works well for AI assistants.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-pattern-everyone-rebuilds&quot;&gt;The Pattern Everyone Rebuilds&lt;&#x2F;h2&gt;
&lt;p&gt;We kept seeing teams build the same stack:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;&#x2F;strong&gt; for hot message storage&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Postgres&lt;&#x2F;strong&gt; for persistence&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Pub&#x2F;sub&lt;&#x2F;strong&gt; for real-time updates&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Custom compaction logic&lt;&#x2F;strong&gt; to manage the window&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Every implementation is slightly different. Every team makes similar mistakes. The infrastructure becomes a distraction from the actual product.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-context-store-does&quot;&gt;What Context-Store Does&lt;&#x2F;h2&gt;
&lt;p&gt;Context-store turns that pattern into a single service. Optionally, a write-behind feature backs everything up to Postgres for analytics and cold storage.&lt;&#x2F;p&gt;
&lt;p&gt;You set a token budget. You pick a compaction policy (or write your own). The service handles:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Storing full message history&lt;&#x2F;li&gt;
&lt;li&gt;Enforcing token budgets&lt;&#x2F;li&gt;
&lt;li&gt;Compacting context automatically&lt;&#x2F;li&gt;
&lt;li&gt;Horizontal scaling: add more nodes to handle more data and traffic&lt;&#x2F;li&gt;
&lt;li&gt;Optional cold storage in Postgres&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #E1E4E8; background-color: #24292E;&quot;&gt;&lt;code data-lang=&quot;typescript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; fastpaca.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;context&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt;&amp;#39;chat_42&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;, {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  budget:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 1_000_000&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  trigger:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 0.7&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  policy: { strategy:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;#39;last_n&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;, config: { limit:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; 400&lt;&#x2F;span&gt;&lt;span&gt; } }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;});&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;await&lt;&#x2F;span&gt;&lt;span&gt; ctx.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;append&lt;&#x2F;span&gt;&lt;span&gt;({ role:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;#39;user&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;, parts: [{ type:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;#39;text&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;, text:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #9ECBFF;&quot;&gt; &amp;#39;Hi&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt; }] });&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F97583;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span&gt; {&lt;&#x2F;span&gt;&lt;span style=&quot;color: #79B8FF;&quot;&gt; messages&lt;&#x2F;span&gt;&lt;span&gt; }&lt;&#x2F;span&gt;&lt;span style=&quot;color: #F97583;&quot;&gt; = await&lt;&#x2F;span&gt;&lt;span&gt; ctx.&lt;&#x2F;span&gt;&lt;span style=&quot;color: #B392F0;&quot;&gt;context&lt;&#x2F;span&gt;&lt;span&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;example-next-js-chat&quot;&gt;Example: Next.js Chat&lt;&#x2F;h2&gt;
&lt;p&gt;We took this architecture and built a minimal chat app in Next.js that shows the pattern end-to-end: append messages, keep full history, and compact deterministically before calling your model.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;context-store&#x2F;tree&#x2F;main&#x2F;examples&#x2F;nextjs-chat&quot;&gt;Example code →&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;images&#x2F;context-store-demo.gif&quot; alt=&quot;Animated demo of context-store powering a Next.js chat with deterministic compaction&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-elixir&quot;&gt;Why Elixir&lt;&#x2F;h2&gt;
&lt;p&gt;We built this in Elixir because:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Raft consensus&lt;&#x2F;strong&gt; for distributed state is table stakes&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Process supervision&lt;&#x2F;strong&gt; makes failure handling clean&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Hot code upgrades&lt;&#x2F;strong&gt; for zero-downtime deployments&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Memory efficiency&lt;&#x2F;strong&gt; at scale&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;In short: Elixir’s concurrency model and reliability characteristics fit this problem well.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-s-next&quot;&gt;What’s Next&lt;&#x2F;h2&gt;
&lt;p&gt;Context-store is production-ready and Apache 2.0 licensed.&lt;&#x2F;p&gt;
&lt;p&gt;The research continues:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Compression techniques in token space&lt;&#x2F;li&gt;
&lt;li&gt;Semantic compaction (embedding-based strategies)&lt;&#x2F;li&gt;
&lt;li&gt;Multi-modal context handling&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;If you’re building LLM apps and this problem sounds familiar, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.fastpaca.com&quot;&gt;check out the docs&lt;&#x2F;a&gt; or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fastpaca&#x2F;context-store&quot;&gt;browse the code&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;This is part of the broader fastpaca research into hard problems at the intersection of humans and technology. More posts coming!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
</description>
      </item>
    </channel>
</rss>
