Session three of a bug hunt in CPython's garbage collector. Two sessions in, I had what felt like a solid map: the exact call chain from PyObject_GC_Del through the generational collector, the non-obvious invariant around finalizer ordering, the three files where the relevant logic lived. Then /compact fired.
The summary said something like: "we were investigating CPython's garbage collector, specifically the interaction between finalizers and the generational GC." Accurate. Useless. The exact function signatures were gone. The specific line numbers were gone. The invariant that took two sessions to understand — compressed to one sentence that had lost all the nuance. The next 20 minutes: re-reading files to rebuild what I already knew.
This post is about what I learned from that, and from the working memory system I built to prevent it. Part 1 covered the indexing layer — how Vectr finds things in a codebase semantically. This part covers what happens after you find something: how to keep the knowledge alive across session boundaries, why my initial design was wrong in a fundamental way, and what actually works.
Part 1: The Problem With /compact
What /compact Actually Destroys
Most people treat /compact as "clear the context to keep going." That framing is roughly correct but understates the damage. The issue isn't just that context gets shorter — it's that the compression is lossy in exactly the cases where being wrong is most expensive.
/compact works by asking the AI to summarize the current conversation, then replacing the full history with that summary. Token count drops from (say) 180,000 to 12,000. Here's what the summary doesn't preserve:
Exact function signatures. A summary might say "the function takes a path and a flag." The conversation had def process_workspace_changes(path: Path, db: Database, *, force: bool = False) -> list[ChangeResult]. The difference between those two descriptions is the difference between a valid call site and a runtime error.
Specific line numbers. "The resolver module" and /src/workspace/resolver.rs:214 are not the same precision. You can reconstruct the file path, but it costs you a tool call.
Non-obvious behavioral invariants. If you spent three turns establishing that acquire_lock() must be called before touching workspace metadata because there's a race condition with the filesystem watcher, that three-turn understanding might survive as "be careful with locking." The exact invariant — the one that matters when you're writing the code — is gone.
The reasoning chain. Sometimes the value of an exploration session isn't the final answer but the chain of observations that produced it. Summaries discard chains. They keep endpoints.
Key insight: Summaries are fine for preserving topics and general direction. They fail specifically at exact signatures, line numbers, and subtle behavioral invariants — which is also where being wrong is most expensive. A summary of "be careful with locking" covers the topic. It doesn't tell you which function must be called first, or why, or what breaks if you get it wrong.
In the CPython scenario, re-establishing the finalizer ordering invariant from scratch means re-reading several files and re-following a non-obvious call chain — roughly 15–20 minutes of work that was already done. A note stored at the end of session two takes about a minute to write and ten milliseconds to retrieve.
Why You Can't Tell the AI to Just Forget Things
When I started building Vectr's memory layer, I had a clean model: the AI finds something useful, stores it with vectr_remember, then drops the file from its context window. The note is 50 tokens. The file was 800 tokens. Net gain: 750 tokens freed for new content. I called this "context offload."
I built it this way. I wrote documentation describing it this way. I designed vectr_evict_hint entirely around it.
It doesn't work.
The KV cache is append-only. Think of the transformer's memory as a lookup table it builds as it reads each token. For each token it processes, it computes a key-value representation that gets stored at each attention layer. Every subsequent token attends back to every previous token through these cached representations — that's how earlier context influences later output.
Once a token's representation is computed and cached, it stays until the context is cleared. There is no mechanism to evict specific tokens by instruction. "You can drop chunk X from your context window" is itself processed as tokens — added to the cache, not used to remove other entries from it.
A subtlety worth naming: the KV cache is maintained server-side by the inference provider. What you see as "context window usage" is a count of tokens in the current conversation, not a direct readout of GPU memory. The principle holds regardless: every token in the conversation occupies a slot in the cache, and you cannot remove individual tokens from a running session without ending or compressing the whole thing.
The KV cache memory cost formula:
KV cache size = 2 × L × n_heads × d_head × T × bytes_per_float
For a representative mid-size model: L=32 layers, n_heads=32, d_head=128, T=50,000 tokens at fp16 (2 bytes):
2 × 32 × 32 × 128 × 50,000 × 2 ≈ 26.2 GB
The cache grows linearly with sequence length T. No selective removal. The only operations that genuinely reduce context are ending the session (total loss) or using /compact (precision loss). Provider-side prefix caching stores stable prefix representations like system prompts to avoid recomputing them, but it doesn't remove anything from your active context budget.
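To make the arithmetic easy to re-run with other model shapes, here is a minimal sketch of the formula as a function. The function name and parameter names are my own, and it assumes plain multi-head attention, where every head caches its own keys and values.

```python
def kv_cache_bytes(layers: int, n_heads: int, d_head: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * layers * n_heads * d_head * tokens * bytes_per_value


per_token = kv_cache_bytes(32, 32, 128, tokens=1)
total = kv_cache_bytes(32, 32, 128, tokens=50_000)

print(f"{per_token:,} bytes per token")            # 524,288 bytes per token
print(f"{total / 1e9:.1f} GB for 50,000 tokens")   # 26.2 GB for 50,000 tokens
```

Doubling `tokens` doubles the result, which is the linear growth in T.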
I measured context window usage before and after sequences of vectr_remember + vectr_evict_hint calls: essentially unchanged. The hint was adding tokens to the cache while accomplishing nothing at the context management level. In some cases it made things marginally worse.
Warning: Any tool or documentation claiming "store to external memory to free context budget" is describing something the system cannot deliver. Tokens in a live context window cannot be selectively evicted. Working memory tools are genuinely valuable — but not for freeing active context. Building around that claim confuses your benchmarks and misleads anyone using the tool.
Part 2: What Working Memory Actually Does
Three Tiers of Value
Once I dropped the context-offload framing, the actual value of vectr_remember became clear. It operates on three time horizons:
Tier 1 — In-session re-read avoidance. Within a single session, before any /compact: recalling a stored note costs ~50 tokens instead of re-reading the original file at ~600 tokens. Real savings, but the file is still sitting in your context window anyway. Genuinely useful, but not the reason to build this.
Tier 2 — /compact survival. When /compact compresses the conversation, notes stored on disk (SQLite + ChromaDB) are untouched. Exact signatures and behavioral invariants survive verbatim. The session resumes from actual precision. This is where the system earns its cost.
Tier 3 — Cross-session persistence. Between separate sessions — the editor closed and reopened — the AI starts with nothing. Notes survive. A new session calling vectr_status() + vectr_recall() recovers findings from sessions ago without re-reading a single file. Each session builds on the ones before it.
Analogy — The surgeon's notes: A surgeon takes detailed notes before starting a complex procedure. Halfway through, an emergency calls them away for two hours. When they return: (a) their notes are on the desk — exact measurements, named vessels, where they left off; or (b) a colleague wrote a summary: "patient is partially through a vascular procedure, some complications noted." Option (b) is dangerous. Option (a) lets you continue precisely.
vectr_remember is option (a). /compact without notes is option (b).
Tier 3 compounds in a way that's easy to underestimate. The first session on a complex codebase pays the discovery cost. The second benefits from the first session's notes. By the tenth session, a well-maintained note store is a persistent model of the codebase that makes every session faster.
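The reason Tier 2 and Tier 3 work is unglamorous: the notes live in a file, not in the conversation. A minimal sketch of that property, using only the standard library's sqlite3. The table layout and the `open_store` helper are assumptions for illustration, not Vectr's actual schema.

```python
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a note store backed by a file on disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes "
        "(id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
    )
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "notes.db")

# Session 1: store a finding verbatim, then the session ends.
session_one = open_store(db_path)
session_one.execute(
    "INSERT INTO notes (content) VALUES (?)",
    ("acquire() must be called BEFORE touching workspace metadata",),
)
session_one.commit()
session_one.close()

# Session 2: nothing carried over in memory, same file on disk.
session_two = open_store(db_path)
recovered = [row[0] for row in session_two.execute("SELECT content FROM notes")]
session_two.close()

print(recovered)
```

The second connection knows nothing about the first, yet it reads back the exact wording of the invariant rather than a summary of it.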
What to Store and How
Don't store file pointers. "See resolver.rs:214 for the lock implementation" is a bad note. File paths change during refactoring. Line numbers drift with every edit. A pointer hasn't captured what you learned — it's a reference. When you recall it, you still have to read the file.
Store the finding itself:
WorkspaceLock: defined at resolver.rs:214 (as of 2026-06-08)
- acquire(): blocks if .vectr_lock exists; writes current PID + timestamp
- release(): validates PID match before deleting lock file
  (returns Err if mismatch; this is intentional, not a bug)
- CRITICAL: acquire() must be called BEFORE touching workspace
  metadata. The filesystem watcher reads metadata; touching it
  without holding the lock fires an invalid re-index.
  This caused the race condition in issue #1247.
Key callsites: workspace.rs:89 (init), daemon.rs:203 (shutdown)
This note is ~120 tokens. Reading the relevant files to reconstruct this knowledge would cost 600+ tokens plus two turns. The note captures the actual insight — the non-obvious invariant about lock order — not just a pointer.
Priority and tags are not cosmetic. priority affects recall ordering: high-priority notes rank higher when multiple notes match a query with similar scores. tags enable filtered recall — vectr_recall(query="locking", tags=["concurrency"]) returns only notes tagged with "concurrency" that semantically match the query. In a large note store accumulated over months, filtering by subsystem makes recall precise.
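A minimal sketch of how tag filtering and priority tie-breaking could fit together. This is not Vectr's ranking code: the `ScoredNote` type, the `rank_notes` function, and the 0.05 "similar score" bucket are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass
class ScoredNote:
    content: str
    score: float                       # similarity from the vector search, higher is better
    priority: str = "medium"
    tags: list[str] = field(default_factory=list)


def rank_notes(hits, tags=None, score_bucket=0.05):
    """Filter by tag, then order by similarity with priority as tie-breaker."""
    if tags:
        hits = [n for n in hits if set(tags) & set(n.tags)]
    # Scores that round to the same bucket count as "similar";
    # among those, the higher-priority note comes first.
    return sorted(
        hits,
        key=lambda n: (-round(n.score / score_bucket), PRIORITY_RANK[n.priority], -n.score),
    )


hits = [
    ScoredNote("watcher debounce interval", 0.82, "low", ["concurrency"]),
    ScoredNote("lock order invariant", 0.81, "high", ["concurrency"]),
    ScoredNote("CLI flag parsing", 0.60, "high", ["cli"]),
]
ranked = rank_notes(hits, tags=["concurrency"])
print([n.content for n in ranked])
# ['lock order invariant', 'watcher debounce interval']
```

The low-priority note has the marginally higher score, but the two scores are close enough that priority decides the order. The "cli" note never appears because the tag filter runs first.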
Part 3: The Bugs That Shaped the Design
The B9 Bug: When Recall Doesn't Recall
For several early benchmark runs, vectr_recall was firing in implementation sessions but returning nothing useful — 0 relevant results across 5 separate sessions on CPython tasks, even though the research session had stored detailed notes about exactly the functions being modified.
Root cause: recall was using SQL LIKE queries, not semantic search.
# The broken implementation (pre-B9)
def recall(query: str) -> list[Note]:
    return db.execute(
        "SELECT * FROM notes WHERE content LIKE ? LIMIT 20",
        (f"%{query}%",)
    ).fetchall()
SQL LIKE is substring matching. vectr_recall("garbage collector finalizer ordering") would only return notes containing that exact string. A note about PyObject_GC_Del describing finalizer behavior — stored with different wording in a different session — wouldn't match.
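The failure is easy to reproduce with an in-memory SQLite database. This is a small sketch, not the original benchmark: the note text and the `like_recall` helper are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (content TEXT)")
conn.execute(
    "INSERT INTO notes VALUES (?)",
    ("PyObject_GC_Del: notes on how finalizers are ordered during collection",),
)


def like_recall(query: str) -> list[str]:
    """Substring recall, mirroring the pre-B9 query."""
    rows = conn.execute(
        "SELECT content FROM notes WHERE content LIKE ? LIMIT 20",
        (f"%{query}%",),
    ).fetchall()
    return [r[0] for r in rows]


miss = like_recall("garbage collector finalizer ordering")  # same topic, different wording
hit = like_recall("finalizers are ordered")                 # literal substring of the note

print(len(miss), len(hit))  # 0 1
```

The first query is about exactly what the note describes and still returns nothing, because no note contains that phrase character for character.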
The fix: use the ChromaDB vector store for recall. Notes are embedded when stored, retrieved by semantic similarity when recalled.
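# The correct implementation (post-B9)
def recall(query: str, tags: list[str] | None = None) -> list[Note]:
    results = chroma_collection.query(
        query_texts=[query],
        n_results=10,
        where={"tags": {"$in": tags}} if tags else None,
    )
    # query() returns a dict of parallel lists, one inner list per query text,
    # so pair up the ids, documents, and metadata for the single query.
    hits = zip(results["ids"][0], results["documents"][0], results["metadatas"][0])
    # Each r is an (id, document, metadata) tuple.
    return [Note.from_chroma(r) for r in hits]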
Impact was immediate: vectr_recall fired with relevant results in 4 of 6 implementation sessions in the CPython re-run, compared to 0 of 6 before. This bug sat undetected because the initial benchmark design didn't make empty recalls visible. Per-tool logging — "vectr_recall called 5 times, 5 empty responses" — made it obvious.
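A minimal sketch of the kind of per-tool logging that would have surfaced this earlier. The decorator, the counter, and the stub recall function are all hypothetical; the point is only that empty responses get counted separately from calls.

```python
from collections import Counter
from functools import wraps

tool_stats = Counter()


def track_empty(tool_name):
    """Count calls and empty results for a tool function."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            tool_stats[f"{tool_name}.calls"] += 1
            if not result:
                tool_stats[f"{tool_name}.empty"] += 1
            return result
        return wrapper
    return decorator


@track_empty("vectr_recall")
def recall_stub(query: str) -> list[str]:
    # Stand-in for a real recall that never matches anything.
    return []


def summary(tool_name: str) -> str:
    calls = tool_stats[f"{tool_name}.calls"]
    empty = tool_stats[f"{tool_name}.empty"]
    return f"{tool_name} called {calls} times, {empty} empty responses"


for q in ["finalizer ordering", "gc generations", "PyObject_GC_Del"]:
    recall_stub(q)

print(summary("vectr_recall"))  # vectr_recall called 3 times, 3 empty responses
```

A tool that is called often and never returns anything looks healthy if you only log call counts. The second counter is what turns it into a visible failure.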
Warning: SQL LIKE requires the query string to be a literal substring of the stored content. For anything more than exact-match lookup, it's not just suboptimal — it's functionally broken for most real queries.
vectr_evict_hint: What It Actually Does After the Reframe
After fixing the context-offload misconception, I kept vectr_evict_hint but reframed it completely. What it actually does: it tracks the cumulative token cost of all code chunks Vectr has retrieved in the current session. When this cost crosses a threshold (40K tokens or 20 tool calls — whichever fires first), it appends a hint:
[vectr_evict_hint] You've retrieved ~42,000 tokens of indexed chunks
this session. The following chunks are fully indexed and re-retrievable
in <50ms — no need to re-read these files later:
- resolver.rs:214 WorkspaceLock::acquire (retrieved 8 turns ago)
- resolver.rs:267 WorkspaceLock::release (retrieved 8 turns ago)
- workspace.rs:89 init call site (retrieved 5 turns ago)
Consider calling vectr_remember now if you have key findings you
haven't stored yet.
The word is "re-retrievable," not "droppable." The hint doesn't claim to free tokens. It tells the AI: these files are in the index, you can get them back in under 50ms if you need them — don't re-read out of caution when you already have what you need or could re-search instantly. It's a behavioral nudge, not a memory management operation.
The threshold values are modeled on MemGPT (arXiv:2310.08560), which issues a memory-pressure warning to the model at roughly 70% context fill, and they are set to fire before the "lost in the middle" degradation described by Liu et al. (arXiv:2307.03172) becomes a factor. Using a disjunction (whichever threshold is reached first triggers the hint) keeps it from firing too late on sessions that accumulate few large files but many small searches.
Lost in the middle: LLM performance on retrieval tasks follows a U-shaped curve over context position — accuracy highest at the beginning and end, degrading for content in the middle. The evict_hint threshold is set to fire before relevant information drifts into that degraded zone.
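The threshold rule described above can be sketched as a small tracker. The 40K-token and 20-call values are from this post; the class name, the `record` method, and the choice of `>=` for "crosses" are assumptions, and the sketch does not model how often the hint repeats once triggered.

```python
from dataclasses import dataclass, field

TOKEN_THRESHOLD = 40_000   # cumulative tokens of retrieved chunks
CALL_THRESHOLD = 20        # number of retrieval tool calls


@dataclass
class RetrievalTracker:
    tokens: int = 0
    calls: int = 0
    chunks: list[str] = field(default_factory=list)

    def record(self, chunk_label: str, token_cost: int) -> bool:
        """Record one retrieval; return True when a hint should be appended."""
        self.tokens += token_cost
        self.calls += 1
        self.chunks.append(chunk_label)
        # Disjunction: whichever threshold is reached first triggers the hint.
        return self.tokens >= TOKEN_THRESHOLD or self.calls >= CALL_THRESHOLD


# A few large files: the token threshold fires on the third retrieval.
big = RetrievalTracker()
fired_big = [big.record(f"file{i}", 15_000) for i in range(3)]
print(fired_big)  # [False, False, True]

# Many small searches: only 4,000 tokens, but the call threshold fires.
small = RetrievalTracker()
fired_small = [small.record(f"search{i}", 200) for i in range(20)]
print(fired_small.index(True), small.tokens)  # 19 4000
```

With a token-only threshold, the second session would never get a hint, even after twenty separate retrievals.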
Part 4: The Mechanics of Actually Using It
The Save-Moment Problem
Knowing notes are valuable doesn't make the AI store them. In early sessions, vectr_remember call rates were low — not because the AI couldn't see the tool, but because there was no clear trigger for "now is the moment to save this."
Saving notes is a habit humans develop from experiencing loss. An AI editor in session 1 has never lost anything to /compact here — it's optimizing for the task in front of it, not a compression event that might happen three hours from now.
The solution: making the save-moment explicit and concrete in the CLAUDE.md template that Vectr writes into a workspace.
**The moment you find a key definition, pattern, or non-obvious detail:**
call vectr_remember(content, tags=[...], priority="high"|"medium"|"low")
— store the actual code block or finding, not a file pointer.
Treat every vectr_search or vectr_locate call as a **pair**: search,
then immediately save the key finding before your next retrieval.
If /compact runs later, the conversation summary loses exact signatures
and line numbers — your note does not.
"Pair every search with a save" turned out to be the most effective framing. Not "save when it feels important" (too vague), but "pair every retrieval with a note" (concrete, immediate trigger). Sessions that stored the most notes also had the lowest re-discovery costs in subsequent tasks.
When not to search: the SR-RAG finding. The pair pattern addresses when to save. There's a complementary question that ended up in the same CLAUDE.md template: when to search at all. Before calling vectr_search on a well-known API or framework, the AI should first write out what it already knows and only search if genuine gaps remain.
This comes from SR-RAG (arXiv:2504.01018). The finding: models often retrieve information already baked in from training, adding token cost without improving answer quality. Writing out what you already know before searching reduces unnecessary calls by 26–40% on familiar codebases. On an unfamiliar codebase, the AI's training knowledge rarely applies — every search turns up something new. On well-known frameworks, training knowledge is often more accurate than indexed documentation. The verbalization step surfaces which situation you're actually in.
Snapshots: Checkpointing an Investigation
Beyond individual notes, there's a use case for checkpointing entire session states. vectr_snapshot("lock-subsystem-mapped") seals the current note set under a named label with a timestamp. vectr_snapshot_list() at session start shows all checkpoints.
Typical multi-session workflow:
- Exploration sessions: explore, call vectr_remember on each key finding. Pair every search with a save.
- Exploration complete: vectr_snapshot("exploration-complete"). Seals the note state for this phase.
- Implementation sessions: vectr_status() → vectr_recall(query) → build on the snapshot.
- Implementation done: vectr_snapshot("implementation-done"). Two named checkpoints marking the arc.
- Revisiting months later: vectr_snapshot_list() shows the investigation history. The snapshot timestamp tells you which notes were established before a given change.
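The workflow above can be sketched with an in-memory stand-in. This is a toy model under stated assumptions: the `NoteStore` and `Snapshot` classes are hypothetical, and a snapshot here is simply a label, a timestamp, and a frozen copy of the note IDs that existed when it was sealed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    label: str
    created_at: datetime
    note_ids: tuple[int, ...]   # frozen copy of the note set at seal time


class NoteStore:
    def __init__(self) -> None:
        self.notes: dict[int, str] = {}
        self.snapshots: list[Snapshot] = []

    def remember(self, content: str) -> int:
        note_id = len(self.notes) + 1
        self.notes[note_id] = content
        return note_id

    def snapshot(self, label: str) -> Snapshot:
        snap = Snapshot(label, datetime.now(timezone.utc), tuple(self.notes))
        self.snapshots.append(snap)
        return snap

    def snapshot_list(self) -> list[str]:
        return [s.label for s in self.snapshots]


store = NoteStore()
store.remember("lock order invariant")
explored = store.snapshot("exploration-complete")
store.remember("fix applied in daemon shutdown path")
done = store.snapshot("implementation-done")

print(store.snapshot_list())              # ['exploration-complete', 'implementation-done']
print(explored.note_ids, done.note_ids)   # (1,) (1, 2)
```

Because each snapshot copies the IDs rather than pointing at the live store, notes added during implementation do not leak back into the exploration checkpoint.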
When Notes Are Wrong: vectr_forget
Notes can be wrong. A note about function behavior written before a refactor may describe the old behavior. Stale notes are worse than no notes — false confidence in outdated information.
vectr_forget(note_id) deletes it. Every vectr_recall response includes note IDs alongside the content so you can act on them inline. The workflow: recall → verify against current code → forget the stale note → store the updated one.
Vectr also appends a [STALE] marker automatically when a file path extracted from a note's content no longer exists in the workspace. The extraction is a regex scan for path-like strings — when those paths disappear from the file tree, the note gets flagged. It only catches path-level staleness, not behavioral changes in files that kept their names.
Warning: The [STALE] marker fires when a referenced file path disappears. It does NOT fire when file content changes. If a refactor renames or deletes the file, the note gets flagged; if a refactor changes the logic but keeps the file name, the note gets no warning. Always verify behavioral notes against current code before acting on them for implementation work.
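A minimal sketch of path-level staleness detection, including the blind spot the warning describes. The regex and helper names are assumptions, not Vectr's actual pattern; this one only recognizes relative paths ending in a short file extension.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: a path-like token ending in a 1-4 letter extension.
PATH_PATTERN = re.compile(r"\b[\w./-]+\.[A-Za-z]{1,4}\b")


def referenced_paths(note: str) -> set[str]:
    return set(PATH_PATTERN.findall(note))


def mark_stale(note: str, workspace: Path) -> str:
    """Append [STALE] if any path mentioned in the note is missing on disk."""
    missing = [p for p in referenced_paths(note) if not (workspace / p).exists()]
    return f"{note} [STALE]" if missing else note


workspace = Path(tempfile.mkdtemp())
(workspace / "resolver.rs").write_text("// lock implementation\n")

fresh = mark_stale("WorkspaceLock defined at resolver.rs:214", workspace)
stale = mark_stale("init call site at workspace.rs:89", workspace)

# Rewrite the file's contents without renaming it: the check cannot see this.
(workspace / "resolver.rs").write_text("// logic rewritten\n")
unnoticed = mark_stale("WorkspaceLock defined at resolver.rs:214", workspace)

print(fresh)      # WorkspaceLock defined at resolver.rs:214
print(stale)      # init call site at workspace.rs:89 [STALE]
print(unnoticed)  # WorkspaceLock defined at resolver.rs:214
```

The third result is the case to worry about: the file was rewritten, the note may now be wrong, and nothing flags it.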
The Design Principle I'd Rephrase
Looking back at the original Vectr documentation for working memory, almost every sentence led with the wrong framing. "Store to vectr, then drop from context." "Offload findings to free context budget." "Context offload layer." Every one of these is technically false, and I shipped all of them.
The correct version is shorter: store findings now so you can recall them precisely later. Through /compact. Through a new session. Through however many turns separate the discovery from the moment you need to use it. The value is in the later. The storing is cheap. The recalling is where you get the hours back.
If I were writing the documentation from scratch I'd lead with the /compact scenario: the specific moment when a detailed understanding of a complex system compresses into a three-sentence summary that can't be acted on. That's the moment where a stored note is worth far more than it cost to write.
What's Next
The part I haven't answered yet: does any of this actually save time? Not in the abstract — in real benchmarks, on real codebases, compared against an AI editor with no indexing and no memory. The number I care about is not total session cost (which includes upfront research overhead that inflates the naive comparison) but re-discovery cost per task across repeated sessions on the same codebase.
Part 3 covers that measurement — including why the total sprint cost comparison is almost exactly the wrong metric to report, and what the data from CPython, Django, and Apache Camel actually showed once I separated research overhead from implementation savings.
If you want to try Vectr now, the tool page has setup instructions. The full working memory layer — vectr_remember, vectr_recall, vectr_snapshot, vectr_forget — is in the current release alongside the semantic search tools from Part 1.
References
- Packer et al., MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560, 2023
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172, 2023
- Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization, arXiv:2504.01018, 2025
- Vaswani et al., Attention Is All You Need, NeurIPS 2017
- A Survey on LLM Acceleration Based on KV Cache Management, arXiv:2412.19442, 2024
