
Most AI workflow posts are just a screenshot of a chat box and a hopeful caption.

This one is different: I ran the same local model twice on the same question, once with a raw prompt and once with a memory + retrieval stack around it.

What changed

Before:

raw prompt
no compression
no semantic retrieval
more clutter in context

After:

compressed working context
semantic retrieval from memory notes
fewer prompt tokens as the goal (this particular run ended up using more, as the numbers below show)
same model, same task, less nonsense
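To make the "after" setup concrete, here is a minimal sketch of retrieval over memory notes feeding a compact prompt. It is not the stack used for the run above: the notes are invented, the function names are hypothetical, and bag-of-words cosine similarity stands in for real embedding-based semantic retrieval so the example runs with the standard library only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, notes, k=2):
    """Return the k notes most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(notes, key=lambda n: cosine(q, Counter(tokenize(n))), reverse=True)
    return ranked[:k]


def build_prompt(question, notes, k=2):
    """Assemble a compact prompt: retrieved notes first, then the question."""
    context = "\n".join(f"- {n}" for n in retrieve(question, notes, k))
    return f"Relevant notes:\n{context}\n\nQuestion: {question}"


# Hypothetical memory notes, invented for this example.
NOTES = [
    "KV cache stores attention keys and values for past tokens.",
    "The office coffee machine is descaled on Fridays.",
    "Quantizing the KV cache to fewer bits reduces memory use.",
]

QUESTION = "How does quantizing the KV cache reduce memory?"
prompt = build_prompt(QUESTION, NOTES, k=2)
print(prompt)
```

The point of the sketch is the shape of the workflow: only the notes that overlap with the question reach the model, and the unrelated one stays out of the context.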

The measured result

From the proof pack:

Before latency: 28,590.3 ms
After latency: 25,008.9 ms
Before accuracy: 0.500
After accuracy: 1.000
Before prompt tokens: 87
After prompt tokens: 108
Memory saved: -24.1%

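That last line is the fun one. The negative value means nothing was saved: the figure matches the change in prompt tokens, (87 - 108) / 87 ≈ -24.1%, so the “after” run actually used about 24% more prompt tokens, because I tuned it to answer the question better. Token count is a tool, not a religion.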
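The deltas can be checked directly from the numbers listed above. This is a small sketch of that arithmetic; the helper name `relative_change` is my own, and the only inputs are the figures from the proof pack.

```python
def relative_change(before, after):
    """Fractional change from before to after (negative means a decrease)."""
    return (after - before) / before


latency = relative_change(28590.3, 25008.9)  # latency in ms
tokens = relative_change(87, 108)            # prompt tokens
accuracy_gain = 1.000 - 0.500                # absolute accuracy difference

print(f"Latency change: {latency:+.1%}")        # Latency change: -12.5%
print(f"Prompt token change: {tokens:+.1%}")    # Prompt token change: +24.1%
print(f"Accuracy gain: {accuracy_gain:+.3f}")   # Accuracy gain: +0.500
```

So the run was about 12.5% faster and scored higher while spending roughly 24.1% more prompt tokens, which is the same magnitude as the "Memory saved" line with the sign flipped.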

Why this matters

The model did not become magical. The workflow got smarter.

That is the whole game with KV cache compression and prompt shaping: make the task clearer, measure the result, and hold the model fixed across versions so the comparison stays honest.
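A before/after comparison like this one can be sketched as a tiny harness that holds the model fixed and varies only the prompt. Everything here is an assumption for illustration: `fake_model` is a hypothetical stand-in for a local model call, keyword matching is a guess at how accuracy might be scored, and a word count is a crude substitute for a real tokenizer. It does not reproduce the numbers above.

```python
import time


def score(answer, expected_keywords):
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def run_case(model, prompt, expected_keywords):
    """Run one prompt through the model; record latency, accuracy, prompt size."""
    start = time.perf_counter()
    answer = model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "latency_ms": latency_ms,
        "accuracy": score(answer, expected_keywords),
        "prompt_words": len(prompt.split()),  # crude stand-in for a token count
    }


def fake_model(prompt):
    """Hypothetical stand-in for a local model; swap in a real client here."""
    if "notes" in prompt.lower():
        return "Quantizing the KV cache lowers memory use."
    return "It makes the cache smaller."


EXPECTED = ["quantiz", "memory"]  # hypothetical scoring keywords
QUESTION = "How does KV cache compression help?"

before = run_case(fake_model, QUESTION, EXPECTED)
after = run_case(fake_model, "Notes: KV cache quantization.\n" + QUESTION, EXPECTED)

for name, result in (("before", before), ("after", after)):
    print(name, result)
```

Keeping the model, the question, and the scoring rule identical across both calls is what makes the two rows comparable; only the prompt changes.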

Proof pack

Before/after view
Scores panel
Terminal transcript

URL: https://dev.to/aman_sachan_126d19c4a2773/kvquant-bitforge-same-model-smarter-context-better-answer-55ff

KVQuant / BitForge: same model, smarter context, better answer (DEV Community)
