Start With a Restaurant
Before we talk about AI, let me tell you about a busy restaurant
Imagine you manage a popular restaurant. Every order gets written down in full
Order #1: "One Chicken Biryani with extra raita and no onions,
One Butter Naan, Two Mango Lassi, Table 5, 7:30 PM"
Order #2: "Two Margherita Pizza with extra cheese and thin crust,
One Veg Burger with no mayo, One Coke, Table 12, 7:45 PM"
Order #3: "Three Chicken Burger with extra spicy sauce,
Two Masala Dosa, One Filter Coffee, Table 8, 8:00 PM"
Each order takes ~150 characters.On a busy Friday night with 500 orders, that's 75k characters, which is pages and pages of scribbled notes. The kitchen is drowning in paper.
Now, what if the kitchen agrees on a code system?
Codebook (shared between waiter and kitchen):
Food: Extras: Drinks:
CB = Chicken Biryani +R = extra raita ML = Mango Lassi
MP = Margherita Pizza +C = extra cheese CK = Coke
VB = Veg Burger +SP = extra spicy FC = Filter Coffee
KB = Chicken Burger -O = no onions
BN = Butter Naan -M = no mayo
MD = Masala Dosa TC = thin crust
Now those same orders become:
Order #1: 1×CB+R-O 1×BN 2×ML T5 730P
Order #2: 2×MP+C,TC 1×VB-M 1×CK T12 745P
Order #3: 3×KB+SP 2×MD 1×FC T8 800P
The kitchen decodes instantly: "1×CB+R-O" -> "One Chicken Biryani, extra raita, no onions." No ambiguity, no lost information, just shorter.
Full orders: ~150 characters each × 500 orders = 75,000 characters
Coded orders: ~40 characters each × 500 orders = 20,000 characters
Compression: 3.75x smaller. Same food gets cooked.
This is exactly what TurboQuant does to AI models.
Instead of storing the full number(0.4567689.. 16/32 bits). It stores the KV cache in (1010 4 bits) that maps to a value in shared codebook. The codebook is agreed in advance. This becomes the same for every layer and every conversation
The Trick: Spin, Snap, Pack
TurboQuant compresses each K (or V) vector in the below steps. Let me walk through the entire process on how the compression and decompression happens with a real 4-dimensional vector you can verify by hand.
We'll use 2-bit quantization (4 grid levels) to keep things small. Real models use something like 4-bit (16 levels) and 256 dimensions, but the math is identical.
Our example vector
K=[1.2,-0.8,0.5,-1.1]
A normal balanced vector, no outliers, keeping it simple to understand
so 4 numbers assuming 4 bytes(float32) each
= 16 bytes
Lets compress it to 3 bytes and then try to bring it back(there will be slight errors/residual)
Step 1 : Measure the norm and Normalize
Measure the magnitude/norm of the vector
norm=sqrt(1.2**2+0.8**2+0.5**2+1.1**2)=1.8815
we save the norm as a float16 number which 2 bytes
Now divide the vector by norm to get a unit vector
unit=K/1.8815=[+0.6378,-0.4252,+0.2657,-0.5846]
Step 2 : Spin (Random Rotation)
Multiply the unit vector by a random 4×4 rotation matrix R:
Rotation matrix R (generated once from seed=42, never changes):
[+0.8395,+0.4120,-0.2597,+0.2411][+0.2956,-0.5448,-0.4830,-0.6185][-0.3277,+0.7138,-0.3802,-0.4884][-0.3171,-0.1547,-0.7448,+0.5664]
Each rotated value is a dot product of one row of R with the unit vector:
rotated[0]=(+0.8395×+0.6378)+(+0.4120×-0.4252)+(-0.2597×+0.2657)+(+0.2411×-0.5846)=+0.5355-0.1752-0.0690-0.1410=+0.1503rotated[1]=(+0.2956×+0.6378)+(-0.5448×-0.4252)+(-0.4830×+0.2657)+(-0.6185×-0.5846)=+0.1886+0.2317-0.1283+0.3615=+0.6534rotated[2]=(-0.3277×+0.6378)+(+0.7138×-0.4252)+(-0.3802×+0.2657)+(-0.4884×-0.5846)=-0.2090-0.3035-0.1010+0.2856=-0.3280rotated[3]=(-0.3171×+0.6378)+(-0.1547×-0.4252)+(-0.7448×+0.2657)+(+0.5664×-0.5846)=-0.2023+0.0658-0.1979-0.3312=-0.6655
Before and after Rotation
Before (unit vector): [+0.638, -0.425, +0.266, -0.585]
After (rotated): [+0.150, +0.653, -0.328, -0.666]
Step 3: Snap, Building the Grid
We have 4 rotated numbers: [+0.150, +0.653, -0.328, -0.666]
We need to replace each one with the nearest "allowed" value from a small set. With 2 bits, we get to pick 4 allowed values. But which 4?
Lloyd-Max is the algorithm that finds this smart placement to find the 4 grid values (skipping the math intentionally)
This generates a spread in 4 values
Final codebook:
Level 0: -0.674
Level 1: -0.219
Level 2: +0.219
Level 3: +0.674
Step 4: Snap to Grid
Now snap each rotated value to the nearest codebook level
R-Vector: [+0.150, +0.653, -0.328, -0.666]
CodeBook: [-0.674, -0.219. +0.219, +0.674]
Indices: 0 1 2 3
Lets map the values in the R-vector to the closest vectors
value0 = +0.150 -> nearest Level 2(+0.219) error = 0.069
value1 = +0.653 -> nearest Level 3(+0.674) error = 0.021 <- tiny!
value2 = -0.328 -> nearest Level 1(-0.219) error = 0.109
value3 = -0.666 -> nearest Level 0(-0.674) error = 0.009 <- almost exact!
The R-Vector gets transformed to indices
[+0.150, +0.653, -0.328, -0.666] --> [2,3,1,0]
Step 5 Pack
We need to store this indices --> [2,3,1,0] --> hex 0x1E
which needs 1 byte (00000000) to store the indices 0 to 3
so total bytes = 2 bytes (for storing norm) + 1 byte for indices
Decompression/unpacking
- indices to numbers from the codebook [2,3,1,0] --> [+0.219, +0.674, -0.219, -0.674]
- unrotate/unspin to get the unrotated vector by multiplying R^T(transpose)
unit_hat[0]=(+0.8395×+0.219)+(+0.2956×+0.674)+(-0.3277×-0.219)+(-0.3171×-0.674)=+0.184+0.199+0.072+0.214=+0.669unit_hat[1]=(+0.4120×+0.219)+(-0.5448×+0.674)+(+0.7138×-0.219)+(-0.1547×-0.674)=+0.090-0.367-0.156+0.104=-0.329unit_hat[2]=(-0.2597×+0.219)+(-0.4830×+0.674)+(-0.3802×-0.219)+(-0.7448×-0.674)=-0.057-0.326+0.083+0.502=+0.203unit_hat[3]=(+0.2411×+0.219)+(-0.6185×+0.674)+(-0.4884×-0.219)+(+0.5664×-0.674)=+0.053-0.417+0.107-0.382=-0.639
- scale back with norm value
K_hat=unit_hat×1.8815K_hat=[+1.259,-0.620,+0.382,-1.202]
The Verdict: Original vs Reconstructed
Original Reconstructed Error Relative
──────── ───────────── ───── ────────
dim 0: +1.200 +1.259 0.059 4.9%
dim 1: -0.800 -0.620 0.180 22.6%
dim 2: +0.500 +0.382 0.118 23.6%
dim 3: -1.100 -1.202 0.102 9.3%
Overall relative MSE: 1.7%
Look at that! We crushed a 16-byte vector down to 3 bytes and got it back with 1.7% error, but still there is some difference
Now, lets do the math with 4 bit (16 levels) instead of 2
Original Reconstructed Error Relative
──────── ───────────── ───── ────────
dim 0: +1.200 +1.175 0.025 2.1%
dim 1: -0.800 -0.716 0.084 10.5%
dim 2: +0.500 +0.433 0.067 13.4%
dim 3: -1.100 -1.081 0.019 1.7%
Overall relative MSE: 0.4%
As this is a toy example we have taken, we are seeing almost good results, applying this on a real model with ~256 levels give much more better results
Real Numbers: How Much Memory Does This Save?
I implemented TurboQuant from scratch and tested it on real Gemma 4 models. Here's what we measured:
Gemma 4 E2B (5.1 billion parameters)
At 128K token context:
Without TurboQuant:
Model weights: 10.25 GB
KV cache: 2.25 GB
Total: 12.50 GB
With TurboQuant (4-bit):
Model weights: 10.25 GB (unchanged — we only compress the cache)
KV cache: 0.58 GB
Total: 10.83 GB
Saved: 1.67 GB of KV cache memory
Which is 3.9x
Extrapolating the numbers
Extending this 3.9x as we increase the context length and the baseline KV, these are approximations
| Context Length | Baseline KV | TurboQuant KV | Saved | Compression |
|---|---|---|---|---|
| 8K tokens | 2.0 GB | 0.5 GB | 1.5 GB | 3.9x |
| 32K tokens | 8.1 GB | 2.0 GB | 6.0 GB | 3.9x |
| 128K tokens | 32.2 GB | 8.2 GB | 24.0 GB | 3.9x |
| 256K tokens | 64.4 GB | 16.4 GB | 48.0 GB | 3.9x |
P.S I have skipped the QJL(Quantized Johnson-Lindenstrauss) for handling the residuals
Wrapping up
Rotate, snap, pack , three operations, 3.9x less KV cache memory, same answers.
References:
For further actions, you may consider blocking this person and/or reporting abuse
