VOOZH about

URL: https://dev.to/daniel_bharath_8164530f7b/turboquant-how-a-simple-spin-saves-gigabytes-of-gpu-memory-33jl

⇱ TurboQuant: How a Simple Spin Saves Gigabytes of GPU Memory - DEV Community


Start With a Restaurant

Before we talk about AI, let me tell you about a busy restaurant

Imagine you manage a popular restaurant. Every order gets written down in full

Order #1: "One Chicken Biryani with extra raita and no onions, 
 One Butter Naan, Two Mango Lassi, Table 5, 7:30 PM"

Order #2: "Two Margherita Pizza with extra cheese and thin crust,
 One Veg Burger with no mayo, One Coke, Table 12, 7:45 PM"

Order #3: "Three Chicken Burger with extra spicy sauce,
 Two Masala Dosa, One Filter Coffee, Table 8, 8:00 PM"

Each order takes ~150 characters.On a busy Friday night with 500 orders, that's 75k characters, which is pages and pages of scribbled notes. The kitchen is drowning in paper.

Now, what if the kitchen agrees on a code system?

Codebook (shared between waiter and kitchen):

 Food: Extras: Drinks:
 CB = Chicken Biryani +R = extra raita ML = Mango Lassi
 MP = Margherita Pizza +C = extra cheese CK = Coke
 VB = Veg Burger +SP = extra spicy FC = Filter Coffee
 KB = Chicken Burger -O = no onions
 BN = Butter Naan -M = no mayo
 MD = Masala Dosa TC = thin crust

Now those same orders become:

Order #1: 1×CB+R-O 1×BN 2×ML T5 730P
Order #2: 2×MP+C,TC 1×VB-M 1×CK T12 745P
Order #3: 3×KB+SP 2×MD 1×FC T8 800P

The kitchen decodes instantly: "1×CB+R-O" -> "One Chicken Biryani, extra raita, no onions." No ambiguity, no lost information, just shorter.

Full orders: ~150 characters each × 500 orders = 75,000 characters
Coded orders: ~40 characters each × 500 orders = 20,000 characters

Compression: 3.75x smaller. Same food gets cooked.

This is exactly what TurboQuant does to AI models.

Instead of storing the full number(0.4567689.. 16/32 bits). It stores the KV cache in (1010 4 bits) that maps to a value in shared codebook. The codebook is agreed in advance. This becomes the same for every layer and every conversation

The Trick: Spin, Snap, Pack

TurboQuant compresses each K (or V) vector in the below steps. Let me walk through the entire process on how the compression and decompression happens with a real 4-dimensional vector you can verify by hand.

We'll use 2-bit quantization (4 grid levels) to keep things small. Real models use something like 4-bit (16 levels) and 256 dimensions, but the math is identical.

Our example vector

K=[1.2,-0.8,0.5,-1.1]

A normal balanced vector, no outliers, keeping it simple to understand

so 4 numbers assuming 4 bytes(float32) each
= 16 bytes
Lets compress it to 3 bytes and then try to bring it back(there will be slight errors/residual)

Step 1 : Measure the norm and Normalize

Measure the magnitude/norm of the vector

norm=sqrt(1.2**2+0.8**2+0.5**2+1.1**2)=1.8815

we save the norm as a float16 number which 2 bytes

Now divide the vector by norm to get a unit vector

unit=K/1.8815=[+0.6378,-0.4252,+0.2657,-0.5846]

Step 2 : Spin (Random Rotation)

Multiply the unit vector by a random 4×4 rotation matrix R:

Rotation matrix R (generated once from seed=42, never changes):

[+0.8395,+0.4120,-0.2597,+0.2411][+0.2956,-0.5448,-0.4830,-0.6185][-0.3277,+0.7138,-0.3802,-0.4884][-0.3171,-0.1547,-0.7448,+0.5664]

Each rotated value is a dot product of one row of R with the unit vector:

rotated[0]=(+0.8395×+0.6378)+(+0.4120×-0.4252)+(-0.2597×+0.2657)+(+0.2411×-0.5846)=+0.5355-0.1752-0.0690-0.1410=+0.1503rotated[1]=(+0.2956×+0.6378)+(-0.5448×-0.4252)+(-0.4830×+0.2657)+(-0.6185×-0.5846)=+0.1886+0.2317-0.1283+0.3615=+0.6534rotated[2]=(-0.3277×+0.6378)+(+0.7138×-0.4252)+(-0.3802×+0.2657)+(-0.4884×-0.5846)=-0.2090-0.3035-0.1010+0.2856=-0.3280rotated[3]=(-0.3171×+0.6378)+(-0.1547×-0.4252)+(-0.7448×+0.2657)+(+0.5664×-0.5846)=-0.2023+0.0658-0.1979-0.3312=-0.6655

Before and after Rotation

Before (unit vector): [+0.638, -0.425, +0.266, -0.585]
After (rotated): [+0.150, +0.653, -0.328, -0.666]

Step 3: Snap, Building the Grid

We have 4 rotated numbers: [+0.150, +0.653, -0.328, -0.666]

We need to replace each one with the nearest "allowed" value from a small set. With 2 bits, we get to pick 4 allowed values. But which 4?

Lloyd-Max is the algorithm that finds this smart placement to find the 4 grid values (skipping the math intentionally)

This generates a spread in 4 values

Final codebook: 
 Level 0: -0.674 
 Level 1: -0.219 
 Level 2: +0.219 
 Level 3: +0.674 

👁

Step 4: Snap to Grid

Now snap each rotated value to the nearest codebook level

R-Vector: [+0.150, +0.653, -0.328, -0.666]

CodeBook: [-0.674, -0.219. +0.219, +0.674]
Indices: 0 1 2 3

Lets map the values in the R-vector to the closest vectors

value0 = +0.150 -> nearest Level 2(+0.219) error = 0.069
value1 = +0.653 -> nearest Level 3(+0.674) error = 0.021 <- tiny!
value2 = -0.328 -> nearest Level 1(-0.219) error = 0.109
value3 = -0.666 -> nearest Level 0(-0.674) error = 0.009 <- almost exact!

The R-Vector gets transformed to indices

[+0.150, +0.653, -0.328, -0.666] --> [2,3,1,0]

Step 5 Pack

We need to store this indices --> [2,3,1,0] --> hex 0x1E
which needs 1 byte (00000000) to store the indices 0 to 3

so total bytes = 2 bytes (for storing norm) + 1 byte for indices

Decompression/unpacking

  • indices to numbers from the codebook [2,3,1,0] --> [+0.219, +0.674, -0.219, -0.674]
  • unrotate/unspin to get the unrotated vector by multiplying R^T(transpose)
unit_hat[0]=(+0.8395×+0.219)+(+0.2956×+0.674)+(-0.3277×-0.219)+(-0.3171×-0.674)=+0.184+0.199+0.072+0.214=+0.669unit_hat[1]=(+0.4120×+0.219)+(-0.5448×+0.674)+(+0.7138×-0.219)+(-0.1547×-0.674)=+0.090-0.367-0.156+0.104=-0.329unit_hat[2]=(-0.2597×+0.219)+(-0.4830×+0.674)+(-0.3802×-0.219)+(-0.7448×-0.674)=-0.057-0.326+0.083+0.502=+0.203unit_hat[3]=(+0.2411×+0.219)+(-0.6185×+0.674)+(-0.4884×-0.219)+(+0.5664×-0.674)=+0.053-0.417+0.107-0.382=-0.639
  • scale back with norm value
K_hat=unit_hat×1.8815K_hat=[+1.259,-0.620,+0.382,-1.202]

The Verdict: Original vs Reconstructed

 Original Reconstructed Error Relative
 ──────── ───────────── ───── ────────
dim 0: +1.200 +1.259 0.059 4.9%
dim 1: -0.800 -0.620 0.180 22.6%
dim 2: +0.500 +0.382 0.118 23.6%
dim 3: -1.100 -1.202 0.102 9.3%

Overall relative MSE: 1.7%

Look at that! We crushed a 16-byte vector down to 3 bytes and got it back with 1.7% error, but still there is some difference

Now, lets do the math with 4 bit (16 levels) instead of 2

 Original Reconstructed Error Relative
 ──────── ───────────── ───── ────────
dim 0: +1.200 +1.175 0.025 2.1%
dim 1: -0.800 -0.716 0.084 10.5%
dim 2: +0.500 +0.433 0.067 13.4%
dim 3: -1.100 -1.081 0.019 1.7%

Overall relative MSE: 0.4%

As this is a toy example we have taken, we are seeing almost good results, applying this on a real model with ~256 levels give much more better results

Real Numbers: How Much Memory Does This Save?

I implemented TurboQuant from scratch and tested it on real Gemma 4 models. Here's what we measured:

Gemma 4 E2B (5.1 billion parameters)

At 128K token context:

Without TurboQuant:
 Model weights: 10.25 GB
 KV cache: 2.25 GB
 Total: 12.50 GB

With TurboQuant (4-bit):
 Model weights: 10.25 GB (unchanged — we only compress the cache)
 KV cache: 0.58 GB
 Total: 10.83 GB

Saved: 1.67 GB of KV cache memory

Which is 3.9x

Extrapolating the numbers

Extending this 3.9x as we increase the context length and the baseline KV, these are approximations

Context Length Baseline KV TurboQuant KV Saved Compression
8K tokens 2.0 GB 0.5 GB 1.5 GB 3.9x
32K tokens 8.1 GB 2.0 GB 6.0 GB 3.9x
128K tokens 32.2 GB 8.2 GB 24.0 GB 3.9x
256K tokens 64.4 GB 16.4 GB 48.0 GB 3.9x

👁

P.S I have skipped the QJL(Quantized Johnson-Lindenstrauss) for handling the residuals

Wrapping up

Rotate, snap, pack , three operations, 3.9x less KV cache memory, same answers.

References: