URL: https://dev.to/sirajuddin-shaik/mambassm-basics-ndh

Mamba/SSM Basics


State Space Models offer linear-time sequence modeling with content-aware selective filtering, challenging Transformers for long-context inference.

Why This Matters

State Space Models (SSMs) provide a principled alternative to Transformers for long-sequence modeling. In production systems handling long contexts (e.g., code generation, genomic analysis), Transformer attention's quadratic cost becomes a bottleneck. Mamba achieves linear-time inference with constant-memory state, making it viable for million-token contexts where attention-based models are prohibitively expensive.

Core Idea

SSMs originate from continuous-time control theory: a latent state evolves over time driven by input, and observations are linear projections of that state. Mamba's key innovation is making the SSM parameters input-selective: the model learns to gate which information enters and exits the state, mimicking attention's ability to focus on relevant tokens without the quadratic cost.

Technical Details

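The continuous-time SSM is defined as:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

where $h(t) \in \mathbb{R}^N$ is the latent state, $x(t) \in \mathbb{R}$ is the input, and $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$. Using zero-order hold discretization with step $\Delta$:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

The recurrent update becomes:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$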
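The discretization and recurrent update can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not Mamba's implementation: it assumes a diagonal state matrix with nonzero entries (so the matrix exponential reduces to an elementwise `exp`) and a single scalar input channel. The names `zoh_discretize` and `ssm_recurrence` are hypothetical.

```python
import math

def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal A (assumed nonzero entries).

    A_bar_i = exp(delta * a_i)
    B_bar_i = (exp(delta * a_i) - 1) / a_i * b_i
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar

def ssm_recurrence(a_bar, b_bar, c, xs):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C . h_t over a scalar sequence."""
    h = [0.0] * len(a_bar)          # state starts at zero
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

# Toy values chosen for illustration only.
a_diag = [-1.0, -2.0]               # negative entries keep the state stable
b = [1.0, 1.0]
c = [1.0, 0.5]
a_bar, b_bar = zoh_discretize(a_diag, b, delta=0.1)
ys = ssm_recurrence(a_bar, b_bar, c, xs=[1.0, 0.0, 0.0])
print(ys)                           # response to a single impulse decays over time
```

With negative diagonal entries, every discretized `a_bar` value lies between 0 and 1, which is why the response to the single impulse fades rather than grows.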

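Mamba's selective mechanism makes $B$, $C$, and $\Delta$ input-dependent:

$$B_t = \mathrm{Linear}_B(x_t), \qquad C_t = \mathrm{Linear}_C(x_t), \qquad \Delta_t = \mathrm{softplus}\left(\mathrm{Linear}_\Delta(x_t)\right)$$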

The parallel scan algorithm computes this recurrence in $O(L)$ work and $O(\log L)$ parallel depth during training, where $L$ is the sequence length. Inference is $O(1)$ per token with fixed state size $N$, yielding constant-memory decoding regardless of sequence length.
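A scan can parallelize the recurrence because each step `h -> a*h + b` is an affine map, and composing affine maps is associative. The sketch below shows this for a scalar state and checks the result against a plain loop. It uses the simple Hillis-Steele scheme, which takes log2(L) data-parallel rounds but does O(L log L) total work; work-efficient variants bring the work down to O(L). The function names are hypothetical, and the "parallel" rounds are simulated with list comprehensions.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first and then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(pairs):
    """Reference loop: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h, out = 0.0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out

def hillis_steele_scan(pairs):
    """Inclusive scan in log2(L) rounds; every element in a round is independent."""
    elems = list(pairs)
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # With a zero initial state, the accumulated offset term equals h_t.
    return [b for _, b in elems]

# Toy (a_t, b_t) pairs; in an SSM these would be (A_bar_t, B_bar_t * x_t).
pairs = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (1.1, 0.5), (0.7, 3.0)]
seq = sequential_scan(pairs)
par = hillis_steele_scan(pairs)
print(all(abs(s - p) < 1e-12 for s, p in zip(seq, par)))
```

The order of arguments in `combine` matters: the operator is associative but not commutative, so the earlier step must always be on the left.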

How It Works

  1. Project input: Map token $x_t$ to expanded dimension $E \cdot D$, where $D$ is the model dimension and $E$ is the expansion factor.
  2. Generate selective parameters: Compute input-dependent $B_t$, $C_t$, $\Delta_t$ from $x_t$.
  3. Discretize: Convert continuous $(A, B_t)$ to discrete $(\bar{A}_t, \bar{B}_t)$ using $\Delta_t$.
  4. Recurrent scan: Apply parallel scan (training) or sequential update (inference) to compute hidden states $h_t$.
  5. Output projection: Compute $y_t = C_t h_t$, then project through gating (SiLU) to output dimension.
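Steps 2 through 5 above can be traced in a toy, single-channel version. This is a sketch under loud assumptions rather than Mamba's actual layer: the input projection (step 1) is skipped and each `x` is treated as an already-projected scalar, the learned linear maps are replaced by hypothetical scalar or per-state weights (`w_b`, `w_c`, `w_delta`, `b_delta`, `w_gate`), and the state matrix is diagonal with nonzero entries.

```python
import math

def softplus(v):
    return math.log1p(math.exp(v))

def silu(v):
    return v / (1.0 + math.exp(-v))

def selective_ssm(xs, a_diag, w_b, w_c, w_delta, b_delta, w_gate):
    """Toy single-channel selective SSM; every weight here is hypothetical."""
    h = [0.0] * len(a_diag)
    outputs = []
    for x in xs:
        # Step 2: parameters that depend on the current input.
        b_t = [w * x for w in w_b]
        c_t = [w * x for w in w_c]
        delta_t = softplus(w_delta * x + b_delta)   # step size is always positive
        # Step 3: zero-order-hold discretization with the per-token step size.
        a_bar = [math.exp(delta_t * a) for a in a_diag]
        b_bar = [(ab - 1.0) / a * b for ab, a, b in zip(a_bar, a_diag, b_t)]
        # Step 4: sequential (inference-style) state update.
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        # Step 5: read out the state, then gate the result with SiLU.
        y = sum(c * hi for c, hi in zip(c_t, h))
        outputs.append(y * silu(w_gate * x))
    return outputs, h

outputs, final_state = selective_ssm(
    xs=[1.0, -0.5, 2.0, 0.0],
    a_diag=[-1.0, -0.5, -2.0],
    w_b=[0.3, -0.2, 0.5],
    w_c=[1.0, 0.4, -0.6],
    w_delta=0.8,
    b_delta=0.1,
    w_gate=1.5,
)
print(outputs, final_state)
```

The state carried between tokens is just `h`, whose length equals the number of entries in `a_diag`, no matter how many tokens have been processed.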

Key Insights

  • Selectivity is essential: Non-selective SSMs such as S4 apply the same dynamics to every token, so they struggle with content-based in-context retrieval; making $B$, $C$, and $\Delta$ input-dependent enables content-aware filtering.
  • Diagonal-plus-low-rank structure on $A$ (as in S4) enables an efficient recurrence; Mamba uses a diagonal $A$ exclusively.
  • Hardware-aware design: The scan kernel is IO-bound, not compute-bound, so Mamba's CUDA kernel fuses discretization, scan, and output projection to minimize memory reads.
  • Linear decoding cost: Unlike a KV cache, which grows linearly with sequence length, the SSM state has a fixed size of $N$ per channel, so generation memory is constant.
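The memory contrast in the last point can be made concrete with back-of-the-envelope arithmetic. The sketch below counts stored elements only, ignoring data type and any other small per-layer buffers. The model shape is a hypothetical configuration chosen for illustration, not a published one, and the function names are likewise hypothetical.

```python
def kv_cache_elems(seq_len, n_layers, n_heads, head_dim):
    """Cached keys and values: two tensors per layer, one row per past token."""
    return 2 * n_layers * seq_len * n_heads * head_dim

def ssm_state_elems(n_layers, d_inner, n_state):
    """Recurrent state: one length-N vector per channel per layer."""
    return n_layers * d_inner * n_state

# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=24, n_heads=16, head_dim=64, d_inner=2048, n_state=16)

for seq_len in (1_000, 100_000, 1_000_000):
    kv = kv_cache_elems(seq_len, cfg["n_layers"], cfg["n_heads"], cfg["head_dim"])
    ssm = ssm_state_elems(cfg["n_layers"], cfg["d_inner"], cfg["n_state"])
    print(f"{seq_len:>9} tokens: KV cache {kv:,} elems, SSM state {ssm:,} elems")
```

The sequence length appears only in `kv_cache_elems`, so the SSM figure stays the same on every row while the KV cache figure scales with the token count.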
