URL: https://www.geeksforgeeks.org/deep-learning/grouped-query-attention-gqa/

Grouped Query Attention (GQA)

Last Updated : 26 Jun, 2025

Grouped Query Attention (GQA) is an optimization technique for transformer models that balances computational efficiency and model performance. Inspired by the multi-head attention mechanism introduced in the seminal "Attention Is All You Need" paper, GQA addresses limitations of its predecessors: multi-head attention (MHA) and multi-query attention (MQA). Below is a detailed analysis of its architecture, benchmarks and tradeoffs.

Core Architecture

Figure: Multi-head vs. grouped-query vs. multi-query attention

GQA divides query heads into G groups, each sharing a single key and value head. This contrasts with:

  • MHA: Each query head has unique key/value heads (high accuracy, high memory cost).
  • MQA: All query heads share one key/value head (lower memory cost, reduced accuracy).
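The relationship between the three schemes can be illustrated with a small mapping from query heads to key/value heads. This is a minimal sketch that assumes contiguous grouping (consecutive query heads share a KV head); the function name `kv_head_for_query_head` is hypothetical and chosen only for illustration.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the shared KV head used by a given query head.

    Assumes contiguous grouping: query heads 0..group_size-1 use KV head 0, and so on.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads  # query heads per group
    return query_head // group_size


H = 8  # number of query heads (illustrative value)
mapping_mha = [kv_head_for_query_head(h, H, 8) for h in range(H)]  # G = H: one KV head per query head
mapping_gqa = [kv_head_for_query_head(h, H, 2) for h in range(H)]  # G = 2: four query heads per KV head
mapping_mqa = [kv_head_for_query_head(h, H, 1) for h in range(H)]  # G = 1: all query heads share one KV head
```

With G = H the mapping is the identity (MHA), and with G = 1 every query head points at KV head 0 (MQA); GQA covers the values in between.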

The attention computation follows these steps:

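1. Query-Key Dot Product: For each query group, compute scaled dot products between the group's queries and its shared keys:

$$\text{scores} = \frac{Q K^\top}{\sqrt{d_k}}$$

where $d_k$ is the key dimension (scaling keeps the softmax from saturating, which would otherwise cause vanishing gradients).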

2. Softmax Normalization: Apply softmax to generate attention weights.

3. Value Weighting: Multiply weights by shared value vectors to produce contextual outputs.
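The three steps above can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: it assumes already-projected inputs with shapes `(H, n, d_k)` for queries and `(G, n, d_k)` for keys and values, contiguous grouping of query heads, and no causal mask or batching. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: (H, n, d_k); k, v: (G, n, d_k) with H divisible by G. Returns (H, n, d_k)."""
    H, n, d_k = q.shape
    G = k.shape[0]
    if H % G != 0:
        raise ValueError("H must be divisible by G")
    group_size = H // G
    # Repeat each KV head so every query head in a group sees the same keys and values.
    k_rep = np.repeat(k, group_size, axis=0)  # (H, n, d_k)
    v_rep = np.repeat(v, group_size, axis=0)  # (H, n, d_k)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d_k)  # step 1: (H, n, n)
    weights = softmax(scores, axis=-1)                    # step 2: rows sum to 1
    return weights @ v_rep                                # step 3: (H, n, d_k)


rng = np.random.default_rng(0)
H, G, n, d_k = 8, 2, 5, 16  # illustrative sizes
q = rng.standard_normal((H, n, d_k))
k = rng.standard_normal((G, n, d_k))
v = rng.standard_normal((G, n, d_k))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

Repeating the KV heads with `np.repeat` keeps the sketch short; only the G unrepeated heads would need to be stored in a KV cache, which is where the memory saving comes from.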

Performance-Cost Tradeoffs

GQA interpolates between MHA and MQA, optimizing for:

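  • Memory Bandwidth: The KV cache scales with the number of key/value heads, so GQA shrinks it by a factor of H/G relative to MHA with H heads (a reduction of up to 90%).
  • Inference Speed: 30–40% faster than MHA while retaining near-equivalent accuracy.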
  • Model Quality: Outperforms MQA in tasks like summarization and long-context processing.
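To make the memory tradeoff concrete, the KV cache size can be estimated from the number of key/value heads. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 8,192 tokens, 2 bytes per element); these numbers are assumptions for illustration, not measurements from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes; the leading 2 counts both keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element


# Hypothetical model configuration, used only for illustration.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)  # G = H = 32
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)   # G = 8
mqa = kv_cache_bytes(num_layers=32, num_kv_heads=1, head_dim=128, seq_len=8192)   # G = 1

reduction_gqa = 1 - gqa / mha  # 0.75
reduction_mqa = 1 - mqa / mha  # 0.96875
```

Under these assumptions the cache drops from 4 GiB (MHA) to 1 GiB (GQA with 8 KV heads). The reduction is always 1 - G/H, so how close it gets to 90% depends on the ratio of query heads to KV heads.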

Benchmark Comparisons

Figure: Benchmark comparison

| Method | KV Heads | Inference Speed | Accuracy (vs. MHA) | Memory Use |
|---|---|---|---|---|
| Multi-Head (MHA) | H | Baseline | 100% | Highest |
| Multi-Query (MQA) | 1 | 1.5–2× faster | ↓ 5–15% | Lowest |
| GQA (G=8) | 8 (H/8 query heads per group) | 1.3–1.4× faster | ↓ 1–3% | Medium |

Key Advantages

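1. Scalability for Long Contexts: With H query heads, G key/value heads, and key dimension $d_k$, GQA reduces KV cache memory from $O(n \cdot H \cdot d_k)$ to $O(n \cdot G \cdot d_k)$ for sequence length $n$, enabling efficient processing of long sequences (e.g., 128K tokens).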

2. Hardware Optimization: When the group count G matches the GPU count in a tensor-parallel setup, each GPU can hold exactly one key/value head without replication, so GQA delivers near-free performance gains.

3. Flexible Configuration: Adjusting G allows tuning for specific tasks:

  • Low G (e.g., G = 1 -> MQA): Best for latency-critical applications.
  • High G (e.g., G = H -> MHA): Ideal for high-accuracy scenarios.
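The hardware point in item 2 can be illustrated with a small placement calculation. This is a sketch under a simplifying assumption: KV heads are split evenly across tensor-parallel devices and are replicated when there are fewer KV heads than devices. The function `kv_heads_per_device` is hypothetical and does not correspond to any specific framework's API.

```python
def kv_heads_per_device(num_kv_heads: int, num_devices: int):
    """Return (KV heads stored per device, number of copies of each KV head).

    Assumes an even split; raises if the counts do not divide cleanly.
    """
    if num_kv_heads >= num_devices:
        if num_kv_heads % num_devices != 0:
            raise ValueError("num_kv_heads must be divisible by num_devices")
        return num_kv_heads // num_devices, 1  # each head lives on exactly one device
    if num_devices % num_kv_heads != 0:
        raise ValueError("num_devices must be divisible by num_kv_heads")
    return 1, num_devices // num_kv_heads  # heads are copied to fill every device


matched = kv_heads_per_device(num_kv_heads=8, num_devices=8)   # G equals device count
mqa = kv_heads_per_device(num_kv_heads=1, num_devices=8)       # single head copied to all devices
mha = kv_heads_per_device(num_kv_heads=32, num_devices=8)      # several heads per device
```

In this simplified model, G = 8 on 8 devices stores one KV head per device with no duplication, whereas MQA on the same devices stores the same amount per device but as eight copies of a single head.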

Enhancements and Limitations

  • Dynamic Key Grouping (DGQA): Uses key-vector norms to allocate queries adaptively, improving accuracy by up to 8% in vision transformers.
  • Suboptimal Head Configuration: Fixed grouping can underutilize hardware; recent work decouples head count from hidden dimensions for cost-optimal designs.
  • Sokoban RL Limitation: GQA has not been directly applied in reinforcement learning, but its memory-efficiency principles could in principle optimize reward-calculation modules in game-level generators (e.g., reducing tile-editing overhead).