
URL: https://deepwiki.com/inclusionAI/AReaL/8.6-context-parallel-(ulysses)

Context Parallel (Ulysses) | inclusionAI/AReaL | DeepWiki

Last indexed: 7 May 2026 (2e12c1)

Context Parallel (Ulysses)

Purpose and Scope

This page documents the Context Parallel (CP) feature in AReaL, implemented using Ulysses sequence parallelism. Context Parallel distributes long sequences across multiple GPUs to enable training with context lengths that would otherwise exceed single-GPU memory constraints. It is particularly critical for long-context RL alignment where sequences can reach 128K+ tokens.

Sources: areal/models/fsdp/ulysses.py:1-13, areal/models/transformers/ulyssess_patch.py:1-16


Overview

Context Parallel (CP) in AReaL uses the Ulysses algorithm to partition input sequences across multiple GPUs along the sequence dimension. Unlike Tensor Parallel (TP), which shards weights, Ulysses shards activation tensors: they are split along the sequence dimension outside attention and temporarily re-sharded along the head dimension while attention is computed.

Key characteristics:

  • Sequence Sharding: Splits sequences into equal chunks across CP ranks.
  • All-to-All Collectives: Uses all_to_all to redistribute data between the sequence dimension and the head dimension during attention (areal/models/transformers/ulyssess_patch.py:50-52).
  • Divisibility Requirements: The sequence length must be divisible by context_parallel_size, and the number of attention heads must be divisible by the CP size (areal/models/transformers/ulyssess_patch.py:163-173).
  • Backend Support: Integrated into FSDPEngine, MegatronEngine, and ArchonEngine.

Sources: areal/models/transformers/ulyssess_patch.py:36-70, areal/models/fsdp/ulysses.py:31-51
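
The sequence-sharding and divisibility points above can be illustrated with a small, framework-free sketch. The helper names pad_to_multiple and shard_sequence are hypothetical and are not AReaL functions; plain Python lists stand in for tensors, and right-padding with a pad id of 0 is an assumption made only for this example.

```python
from typing import List, Tuple


def pad_to_multiple(tokens: List[int], cp_size: int, pad_id: int = 0) -> Tuple[List[int], int]:
    """Right-pad `tokens` so that its length is divisible by `cp_size`."""
    pad_len = (-len(tokens)) % cp_size
    return tokens + [pad_id] * pad_len, pad_len


def shard_sequence(tokens: List[int], cp_size: int) -> List[List[int]]:
    """Split an already-divisible sequence into `cp_size` equal, contiguous chunks."""
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    chunk = len(tokens) // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]


tokens = list(range(1, 11))                      # 10 tokens, not divisible by 4
padded, pad_len = pad_to_multiple(tokens, cp_size=4)
shards = shard_sequence(padded, cp_size=4)       # one chunk per CP rank
```

Here the 10-token sequence is padded by 2 tokens to length 12, and each of the 4 ranks holds 3 tokens. The padded positions must later be masked out of the loss.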


Configuration

Context Parallel is configured via the context_parallel_size parameter. In AReaL's allocation_mode syntax, this is represented by the c prefix.

| Parameter | Type | Default | Description |
|---|---|---|---|
| context_parallel_size | int | 1 | Number of GPUs to split sequences across. |
| shard_vision_across_sp | bool | False | Whether to distribute vision encoder work across CP ranks for VLMs (areal/models/transformers/ulyssess_patch.py:152). |

Example allocation_mode:

  • fsdp:d2t2c2: 2 Data Parallel replicas, 2 Tensor Parallel ranks, 2 Context Parallel ranks.
  • archon:d1t1c4: 4-way Context Parallel on a single DP replica.

Sources: areal/models/transformers/ulyssess_patch.py:149-153
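
To make the allocation_mode examples concrete, the following is a minimal sketch of how such a string could be decoded. It is not AReaL's parser: parse_allocation_mode is a hypothetical helper that only understands the d, t, and c prefixes shown above, and it assumes the GPU count is the product of the three sizes.

```python
import math
import re
from typing import Dict, Tuple

# Only the prefixes that appear in the examples above; anything else is rejected.
_DIM_NAMES = {
    "d": "data_parallel_size",
    "t": "tensor_parallel_size",
    "c": "context_parallel_size",
}


def parse_allocation_mode(mode: str) -> Tuple[str, Dict[str, int]]:
    """Split 'backend:d2t2c2' into the backend name and a dict of parallel sizes."""
    backend, sep, dims = mode.partition(":")
    if not sep or not re.fullmatch(r"(?:[a-z]\d+)+", dims):
        raise ValueError(f"unrecognized allocation mode: {mode!r}")
    sizes = {name: 1 for name in _DIM_NAMES.values()}
    for prefix, value in re.findall(r"([a-z])(\d+)", dims):
        if prefix not in _DIM_NAMES:
            raise ValueError(f"prefix {prefix!r} is not handled by this sketch")
        sizes[_DIM_NAMES[prefix]] = int(value)
    return backend, sizes


backend, sizes = parse_allocation_mode("fsdp:d2t2c2")
gpus_needed = math.prod(sizes.values())   # assumes the dimensions multiply
```

Under that assumption, fsdp:d2t2c2 occupies 8 GPUs and archon:d1t1c4 occupies 4.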


Architecture and Data Flow

Ulysses Sequence Parallel Data Flow

The Ulysses implementation re-lays out the tensors so that each rank computes attention locally over the full sequence, but only for a subset of the heads.


[Diagram: Ulysses Sequence Parallel Data Flow]

Sources: areal/models/transformers/ulyssess_patch.py:36-70, areal/models/fsdp/ulysses.py:7-15
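
The layout change can be simulated without any distributed runtime. The sketch below is an illustration under simplifying assumptions: there is no batch or head_dim axis, each cell is just a (token, head) label, and the functions carry a _sim suffix to make clear they are stand-ins written for this page, not the real gather_seq_scatter_heads and gather_heads_scatter_seq collectives.

```python
from typing import List, Tuple

Cell = Tuple[int, int]       # (token index, head index)
Local = List[List[Cell]]     # one rank's data, indexed as [token][head]


def make_sequence_sharded(seq_len: int, num_heads: int, cp_size: int) -> List[Local]:
    """Each rank holds seq_len / cp_size tokens with all heads."""
    chunk = seq_len // cp_size
    return [
        [[(t, h) for h in range(num_heads)] for t in range(r * chunk, (r + 1) * chunk)]
        for r in range(cp_size)
    ]


def gather_seq_scatter_heads_sim(shards: List[Local]) -> List[Local]:
    """Simulated all-to-all: afterwards each rank holds all tokens, heads / cp_size heads."""
    cp_size = len(shards)
    heads_per_rank = len(shards[0][0]) // cp_size
    out = []
    for dst in range(cp_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        # dst receives its slice of heads from every source rank, in rank order.
        out.append([row[lo:hi] for src in shards for row in src])
    return out


def gather_heads_scatter_seq_sim(shards: List[Local]) -> List[Local]:
    """Simulated inverse all-to-all: back to seq_len / cp_size tokens with all heads."""
    cp_size = len(shards)
    chunk = len(shards[0]) // cp_size
    out = []
    for dst in range(cp_size):
        rows = []
        for t in range(dst * chunk, (dst + 1) * chunk):
            # Concatenate the head slices that every rank holds for token t.
            rows.append([cell for src in shards for cell in src[t]])
        out.append(rows)
    return out


seq_sharded = make_sequence_sharded(seq_len=8, num_heads=4, cp_size=2)
head_sharded = gather_seq_scatter_heads_sim(seq_sharded)
restored = gather_heads_scatter_seq_sim(head_sharded)
```

After the first call, rank 1 holds heads 2 and 3 for all 8 tokens, which is the layout attention needs. The second call returns exactly the original sequence-sharded layout.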


Vision Model Support

For Vision-Language Models (VLMs), AReaL provides specialized optimizations to handle image embeddings within a sequence-parallel context. This includes slicing inputs_embeds and managing multimodal tensors such as visual_pos_masks and deepstack_visual_embeds (areal/models/transformers/ulyssess_patch.py:74-133).

Vision Slicing Mechanism


[Diagram: Vision Slicing Integration in Ulysses]

Key Functions:

  • ulysses_prepare_inputs: Coordinates the padding and slicing of inputs before they enter the model.
  • patch_vlm_for_ulysses_input_slicing: Injects logic into the model's forward call to slice inputs_embeds and visual_pos_masks specifically for VLMs like Qwen3-VL (areal/models/transformers/ulyssess_patch.py:74-99).
  • slice_input_tensor: Utility that performs the actual tensor slicing based on the rank in the Ulysses process group (areal/models/fsdp/ulysses.py:14).

Sources: areal/models/transformers/ulyssess_patch.py:74-141, areal/models/fsdp/ulysses.py:14-15
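
One way to picture the slicing is shown below. This is a sketch under assumptions, not the patched forward itself: slice_for_rank and local_visual_range are hypothetical helpers, lists stand in for tensors, and the idea that each rank selects its visual embeddings by counting the visual positions held by earlier ranks is an assumption about how such slicing can be kept consistent.

```python
from typing import List, Sequence, Tuple


def slice_for_rank(values: Sequence, rank: int, cp_size: int) -> list:
    """Return the contiguous chunk of `values` owned by `rank`."""
    if len(values) % cp_size != 0:
        raise ValueError("length must be divisible by cp_size (pad first)")
    chunk = len(values) // cp_size
    return list(values[rank * chunk:(rank + 1) * chunk])


def local_visual_range(visual_pos_mask: List[bool], rank: int, cp_size: int) -> Tuple[int, int]:
    """Half-open range [start, end) into the full list of visual embeddings
    that corresponds to the visual positions inside this rank's chunk."""
    chunk = len(visual_pos_mask) // cp_size
    start = sum(visual_pos_mask[:rank * chunk])            # visual tokens on earlier ranks
    end = start + sum(visual_pos_mask[rank * chunk:(rank + 1) * chunk])
    return start, end


# 8 positions, 4 of them image tokens; 2 CP ranks.
mask = [False, True, True, True, True, False, False, False]
visual_embeds = ["v0", "v1", "v2", "v3"]   # one entry per True position in `mask`

per_rank = []
for rank in range(2):
    start, end = local_visual_range(mask, rank, cp_size=2)
    per_rank.append((slice_for_rank(mask, rank, 2), visual_embeds[start:end]))
```

Rank 0 ends up with three visual positions and the first three embeddings, and rank 1 with the remaining one, so the mask slice and the embedding slice stay aligned on every rank.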


Qwen3-VL Specific Implementation

For Qwen3-VL models, AReaL provides a dedicated ulysses_flash_attn_forward that handles the specific attention architecture of Qwen3-VL, including the q_norm and k_norm operations applied before sequence-to-head redistribution (areal/models/transformers/qwen3_vl.py:25-96).


[Diagram: Qwen3-VL Ulysses Attention Path]

Sources: areal/models/transformers/qwen3_vl.py:25-96


Megatron Packed Context Parallel

For MegatronEngine, AReaL supports packed sequence training with context parallelism. This implementation uses an interleaved pattern (CP*2 chunks) for load balancing with causal masking (areal/engine/megatron_utils/packed_context_parallel.py:14-22).
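
The load-balancing idea behind the CP*2 chunks can be sketched as follows. The pairing used here, where rank r takes chunk r and chunk 2*CP-1-r, is the scheme commonly associated with Megatron-style context parallelism and is an assumption about the exact pattern; the helper names are hypothetical, and the cost model simply counts how many chunks each chunk attends to under a causal mask.

```python
from typing import List, Tuple


def cp_chunk_ids(rank: int, cp_size: int) -> Tuple[int, int]:
    """Assumed pairing: one early chunk and its mirror-image late chunk."""
    return rank, 2 * cp_size - 1 - rank


def split_for_rank(tokens: List[int], rank: int, cp_size: int) -> List[int]:
    """Return the two chunks of one sequence that `rank` keeps."""
    num_chunks = 2 * cp_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")
    chunk = len(tokens) // num_chunks
    first, second = cp_chunk_ids(rank, cp_size)
    return (tokens[first * chunk:(first + 1) * chunk]
            + tokens[second * chunk:(second + 1) * chunk])


def causal_cost(rank: int, cp_size: int) -> int:
    """Chunk i attends to chunks 0..i, so its relative cost is i + 1."""
    return sum(chunk_id + 1 for chunk_id in cp_chunk_ids(rank, cp_size))


tokens = list(range(8))
rank0 = split_for_rank(tokens, rank=0, cp_size=2)   # chunks 0 and 3
rank1 = split_for_rank(tokens, rank=1, cp_size=2)   # chunks 1 and 2
costs = [causal_cost(r, cp_size=4) for r in range(4)]
```

With a plain contiguous split, the last rank would attend to far more keys than the first. With the mirrored pairing, every rank's relative cost is the same (2*CP + 1 in this simple model).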

Packed Sequence Reassembly

When computing logprobs or rewards on packed sequences, the CP-local tensors must be gathered and reassembled into the original sequence order. This is handled by reassemble_cp_packed_logprobs, which uses a differentiable all_gather and index-based permutation (areal/engine/megatron_utils/packed_context_parallel.py:148-180).


[Diagram: Megatron Packed Context Parallel Pipeline]

Sources: areal/engine/megatron_utils/packed_context_parallel.py:14-180, tests/test_reassemble_cp_logprobs.py:63-92
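
The index-based permutation can be illustrated with plain lists. This sketch is not reassemble_cp_packed_logprobs: it assumes rank r keeps chunks r and 2*CP-1-r of every packed sequence, that each rank's local tensor concatenates those chunks sequence by sequence, and that the gathered tensor is the concatenation of the local tensors in rank order. All helper names are hypothetical.

```python
from typing import List


def split_packed_for_rank(values: List[float], seq_lens: List[int],
                          rank: int, cp_size: int) -> List[float]:
    """Build one rank's local tensor from a packed batch (assumed layout)."""
    num_chunks = 2 * cp_size
    local, start = [], 0
    for seq_len in seq_lens:
        chunk = seq_len // num_chunks
        for chunk_id in (rank, num_chunks - 1 - rank):
            lo = start + chunk_id * chunk
            local.extend(values[lo:lo + chunk])
        start += seq_len
    return local


def build_restore_index(seq_lens: List[int], cp_size: int) -> List[int]:
    """index[i] is the position in the gathered tensor of original token i."""
    num_chunks = 2 * cp_size
    local_total = sum(seq_lens) // cp_size   # length of each rank's local tensor
    index, local_offset = [], 0
    for seq_len in seq_lens:
        if seq_len % num_chunks != 0:
            raise ValueError("each sequence must be divisible by 2 * cp_size")
        chunk = seq_len // num_chunks
        seq_index = [0] * seq_len
        for rank in range(cp_size):
            for slot, chunk_id in enumerate((rank, num_chunks - 1 - rank)):
                for off in range(chunk):
                    seq_index[chunk_id * chunk + off] = (
                        rank * local_total + local_offset + slot * chunk + off
                    )
        index.extend(seq_index)
        local_offset += 2 * chunk
    return index


seq_lens = [4, 8]
original = [float(i) for i in range(12)]
# Stand-in for all_gather: concatenate the local tensors in rank order.
gathered = [x for r in range(2) for x in split_packed_for_rank(original, seq_lens, r, 2)]
index = build_restore_index(seq_lens, cp_size=2)
reassembled = [gathered[i] for i in index]
```

In a tensor framework the last line would typically be a single indexing operation on the gathered tensor, which is what allows gradients to flow back to each rank's shard.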


Core Implementation Details

Attention Patching

AReaL patches standard transformer models to use _ulysses_flash_attention_forward. This function manages the transitions between sequence-sharded and head-sharded states.

| State | Shape | Dimension Order |
|---|---|---|
| Sequence Sharded | (Batch, SeqLen/CP, Heads, HeadDim) | Input to All-to-All |
| Head Sharded | (Batch, SeqLen, Heads/CP, HeadDim) | Input to Attention |

Sources: areal/models/transformers/ulyssess_patch.py:36-70, areal/models/fsdp/ulysses.py:7-13
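
As a quick check of the two layouts in the table, the sketch below computes the per-rank shapes for a given configuration. ulysses_shapes is a hypothetical helper written for this page, and the model dimensions in the example (32 heads, head dimension 128) are illustrative values rather than figures from the source.

```python
import math
from typing import Dict, Tuple

Shape = Tuple[int, int, int, int]


def ulysses_shapes(batch: int, seq_len: int, heads: int,
                   head_dim: int, cp_size: int) -> Dict[str, Shape]:
    """Per-rank tensor shapes before and after the all-to-all."""
    if seq_len % cp_size != 0 or heads % cp_size != 0:
        raise ValueError("seq_len and heads must both be divisible by cp_size")
    return {
        "sequence_sharded": (batch, seq_len // cp_size, heads, head_dim),
        "head_sharded": (batch, seq_len, heads // cp_size, head_dim),
    }


# A 128K-token sequence split across 8 CP ranks.
shapes = ulysses_shapes(batch=1, seq_len=131072, heads=32, head_dim=128, cp_size=8)
elements = {name: math.prod(shape) for name, shape in shapes.items()}
```

Both layouts hold the same number of elements per rank, so the all-to-all changes which axis is sharded without changing the per-rank activation size.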

Logprob Computation in RL

During training (for example, SFT or PPO), the loss function must account for padded dummy sequences created during micro-batching. In compute_packed_sft_loss, sequences with zero valid tokens (often a result of Ulysses SP padding) are skipped so that the loss is normalized correctly (areal/trainer/sft/lm_engine.py:101-109).

Sources: areal/trainer/sft/lm_engine.py:81-131
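
The effect of skipping empty sequences is easiest to see on a small packed batch. The sketch below is not compute_packed_sft_loss: packed_sft_loss_sketch is a hypothetical function, and its normalization (mean negative logprob per sequence, then a mean over the sequences that have valid tokens) is an assumption chosen for illustration.

```python
from typing import List


def packed_sft_loss_sketch(logprobs: List[float], loss_mask: List[int],
                           cu_seqlens: List[int]) -> float:
    """Average per-sequence negative logprob, ignoring sequences with no valid tokens."""
    per_sequence = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        valid = [lp for lp, m in zip(logprobs[start:end], loss_mask[start:end]) if m]
        if not valid:
            continue   # dummy sequence made only of padding: skip it entirely
        per_sequence.append(-sum(valid) / len(valid))
    if not per_sequence:
        return 0.0
    return sum(per_sequence) / len(per_sequence)


# Three packed sequences of length 2; the last one is pure padding.
logprobs = [-1.0, -3.0, -5.0, -7.0, 0.0, 0.0]
loss_mask = [1, 1, 0, 1, 0, 0]
cu_seqlens = [0, 2, 4, 6]
loss = packed_sft_loss_sketch(logprobs, loss_mask, cu_seqlens)
```

The two real sequences contribute 2.0 and 7.0, giving a loss of 4.5. Counting the padded third sequence in the denominator would instead dilute the result to 3.0.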


Constraints and Validation

Ulysses parallelism imposes strict constraints on model architecture and input shapes:

  1. Head Divisibility: The number of attention heads (num_attention_heads) must be divisible by ulysses_sp_size. The number of KV heads (num_key_value_heads) must be divisible as well, except in the GQA case described in item 3 (areal/models/transformers/ulyssess_patch.py:165-175).
  2. Sequence Length: The total sequence length must be divisible by the CP size. For Megatron packed sequences, lengths must be a multiple of tp_size * cp_size * 2 (areal/engine/megatron_utils/packed_context_parallel.py:31-37).
  3. Grouped Query Attention (GQA): If ulysses_sp_size is greater than the number of KV heads, AReaL automatically applies repeat_kv so that the All-to-All operation stays balanced (areal/models/transformers/ulyssess_patch.py:21-33).
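
These constraints can be expressed as a small pre-flight check. The sketch below is hypothetical and does not reproduce AReaL's validation code: the function names are invented for this page, and the repeat factor of sp_size // num_key_value_heads (with the requirement that sp_size be a multiple of the KV head count) is an assumption about how repeat_kv would balance the heads.

```python
def kv_repeat_factor(num_attention_heads: int, num_key_value_heads: int,
                     sp_size: int) -> int:
    """Validate head counts and return how many times KV heads would be repeated."""
    if num_attention_heads % sp_size != 0:
        raise ValueError("num_attention_heads must be divisible by sp_size")
    if num_key_value_heads % sp_size == 0:
        return 1   # KV heads already split evenly across ranks
    if sp_size > num_key_value_heads and sp_size % num_key_value_heads == 0:
        # Assumed GQA handling: repeat KV heads until there is one per rank.
        return sp_size // num_key_value_heads
    raise ValueError("num_key_value_heads is incompatible with sp_size")


def round_up_packed_length(seq_len: int, tp_size: int, cp_size: int) -> int:
    """Smallest length >= seq_len that is a multiple of tp_size * cp_size * 2."""
    multiple = tp_size * cp_size * 2
    return -(-seq_len // multiple) * multiple


no_repeat = kv_repeat_factor(num_attention_heads=32, num_key_value_heads=8, sp_size=4)
with_repeat = kv_repeat_factor(num_attention_heads=32, num_key_value_heads=2, sp_size=4)
padded_len = round_up_packed_length(1001, tp_size=2, cp_size=2)
```

With 8 KV heads and 4 ranks no repetition is needed, while 2 KV heads across 4 ranks would need each KV head repeated twice. A 1001-token packed sequence with tp_size=2 and cp_size=2 would be padded to 1008.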

Summary of Key Components

| Component | File Path | Role |
|---|---|---|
| apply_monkey_patch | areal/models/transformers/ulyssess_patch.py:149 | Entry point for patching HF models with Ulysses. |
| gather_seq_scatter_heads | areal/models/fsdp/ulysses.py:11 | All-to-All to move from sequence-sharded to head-sharded. |
| gather_heads_scatter_seq | areal/models/fsdp/ulysses.py:10 | All-to-All to move from head-sharded back to sequence-sharded. |
| patch_vlm_for_ulysses_input_slicing | areal/models/transformers/ulyssess_patch.py:74 | Handles VLM-specific slicing for visual embeddings. |
| reassemble_cp_packed_logprobs | areal/engine/megatron_utils/packed_context_parallel.py:148 | Reconstructs full sequence logprobs from CP shards. |
| ulysses_flash_attn_forward | areal/models/transformers/qwen3_vl.py:25 | Specialized Ulysses attention for Qwen3-VL models. |

Sources: areal/models/transformers/ulyssess_patch.py, areal/models/fsdp/ulysses.py, areal/models/transformers/qwen3_vl.py, areal/engine/megatron_utils/packed_context_parallel.py