Last indexed: 7 May 2026 (2e12c1)

Context Parallel (Ulysses)

Purpose and Scope

This page documents the Context Parallel (CP) feature in AReaL, implemented using Ulysses sequence parallelism. Context Parallel distributes long sequences across multiple GPUs to enable training with context lengths that would otherwise exceed single-GPU memory constraints. It is particularly critical for long-context RL alignment where sequences can reach 128K+ tokens.

Sources: areal/models/fsdp/ulysses.py1-13 areal/models/transformers/ulyssess_patch.py1-16

Overview

Context Parallel (CP) in AReaL uses the Ulysses algorithm to partition input sequences across multiple GPUs along the sequence dimension. Unlike Tensor Parallel (TP), which shards weights, Ulysses shards activation tensors along the sequence dimension and switches to a head-sharded layout only for the attention computation.

Key characteristics:

Sequence Sharding: Splits sequences into equal chunks across CP ranks.
All-to-All Collectives: Uses all_to_all to redistribute data between the sequence dimension and the head dimension during attention areal/models/transformers/ulyssess_patch.py50-52
Divisibility Requirements: Sequence length must be divisible by context_parallel_size, and attention heads must be divisible by the CP size areal/models/transformers/ulyssess_patch.py163-173
Backend Support: Integrated into FSDPEngine, MegatronEngine, and ArchonEngine.

Sources: areal/models/transformers/ulyssess_patch.py36-70 areal/models/fsdp/ulysses.py31-51

Configuration

Context Parallel is configured via the context_parallel_size parameter. In AReaL's allocation_mode syntax, this is represented by the c prefix.

Parameter	Type	Default	Description
`context_parallel_size`	`int`	`1`	Number of GPUs to split sequences across.
`shard_vision_across_sp`	`bool`	`False`	Whether to distribute vision encoder work across CP ranks for VLMs areal/models/transformers/ulyssess_patch.py152

Example allocation_mode:

fsdp:d2t2c2: 2 Data Parallel replicas, 2 Tensor Parallel ranks, 2 Context Parallel ranks.
archon:d1t1c4: 4-way Context Parallel on a single DP replica.

Sources: areal/models/transformers/ulyssess_patch.py149-153
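To make the `d`/`t`/`c` notation concrete, the following is a minimal sketch of a parser for the two example strings above. It is not AReaL's parser: `ParsedAllocation` and `parse_allocation_mode` are hypothetical names, only the three prefixes shown on this page are handled, and the assumption that the GPU count equals the product of the three dimensions is labeled in the code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedAllocation:
    """Hypothetical container for illustration; not an AReaL class."""

    backend: str
    data: int = 1     # "d" prefix
    tensor: int = 1   # "t" prefix
    context: int = 1  # "c" prefix

    @property
    def world_size(self) -> int:
        # Assumption: GPU count is the product of the three dimensions.
        return self.data * self.tensor * self.context


def parse_allocation_mode(mode: str) -> ParsedAllocation:
    """Parse strings shaped like 'fsdp:d2t2c2' (toy grammar only)."""
    backend, sep, dims = mode.partition(":")
    if not sep or not backend:
        raise ValueError(f"expected '<backend>:<dims>', got {mode!r}")
    pairs = re.findall(r"([dtc])(\d+)", dims)
    # Reject anything this toy grammar does not fully consume.
    if not pairs or "".join(k + v for k, v in pairs) != dims:
        raise ValueError(f"unsupported dimension string: {dims!r}")
    names = {"d": "data", "t": "tensor", "c": "context"}
    return ParsedAllocation(backend, **{names[k]: int(v) for k, v in pairs})


print(parse_allocation_mode("fsdp:d2t2c2"))
print(parse_allocation_mode("archon:d1t1c4").world_size)
```

Under these assumptions, `fsdp:d2t2c2` occupies 8 GPUs and `archon:d1t1c4` occupies 4, all of them in one context-parallel group.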

Architecture and Data Flow

Ulysses Sequence Parallel Data Flow

The Ulysses implementation transforms the tensor layout so that each rank computes attention locally over the full sequence, but only for its own subset of heads.

Title: Ulysses Sequence Parallel Data Flow

Sources: areal/models/transformers/ulyssess_patch.py36-70 areal/models/fsdp/ulysses.py7-15
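The layout change can be illustrated without any GPUs. The sketch below simulates the two all-to-all steps in a single process with nested Python lists, where each element is tagged with its `(token, head)` coordinates. The `_sim` functions are hypothetical stand-ins that only mirror the data movement; they are not the AReaL functions of similar name and perform no real communication.

```python
def make_sequence_sharded(seq_len, num_heads, cp_size):
    """Rank r holds tokens [r*chunk, (r+1)*chunk) with ALL heads."""
    chunk = seq_len // cp_size
    return [
        [[(s, h) for h in range(num_heads)]
         for s in range(r * chunk, (r + 1) * chunk)]
        for r in range(cp_size)
    ]


def gather_seq_scatter_heads_sim(shards):
    """Sequence-sharded -> head-sharded (simulated all-to-all)."""
    cp_size = len(shards)
    heads_per_rank = len(shards[0][0]) // cp_size
    out = []
    for dst in range(cp_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        # Every source rank sends dst its tokens, restricted to dst's heads.
        out.append([tok[lo:hi] for src in range(cp_size) for tok in shards[src]])
    return out


def gather_heads_scatter_seq_sim(shards):
    """Head-sharded -> sequence-sharded (the inverse all-to-all)."""
    cp_size = len(shards)
    chunk = len(shards[0]) // cp_size
    out = []
    for dst in range(cp_size):
        tokens = []
        for s in range(dst * chunk, (dst + 1) * chunk):
            merged = []
            for src in range(cp_size):
                merged.extend(shards[src][s])  # collect head slices in rank order
            tokens.append(merged)
        out.append(tokens)
    return out


seq_sharded = make_sequence_sharded(seq_len=8, num_heads=4, cp_size=2)
head_sharded = gather_seq_scatter_heads_sim(seq_sharded)
restored = gather_heads_scatter_seq_sim(head_sharded)
print(len(seq_sharded[0]), len(seq_sharded[0][0]))    # tokens, heads per rank before
print(len(head_sharded[0]), len(head_sharded[0][0]))  # tokens, heads per rank after
```

Before the first step each rank holds 4 tokens with 4 heads; after it each rank holds all 8 tokens with 2 heads, which is what lets attention run locally. The second step restores the original layout exactly.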

Vision Model Support

For Vision-Language Models (VLMs), AReaL provides specialized optimizations to handle image embeddings within a sequence-parallel context. This includes slicing inputs_embeds and managing multi-modal specific tensors like visual_pos_masks and deepstack_visual_embeds areal/models/transformers/ulyssess_patch.py74-133

Vision Slicing Mechanism

Title: Vision Slicing Integration in Ulysses

Key Functions:

ulysses_prepare_inputs: Coordinates the padding and slicing of inputs before they enter the model.
patch_vlm_for_ulysses_input_slicing: Injects logic into the model's forward call to slice inputs_embeds and visual_pos_masks specifically for VLMs like Qwen3-VL areal/models/transformers/ulyssess_patch.py74-99
slice_input_tensor: Utility that performs the actual tensor slicing based on the rank in the Ulysses process group areal/models/fsdp/ulysses.py14

Sources: areal/models/transformers/ulyssess_patch.py74-141 areal/models/fsdp/ulysses.py14-15
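The point of slicing `inputs_embeds` and `visual_pos_masks` together is that both must be cut at the same boundaries, so that each rank's mask still describes its own embeddings. A minimal sketch of that idea follows, using plain lists instead of tensors. `pad_to_multiple` and `slice_for_rank` are hypothetical helpers, and the choice of padding value is an assumption made for illustration.

```python
def pad_to_multiple(values, multiple, pad_value):
    """Right-pad so the length is divisible by `multiple`."""
    pad = (-len(values)) % multiple
    return list(values) + [pad_value] * pad


def slice_for_rank(values, cp_size, rank):
    """Return the contiguous chunk owned by `rank`."""
    if len(values) % cp_size != 0:
        raise ValueError("length must be divisible by cp_size; pad first")
    chunk = len(values) // cp_size
    return values[rank * chunk:(rank + 1) * chunk]


# Toy sequence of 10 positions; positions 2..5 are image tokens.
embeds = [f"e{i}" for i in range(10)]
visual_mask = [2 <= i <= 5 for i in range(10)]

cp_size = 4
embeds_padded = pad_to_multiple(embeds, cp_size, "pad")
mask_padded = pad_to_multiple(visual_mask, cp_size, False)

# Slice both tensors with identical boundaries on every rank.
local = [
    (slice_for_rank(embeds_padded, cp_size, r),
     slice_for_rank(mask_padded, cp_size, r))
    for r in range(cp_size)
]
visual_per_rank = [sum(mask) for _, mask in local]
print(visual_per_rank)
```

In this toy case the image tokens straddle the boundary between rank 0 and rank 1, so the per-rank visual token counts differ, and the last rank holds mostly padding.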

Qwen3-VL Specific Implementation

For Qwen3-VL models, AReaL provides a dedicated ulysses_flash_attn_forward that handles the specific attention architecture of Qwen3-VL, including q_norm and k_norm operations before sequence-to-head redistribution areal/models/transformers/qwen3_vl.py25-96

Title: Qwen3-VL Ulysses Attention Path

Sources: areal/models/transformers/qwen3_vl.py25-96
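One way to see why the normalization can run before the redistribution: assuming `q_norm` and `k_norm` are RMS norms over the head dimension (as in the Hugging Face Qwen3 attention layers; the learned scale is omitted here), each `(token, head)` vector is normalized independently, so a rank can normalize its local sequence shard with no communication. The sketch below checks this on toy data with hypothetical helper names.

```python
import math


def rms_norm(vec, eps=1e-6):
    """RMS-normalize one head_dim vector (learned weight omitted)."""
    scale = 1.0 / math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x * scale for x in vec]


def norm_heads(tokens):
    """Apply rms_norm to every head of every token: [seq][head][head_dim]."""
    return [[rms_norm(head) for head in token] for token in tokens]


# Toy activations: 4 tokens, 2 heads, head_dim 4.
full = [
    [[float(s + h + d + 1) for d in range(4)] for h in range(2)]
    for s in range(4)
]

# Two sequence shards, as two CP ranks would hold them.
shards = [full[:2], full[2:]]
local_then_concat = [tok for shard in shards for tok in norm_heads(shard)]

# Normalizing per shard matches normalizing the full sequence.
print(local_then_concat == norm_heads(full))
```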

Megatron Packed Context Parallel

For MegatronEngine, AReaL supports packed sequence training with context parallelism. This implementation splits each sequence into CP*2 chunks and assigns them to ranks in an interleaved pattern, which balances the causal-attention workload across ranks areal/engine/megatron_utils/packed_context_parallel.py14-22
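The load-balancing effect of using CP*2 chunks can be sketched as follows. The sketch assumes the pairing commonly used with Megatron-style context parallelism, where rank r takes chunk r and chunk 2*CP-1-r; whether AReaL uses exactly this pairing is an assumption, and the function names are hypothetical. The cost model simply counts query-key pairs under a causal mask.

```python
def cp_chunk_indices(cp_size, rank):
    """Assumed pairing: one early chunk plus its mirror-image late chunk."""
    return (rank, 2 * cp_size - 1 - rank)


def split_for_rank(tokens, cp_size, rank):
    """Return the two chunks of `tokens` owned by `rank`, concatenated."""
    n_chunks = 2 * cp_size
    if len(tokens) % n_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")
    chunk = len(tokens) // n_chunks
    first, second = cp_chunk_indices(cp_size, rank)
    return (tokens[first * chunk:(first + 1) * chunk]
            + tokens[second * chunk:(second + 1) * chunk])


def causal_cost(positions):
    """Query at position p attends to p + 1 keys under a causal mask."""
    return sum(p + 1 for p in positions)


positions = list(range(16))
cp_size = 2
per_rank = [split_for_rank(positions, cp_size, r) for r in range(cp_size)]
costs = [causal_cost(p) for p in per_rank]

# For comparison: plain contiguous halves are unbalanced.
naive_costs = [causal_cost(positions[:8]), causal_cost(positions[8:])]
print(per_rank, costs, naive_costs)
```

With 16 tokens and CP=2, the interleaved split gives both ranks the same causal workload, whereas contiguous halves leave the second rank with most of the work.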

Packed Sequence Reassembly

When computing logprobs or rewards on packed sequences, the CP-local tensors must be gathered and reassembled into the original sequence order. This is handled by reassemble_cp_packed_logprobs, which uses a differentiable all_gather and index-based permutation areal/engine/megatron_utils/packed_context_parallel.py148-180

Title: Megatron Packed Context Parallel Pipeline

Sources: areal/engine/megatron_utils/packed_context_parallel.py14-180 tests/test_reassemble_cp_logprobs.py63-92
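The gather-then-permute idea can be sketched on plain lists. Here list concatenation in rank order stands in for the all_gather, and the permutation is derived from the same chunk assignment used to split the data. The pairing of chunk r with chunk 2*CP-1-r is an assumption, and `rank_positions` and `reassemble` are hypothetical names rather than AReaL functions.

```python
def rank_positions(seq_lens, cp_size, rank):
    """Packed-buffer positions held by `rank`, sequence by sequence."""
    positions, offset = [], 0
    for n in seq_lens:
        chunk = n // (2 * cp_size)
        for c in (rank, 2 * cp_size - 1 - rank):  # assumed interleaved pairing
            positions.extend(range(offset + c * chunk, offset + (c + 1) * chunk))
        offset += n
    return positions


def reassemble(gathered, seq_lens, cp_size):
    """Undo the CP split: place each gathered value at its original position."""
    order = [p for r in range(cp_size) for p in rank_positions(seq_lens, cp_size, r)]
    out = [None] * sum(seq_lens)
    for value, pos in zip(gathered, order):
        out[pos] = value
    return out


seq_lens = [8, 4]  # two packed sequences, each divisible by 2 * cp_size
cp_size = 2
packed_logprobs = [-0.1 * i for i in range(sum(seq_lens))]

# What each rank would hold locally after the split.
shards = [
    [packed_logprobs[p] for p in rank_positions(seq_lens, cp_size, r)]
    for r in range(cp_size)
]
gathered = shards[0] + shards[1]  # stand-in for all_gather in rank order
restored = reassemble(gathered, seq_lens, cp_size)
print(restored == packed_logprobs)
```

Because the permutation is a pure index operation, a tensor version of the same step would stay differentiable as long as the gather itself is.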

Core Implementation Details

Attention Patching

AReaL patches standard transformer models to use _ulysses_flash_attention_forward. This function manages the transitions between sequence-sharded and head-sharded states.

State	Shape	Role
Sequence Sharded	`(Batch, SeqLen/CP, Heads, HeadDim)`	Input to All-to-All
Head Sharded	`(Batch, SeqLen, Heads/CP, HeadDim)`	Input to Attention

Sources: areal/models/transformers/ulyssess_patch.py36-70 areal/models/fsdp/ulysses.py7-13

Logprob Computation in RL

During training (e.g., SFT or PPO), the loss function must account for padded dummy sequences created during micro-batching. In compute_packed_sft_loss, sequences with zero valid tokens (often resulting from Ulysses SP padding) are skipped so that they do not distort the loss normalization areal/trainer/sft/lm_engine.py101-109

Sources: areal/trainer/sft/lm_engine.py81-131
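The effect of skipping empty sequences can be shown with a small sketch. It assumes a per-sequence mean of negative log-probabilities averaged over sequences, which may differ from the normalization AReaL actually uses. `packed_sft_loss_sketch` is a hypothetical function, not `compute_packed_sft_loss`.

```python
def packed_sft_loss_sketch(logprobs, loss_mask, cu_seqlens):
    """Mean over sequences of the mean NLL over valid tokens.

    Sequences whose mask is all zero (e.g. padding dummies) are skipped,
    so they neither divide by zero nor dilute the average.
    """
    per_seq = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        valid = [lp for lp, m in zip(logprobs[start:end], loss_mask[start:end]) if m]
        if not valid:
            continue  # dummy sequence: contributes nothing
        per_seq.append(-sum(valid) / len(valid))
    if not per_seq:
        return 0.0
    return sum(per_seq) / len(per_seq)


# Three packed sequences; the last one is a padding dummy with no valid tokens.
logprobs = [-1.0, -3.0, -2.0, -9.0, -9.0]
loss_mask = [1, 1, 1, 0, 0]
cu_seqlens = [0, 2, 3, 5]

loss = packed_sft_loss_sketch(logprobs, loss_mask, cu_seqlens)
print(loss)
```

If the dummy sequence were counted in the denominator, the same data would yield a smaller loss (4/3 instead of 2), which is the distortion the skip avoids.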

Constraints and Validation

Ulysses parallelism imposes strict constraints on model architecture:

Head Divisibility: The number of attention heads (num_attention_heads) and KV heads (num_key_value_heads) must be divisible by ulysses_sp_size areal/models/transformers/ulyssess_patch.py165-175
Sequence Length: The total sequence length must be divisible by the CP size. For Megatron packed sequences, lengths must be a multiple of tp_size * cp_size * 2 areal/engine/megatron_utils/packed_context_parallel.py31-37
Grouped Query Attention (GQA): If ulysses_sp_size is greater than the number of KV heads, AReaL automatically repeats the KV heads (repeat_kv) so that the All-to-All operation remains balanced across ranks areal/models/transformers/ulyssess_patch.py21-33
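These rules can be collected into a small pre-flight check. The sketch below is not AReaL's validation code, and all function names are hypothetical. How the KV-head divisibility rule combines with the GQA repeat rule is an assumption: the sketch accepts KV head counts that divide the SP size and returns the repeat factor needed to give every rank at least one KV head.

```python
def kv_repeat_factor(num_kv_heads, sp_size):
    """How many times KV heads must be repeated before the all-to-all."""
    if num_kv_heads % sp_size == 0:
        return 1  # KV heads already split evenly
    if sp_size % num_kv_heads == 0:
        return sp_size // num_kv_heads  # assumed GQA case: repeat to sp_size heads
    raise ValueError(
        f"num_key_value_heads={num_kv_heads} is incompatible with sp_size={sp_size}"
    )


def check_ulysses_heads(num_attention_heads, num_kv_heads, sp_size):
    """Validate head counts and return the KV repeat factor."""
    if num_attention_heads % sp_size != 0:
        raise ValueError(
            f"num_attention_heads={num_attention_heads} not divisible by {sp_size}"
        )
    return kv_repeat_factor(num_kv_heads, sp_size)


def megatron_packed_multiple(tp_size, cp_size):
    """Required length multiple for Megatron packed sequences."""
    return tp_size * cp_size * 2


def padded_length(seq_len, multiple):
    """Smallest multiple of `multiple` that is >= seq_len."""
    return -(-seq_len // multiple) * multiple


print(check_ulysses_heads(32, 8, 4))   # KV heads split evenly
print(check_ulysses_heads(32, 4, 8))   # KV heads must be repeated
print(padded_length(1000, megatron_packed_multiple(tp_size=2, cp_size=4)))
```

Running such a check before model patching surfaces an incompatible head count or an unpadded sequence length as a clear error rather than a shape mismatch inside the collective.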

Summary of Key Components

Component	File Path	Role
`apply_monkey_patch`	areal/models/transformers/ulyssess_patch.py149	Entry point for patching HF models with Ulysses.
`gather_seq_scatter_heads`	areal/models/fsdp/ulysses.py11	All-to-All to move from sequence-sharded to head-sharded.
`gather_heads_scatter_seq`	areal/models/fsdp/ulysses.py10	All-to-All to move from head-sharded back to sequence-sharded.
`patch_vlm_for_ulysses_input_slicing`	areal/models/transformers/ulyssess_patch.py74	Handles VLM-specific slicing for visual embeddings.
`reassemble_cp_packed_logprobs`	areal/engine/megatron_utils/packed_context_parallel.py148	Reconstructs full sequence logprobs from CP shards.
`ulysses_flash_attn_forward`	areal/models/transformers/qwen3_vl.py25	Specialized Ulysses attention for Qwen3-VL models.

Sources: areal/models/transformers/ulyssess_patch.py areal/models/fsdp/ulysses.py areal/models/transformers/qwen3_vl.py areal/engine/megatron_utils/packed_context_parallel.py

URL: https://deepwiki.com/inclusionAI/AReaL/8.6-context-parallel-(ulysses)