Ulysses SP: training models with millions of tokens | Keryc
Training models with full-book contexts is no longer a lab curiosity: it’s a practical need for tasks like document analysis, extended reasoning, code review, and RAG systems. Why the fuss? Attention scales quadratically, and a sequence of hundreds of thousands of tokens won’t fit on a single GPU.
What problem Ulysses solves
Attention in transformers scores every pair of tokens, so memory and FLOPs grow with the square of the sequence length. FlashAttention and similar optimizations ease memory pressure by avoiding materializing the full attention matrix, but the quadratic compute remains. So what do you do when a single novel is ~250K tokens and you need to train on several documents at once?
The traditional data-parallel solution doesn’t help: each GPU would still need to see the full sequence inside the attention block. Ulysses Sequence Parallelism (part of Snowflake AI Research’s ALST protocol) offers an elegant way to split attention across GPUs by parallelizing across attention heads.
How Ulysses Sequence Parallelism (SP) works
The key idea is to switch what is sharded at the attention boundary: outside attention, each GPU holds a slice of the sequence; for the attention computation itself, an all-to-all re-partitions the data so each GPU holds the full sequence for a subset of heads. Simplified steps:
Sequence sharding: the sequence is split along the time dimension; each GPU gets a chunk of tokens.
QKV projection: each GPU projects q, k and v for its local chunk.
All-to-all: a collective op redistributes data so each GPU has all sequence positions but only a subset of heads.
Local attention: each GPU computes attention for its assigned heads (FlashAttention or SDPA).
All-to-all: the redistribution is reversed to return to the sequence-sharded format.
Output projection: each GPU projects the output for its local chunk.
This costs two all-to-all collectives per attention layer. The benefit: heads are independent, so attention parallelizes cleanly, with low communication overhead on interconnects with good all-to-all bandwidth.
Communication: Ulysses moves O(total_seq * hidden / sp_size) per GPU, so per-GPU traffic shrinks as you add GPUs, while Ring Attention's per-GPU KV traffic does not. All-to-all also makes better use of the available bisection bandwidth; Ring serializes point-to-point hops around the ring.
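The layout change performed by the two all-to-alls can be sketched with plain array reshuffling. This is a single-process simulation with illustrative shapes, not the actual distributed collective:

```python
import numpy as np

S, H, D, P = 8, 4, 16, 2          # sequence, heads, head_dim, sp_size
x = np.random.randn(S, H, D)      # conceptual full QKV-like tensor

# Step 1: sequence sharding — rank r owns tokens [r*S/P, (r+1)*S/P).
shards = [x[r * S // P:(r + 1) * S // P] for r in range(P)]  # each (S/P, H, D)

# Step 2: all-to-all — every rank sends H/P heads of its chunk to each peer.
# Afterwards, rank r holds ALL tokens but only heads [r*H/P, (r+1)*H/P).
after_a2a = [
    np.concatenate(
        [shards[src][:, r * H // P:(r + 1) * H // P] for src in range(P)],
        axis=0,
    )
    for r in range(P)
]
for r in range(P):
    assert after_a2a[r].shape == (S, H // P, D)  # full sequence, head subset
    # Attention for these heads can now run locally (FlashAttention / SDPA).

# Step 3: the reverse all-to-all restores the sequence-sharded layout.
restored = [
    np.concatenate(
        [after_a2a[h][r * S // P:(r + 1) * S // P] for h in range(P)],
        axis=1,
    )
    for r in range(P)
]
assert all(np.array_equal(restored[r], shards[r]) for r in range(P))
```

Each rank only ever sends its local chunk (split across peers), which is where the O(total_seq * hidden / sp_size) per-GPU volume comes from.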
Integration with the Hugging Face ecosystem
Accelerate acts as the foundation to enable Ulysses via ParallelismConfig and DeepSpeed integration. With accelerator.prepare() the model is registered as UlyssesSPAttentionHF and the dataloader is wrapped to handle sequence sharding and shift_labels.
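A minimal configuration sketch of what this wiring can look like. The exact ParallelismConfig fields and import paths vary by version, so treat the parameter names below as assumptions and verify against the docs of your installed accelerate:

```python
# Hypothetical sketch — field names are assumptions, check your
# accelerate / DeepSpeed versions before use.
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig

pc = ParallelismConfig(sp_size=4)  # assumed Ulysses sequence-parallel degree
accelerator = Accelerator(parallelism_config=pc)

# prepare() registers the model's attention as UlyssesSPAttentionHF and wraps
# the dataloader to handle sequence sharding and shift_labels (per the text).
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```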
Transformers Trainer accepts TrainingArguments.parallelism_config and automatically handles:
wrapping the dataloader with UlyssesSPDataLoaderAdapter
adding shift_labels and detecting pre-shifted labels
summing per-rank loss sums and valid-token counts across ranks and computing the token-weighted loss
adjusting batch-size and dataloader-length calculations
TRL’s SFTTrainer extends this with optimizations for supervised fine-tuning of long sequences (pre-shifted labels, packing, pad_to_multiple_of equal to sp_size).
If you use a custom loop with accelerate, you must add weighted loss aggregation across SP ranks. Trainer and SFTTrainer already automate this.
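The weighted aggregation that Trainer automates (and a custom loop must reproduce) boils down to weighting each rank by its valid tokens rather than averaging per-rank means. A minimal single-process sketch; in a real loop the two sums would be all-reduced across the SP group:

```python
# Each SP rank computes (sum_of_token_losses, num_valid_tokens) for its shard.
# Naively averaging per-rank mean losses over-weights ranks whose shard is
# mostly padding; the correct global loss weights each rank by its tokens.
def aggregate_sp_loss(per_rank):
    """per_rank: list of (loss_sum, valid_tokens) tuples, one per SP rank."""
    total_loss = sum(loss_sum for loss_sum, _ in per_rank)
    total_tokens = sum(n for _, n in per_rank)
    return total_loss / total_tokens

# Rank 0: 100 valid tokens; rank 1: mostly padding, only 20 valid tokens.
ranks = [(250.0, 100), (10.0, 20)]
weighted = aggregate_sp_loss(ranks)        # 260 / 120 ≈ 2.1667
naive = sum(s / n for s, n in ranks) / 2   # unweighted mean of means = 1.5
assert weighted != naive                   # padding skews the naive average
```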
Practical comparison: Ulysses vs Ring Attention
Parallelism method: Ulysses uses head partitioning; Ring Attention uses ring exchange of KV.
Backend: Ulysses requires DeepSpeed ZeRO; Ring works with PyTorch FSDP2.
Attention support: Ulysses supports FlashAttention 2/3 and SDPA; Ring typically uses SDPA.
Communication: Ulysses does two all-to-all calls per layer (better bisection bandwidth usage), Ring does serialized point-to-point communication.
Constraints: Ulysses requires num_heads >= sp_size; Ring has no such limitation.
So, conclusion? Switching from one to the other is often as simple as tweaking your Accelerate config. Why not test both on your hardware?
Best practices and recommendations
Ensure divisibility: global length max_length must be divisible by sp_size. Use pad_to_multiple_of = sp_size.
Use FlashAttention 2 for Ampere and FlashAttention 3 for Hopper. Avoid FA2 on Blackwell; wait for FA4 when available.
For very large models, combine Ulysses with ZeRO Stage 3 and optimizer/parameter offload if needed.
Enable PYTORCH_ALLOC_CONF=expandable_segments:True to allow larger lengths.
Match sp_size and dp_shard_size to your GPU count: for example, on 4 GPUs use SP=4/DP=1 for max sequences or SP=2/DP=2 for a balance between length and throughput.
If available, enable use_liger_kernel=True and optimizations like FusedLinearCrossEntropy and TiledMLP to save memory on logits and MLPs.
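The divisibility rule from the first recommendation is simple to enforce by hand if your tokenizer or collator doesn't do it for you. A minimal sketch:

```python
def pad_to_multiple(length, multiple):
    """Round a sequence length up to the next multiple (e.g. of sp_size)."""
    return -(-length // multiple) * multiple  # ceiling division

# With sp_size=4, a 250_001-token document must be padded before sharding:
assert pad_to_multiple(250_001, 4) == 250_004
assert pad_to_multiple(96_000, 4) == 96_000  # already divisible, unchanged
```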
Essential benchmarks and loss verification
Experiments with Qwen3-4B on H100 80GB + ZeRO-3 show:
SP=4 on 4 GPUs lets you jump from 8K to 96K tokens with the same memory configuration (peak ~66 GB per GPU at 96K). At 128K there was OOM in that setup.
Throughput: at 64K tokens SP=4 achieved ~13.4K tokens/s, ~3.7x over the 8K baseline on 1 GPU. As sequence length grows, quadratic compute dominates and sequence parallelism becomes more efficient.
Loss equivalence: with matched token budgets (setting gradient accumulation steps GAS = SP where appropriate), training with SP and DP matches on token-normalized loss; residual differences appear in reported logs but not in the canonical objective.
These results imply that Ulysses is a practical tool for training with very long contexts without sacrificing training quality, as long as you tune batching and gradient accumulation correctly.
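The token-budget matching behind the loss-equivalence check is plain arithmetic. A sketch with assumed (illustrative) batch settings:

```python
# Illustrative numbers — micro batch size and sequence length are assumptions.
seq_len, micro_bs = 8192, 1

def tokens_per_step(dp, gas):
    """Tokens consumed per optimizer step: replicas * accumulation * batch * length."""
    return dp * gas * micro_bs * seq_len

# 4 GPUs pure data parallel vs. SP=4 (one replica, so GAS = SP to compensate):
baseline = tokens_per_step(dp=4, gas=1)
sp_run = tokens_per_step(dp=1, gas=4)
assert baseline == sp_run == 32768  # same token budget per optimizer step
```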
Requirements and recommended versions
DeepSpeed >= 0.18.1
accelerate >= 1.12
transformers >= 5.0 for Trainer integration
trl >= 0.18.0 for SFTTrainer
FlashAttention 2/3 depending on GPU
Final thoughts
Ulysses Sequence Parallelism turns a fundamental transformer bottleneck into a practical lever: shard the sequence and split heads inside attention, instead of duplicating the full sequence on every GPU. What does that mean in practical terms? That you can now feasibly train on books, document collections, and long reasoning sessions without needing impossible clusters.
If your project requires extended context, it’s worth trying Ulysses and comparing it to Ring Attention on your hardware. The right choice depends on your network topology, GPU count, and model architecture.