Ulysses SP: training models with millions of tokens | Keryc
Training models with full-book contexts is no longer a lab curiosity: it’s a practical need for tasks like document analysis, extended reasoning, code review, and RAG systems. Why the fuss? Attention scales quadratically, and a sequence of hundreds of thousands of tokens won’t fit on a single GPU.
What problem Ulysses solves
Attention in transformers scores every pair of tokens, so memory and FLOPs grow with the square of the sequence length. FlashAttention and similar optimizations ease memory pressure by avoiding materializing the full attention matrix, but the quadratic compute remains. So what do you do when a single novel is ~250K tokens and you need to train on several documents at once?
The traditional data-parallel solution doesn’t help: each GPU would still need to see the full sequence inside the attention block. Ulysses Sequence Parallelism (part of Snowflake AI Research’s ALST protocol) offers an elegant way to split attention across GPUs by parallelizing across attention heads.
How Ulysses Sequence Parallelism (SP) works
The key idea is to switch what is sharded at the attention boundary: outside attention, each GPU holds a slice of the sequence; for the attention computation itself, an all-to-all re-partitions the data so each GPU holds the full sequence for a subset of heads. Simplified steps:
Sequence sharding: the sequence is split along the time dimension; each GPU gets a chunk of tokens.
QKV projection: each GPU projects q, k and v for its local chunk.
All-to-all: a collective op redistributes data so each GPU has all sequence positions but only a subset of heads.
Local attention: each GPU computes attention for its assigned heads (FlashAttention or SDPA).
All-to-all: the redistribution is reversed to return to the sequence-sharded format.
Output projection: each GPU projects the output for its local chunk.
This costs two all-to-all collectives per attention layer. The benefit: heads are independent, so attention parallelizes cleanly, with low communication overhead on interconnects with good all-to-all bandwidth.
Communication: Ulysses moves O(total_seq * hidden / sp_size) per GPU, so per-GPU traffic shrinks as you add GPUs, while Ring Attention's per-GPU KV traffic does not. All-to-all also makes better use of the available bisection bandwidth; Ring serializes point-to-point hops around the ring.
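The layout change performed by the two all-to-alls can be sketched with plain array reshuffling. This is a single-process simulation with illustrative shapes, not the actual distributed collective:

```python
import numpy as np

S, H, D, P = 8, 4, 16, 2          # sequence, heads, head_dim, sp_size
x = np.random.randn(S, H, D)      # conceptual full QKV-like tensor

# Step 1: sequence sharding — rank r owns tokens [r*S/P, (r+1)*S/P).
shards = [x[r * S // P:(r + 1) * S // P] for r in range(P)]  # each (S/P, H, D)

# Step 2: all-to-all — every rank sends H/P heads of its chunk to each peer.
# Afterwards, rank r holds ALL tokens but only heads [r*H/P, (r+1)*H/P).
after_a2a = [
    np.concatenate(
        [shards[src][:, r * H // P:(r + 1) * H // P] for src in range(P)],
        axis=0,
    )
    for r in range(P)
]
for r in range(P):
    assert after_a2a[r].shape == (S, H // P, D)  # full sequence, head subset
    # Attention for these heads can now run locally (FlashAttention / SDPA).

# Step 3: the reverse all-to-all restores the sequence-sharded layout.
restored = [
    np.concatenate(
        [after_a2a[h][r * S // P:(r + 1) * S // P] for h in range(P)],
        axis=1,
    )
    for r in range(P)
]
assert all(np.array_equal(restored[r], shards[r]) for r in range(P))
```

Each rank only ever sends its local chunk (split across peers), which is where the O(total_seq * hidden / sp_size) per-GPU volume comes from.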
Integration with the Hugging Face ecosystem
Accelerate acts as the foundation to enable Ulysses via ParallelismConfig and DeepSpeed integration. With accelerator.prepare() the model is registered as UlyssesSPAttentionHF and the dataloader is wrapped to handle sequence sharding and shift_labels.
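A minimal configuration sketch of what this wiring can look like. The exact ParallelismConfig fields and import paths vary by version, so treat the parameter names below as assumptions and verify against the docs of your installed accelerate:

```python
# Hypothetical sketch — field names are assumptions, check your
# accelerate / DeepSpeed versions before use.
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig

pc = ParallelismConfig(sp_size=4)  # assumed Ulysses sequence-parallel degree
accelerator = Accelerator(parallelism_config=pc)

# prepare() registers the model's attention as UlyssesSPAttentionHF and wraps
# the dataloader to handle sequence sharding and shift_labels (per the text).
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```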
Transformers Trainer accepts TrainingArguments.parallelism_config and automatically handles:
wrapping the dataloader with UlyssesSPDataLoaderAdapter
adding shift_labels and detecting pre-shifted labels
summing per-rank loss sums and valid-token counts across ranks and computing the token-weighted loss
adjusting batch-size and dataloader-length calculations
TRL’s SFTTrainer extends this with optimizations for supervised fine-tuning of long sequences (pre-shifted labels, packing, pad_to_multiple_of equal to sp_size).
If you use a custom loop with accelerate, you must add weighted loss aggregation across SP ranks. Trainer and SFTTrainer already automate this.
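The weighted aggregation that Trainer automates (and a custom loop must reproduce) boils down to weighting each rank by its valid tokens rather than averaging per-rank means. A minimal single-process sketch; in a real loop the two sums would be all-reduced across the SP group:

```python
# Each SP rank computes (sum_of_token_losses, num_valid_tokens) for its shard.
# Naively averaging per-rank mean losses over-weights ranks whose shard is
# mostly padding; the correct global loss weights each rank by its tokens.
def aggregate_sp_loss(per_rank):
    """per_rank: list of (loss_sum, valid_tokens) tuples, one per SP rank."""
    total_loss = sum(loss_sum for loss_sum, _ in per_rank)
    total_tokens = sum(n for _, n in per_rank)
    return total_loss / total_tokens

# Rank 0: 100 valid tokens; rank 1: mostly padding, only 20 valid tokens.
ranks = [(250.0, 100), (10.0, 20)]
weighted = aggregate_sp_loss(ranks)        # 260 / 120 ≈ 2.1667
naive = sum(s / n for s, n in ranks) / 2   # unweighted mean of means = 1.5
assert weighted != naive                   # padding skews the naive average
```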
Practical comparison: Ulysses vs Ring Attention
Parallelism method: Ulysses uses head partitioning; Ring Attention uses ring exchange of KV.
Backend: Ulysses requires DeepSpeed ZeRO; Ring works with PyTorch FSDP2.
Attention support: Ulysses supports FlashAttention 2/3 and SDPA; Ring typically uses SDPA.
Communication: Ulysses does two all-to-all calls per layer (better bisection bandwidth usage), Ring does serialized point-to-point communication.
Constraints: Ulysses requires num_heads >= sp_size; Ring has no such limitation.
So, conclusion? Switching from one to the other is often as simple as tweaking your Accelerate config. Why not test both on your hardware?
Best practices and recommendations
Ensure divisibility: global length max_length must be divisible by sp_size. Use pad_to_multiple_of = sp_size.
Use FlashAttention 2 for Ampere and FlashAttention 3 for Hopper. Avoid FA2 on Blackwell; wait for FA4 when available.
For very large models, combine Ulysses with ZeRO Stage 3 and optimizer/parameter offload if needed.
Enable PYTORCH_ALLOC_CONF=expandable_segments:True to allow larger lengths.
Match sp_size and dp_shard_size to your GPU count: for example, on 4 GPUs use SP=4/DP=1 for max sequences or SP=2/DP=2 for a balance between length and throughput.
If available, enable use_liger_kernel=True and optimizations like FusedLinearCrossEntropy and TiledMLP to save memory on logits and MLPs.
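The divisibility rule from the first recommendation is simple to enforce by hand if your tokenizer or collator doesn't do it for you. A minimal sketch:

```python
def pad_to_multiple(length, multiple):
    """Round a sequence length up to the next multiple (e.g. of sp_size)."""
    return -(-length // multiple) * multiple  # ceiling division

# With sp_size=4, a 250_001-token document must be padded before sharding:
assert pad_to_multiple(250_001, 4) == 250_004
assert pad_to_multiple(96_000, 4) == 96_000  # already divisible, unchanged
```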
Essential benchmarks and loss verification
Experiments with Qwen3-4B on H100 80GB + ZeRO-3 show:
SP=4 on 4 GPUs lets you jump from 8K to 96K tokens with the same memory configuration (peak ~66 GB per GPU at 96K). At 128K there was OOM in that setup.
Throughput: at 64K tokens SP=4 achieved ~13.4K tokens/s, ~3.7x over the 8K baseline on 1 GPU. As sequence length grows, quadratic compute dominates and sequence parallelism becomes more efficient.
Loss equivalence: with matched token budgets (setting gradient accumulation steps GAS = SP where appropriate), training with SP and DP matches on token-normalized loss; residual differences appear in reported logs but not in the canonical objective.
These results imply that Ulysses is a practical tool for training with very long contexts without sacrificing training quality, as long as you tune batching and gradient accumulation correctly.
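The token-budget matching behind the loss-equivalence check is plain arithmetic. A sketch with assumed (illustrative) batch settings:

```python
# Illustrative numbers — micro batch size and sequence length are assumptions.
seq_len, micro_bs = 8192, 1

def tokens_per_step(dp, gas):
    """Tokens consumed per optimizer step: replicas * accumulation * batch * length."""
    return dp * gas * micro_bs * seq_len

# 4 GPUs pure data parallel vs. SP=4 (one replica, so GAS = SP to compensate):
baseline = tokens_per_step(dp=4, gas=1)
sp_run = tokens_per_step(dp=1, gas=4)
assert baseline == sp_run == 32768  # same token budget per optimizer step
```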
Requirements and recommended versions
DeepSpeed >= 0.18.1
accelerate >= 1.12
transformers >= 5.0 for Trainer integration
trl >= 0.18.0 for SFTTrainer
FlashAttention 2/3 depending on GPU
Final thoughts
Ulysses Sequence Parallelism turns a fundamental transformer bottleneck into a practical lever: shard the sequence and split heads inside attention, instead of duplicating the full sequence on every GPU. What does that mean in practical terms? That you can now feasibly train on books, document collections, and long reasoning sessions without needing impossible clusters.
If your project requires extended context, it’s worth trying Ulysses and comparing it to Ring Attention on your hardware. The right choice depends on your network topology, GPU count, and model architecture.