NVIDIA NeMo AutoModel accelerates fine-tuning of Transformers

NVIDIA has introduced NeMo AutoModel, an open layer on top of Transformers v5 that lets you train large-scale Mixture-of-Experts (MoE) generative models without rewriting your code. What sets it apart is that a single import enables concrete optimizations (Expert Parallelism, DeepEP fused dispatch, and TransformerEngine kernels), all through the same from_pretrained() API you already know.

What NeMo AutoModel is and why it matters

NeMo AutoModel is a library inside the NeMo ecosystem that inherits Transformers v5 compatibility and adds high-performance engineering specifically for Mixture-of-Experts models. If you work with MoE, you know the challenge isn't just more parameters: it's how to move tokens between hundreds of experts, how to avoid communication bottlenecks, and how to fit everything into GPU memory.

NeMo AutoModel tackles those points with three key pieces:

Expert Parallelism (EP): standardizes a parallelism dimension for expert shards so each GPU stores only a fraction of the experts' weights.
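To make the sharding idea concrete, here is a minimal pure-Python sketch of how experts could be split across an expert-parallel group. It is not NeMo AutoModel's actual implementation: the function names, the contiguous block layout, and the expert count of 128 are all assumptions chosen for illustration. Only the group size of 8 comes from the EP=8 configuration reported in the article.

```python
# Illustrative sketch only: the names and the contiguous layout are assumptions,
# not NeMo AutoModel's API or its real placement strategy.

def experts_on_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Return the contiguous block of expert indices stored on one EP rank."""
    if num_experts % ep_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank must be in [0, ep_size)")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def owner_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Return the EP rank that holds a given expert, i.e. where its tokens must be sent."""
    per_rank = num_experts // ep_size
    return expert_id // per_rank


NUM_EXPERTS = 128  # hypothetical expert count, not taken from the article
EP_SIZE = 8        # matches the EP=8 configuration in the benchmark

local = experts_on_rank(NUM_EXPERTS, EP_SIZE, rank=3)
print(f"rank 3 holds experts {local.start}..{local.stop - 1} "
      f"({len(local)} of {NUM_EXPERTS})")
print(f"a token routed to expert 50 is dispatched to rank "
      f"{owner_rank(50, NUM_EXPERTS, EP_SIZE)}")
```

Under these assumptions each GPU stores 1/8 of the expert weights. The price is that tokens must be sent to whichever rank owns their selected expert, which is the communication step a fused dispatch is meant to speed up.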

Configuration	Tokens/s per GPU (avg)	Peak memory per GPU	Forward + loss	Backward
v4 (hub)	1,807	61.9 GiB	1,024 ms	1,246 ms
v5 (optimized)	4,583	62.1 GiB	283 ms	611 ms
NeMo AutoModel (EP=8)	15,421	42.5 GiB	109 ms	157 ms
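The relative gains implied by the table can be worked out directly. The short script below copies the reported numbers and computes throughput ratios and the change in peak memory. The dictionary layout and helper names are illustrative choices, not part of any library.

```python
# Figures copied from the benchmark table; the structure and helper names are illustrative.
results = {
    "v4 (hub)":              {"tps": 1807,  "mem_gib": 61.9, "fwd_ms": 1024, "bwd_ms": 1246},
    "v5 (optimized)":        {"tps": 4583,  "mem_gib": 62.1, "fwd_ms": 283,  "bwd_ms": 611},
    "NeMo AutoModel (EP=8)": {"tps": 15421, "mem_gib": 42.5, "fwd_ms": 109,  "bwd_ms": 157},
}


def speedup(new: str, base: str) -> float:
    """Throughput ratio in tokens per second per GPU (higher is better)."""
    return results[new]["tps"] / results[base]["tps"]


def memory_saving(new: str, base: str) -> float:
    """Fractional reduction in peak memory per GPU relative to the baseline."""
    return 1 - results[new]["mem_gib"] / results[base]["mem_gib"]


target = "NeMo AutoModel (EP=8)"
for base in ("v4 (hub)", "v5 (optimized)"):
    print(f"{target} vs {base}: "
          f"{speedup(target, base):.2f}x throughput, "
          f"{memory_saving(target, base):.1%} less peak memory")
```

From the reported figures, the EP=8 configuration delivers roughly 8.5x the throughput of the v4 baseline and about 3.4x that of optimized v5, while using about 32% less peak memory per GPU than v5.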

