DIFF Transformer V2: differential attention for LLMs
DIFF Transformer V2 arrives as a more practical and stable take on differential attention. What changes compared to DIFF V1, and why does it matter if you are training or deploying a large language model? Here I explain it with technical clarity, without losing readability.
What DIFF V2 is and why it was built
DIFF V2 implements the differential operation directly inside attention: it duplicates the query heads to 2h, keeps the key-value (KV) heads at h_kv, and then subtracts head pairs (head 0 minus head 1, head 2 minus head 3, and so on). The subtraction is scaled by a projected per-token-per-head factor lambda, and then reduced back to the original dimension before W_O, so W_O stays identical to the base Transformer.
Why this design? Because it gives you the expressive power of differential attention without paying the cache cost of duplicated values or needing custom attention kernels. In plain terms: you keep decoding speed comparable to a standard Transformer and make the trick usable in real LLMs.
Technical design and key pieces
Duplicated query heads: DIFF V2 uses 2h query heads but still uses h_kv heads for key and value. After the differential op the dimension returns to h * d so W_O remains compatible.
Projected lambda per token and head: lambda is produced by projecting X (the token representations) and passed through a sigmoid to keep its scale bounded. This gives fine-grained control of the Context RMS per head and per token (a small sketch of this projection follows the list).
No per-head RMSNorm in the context: unlike DIFF V1, DIFF V2 removes per-head RMS normalization in the context because projecting lambda and resizing dimensions fixes the numerical issues that motivated that normalization.
Compatibility with existing kernels: by aligning head dimensions between Q, K and V, DIFF V2 avoids needing special kernels and benefits from modern FlashAttention on H-series and B-series GPUs.
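A rough sketch of that lambda path (the bias-free linear and the [batch, pairs, seq, 1] output layout are illustrative choices on my part, not prescribed by the method):

```python
import torch
import torch.nn as nn

class LambdaProj(nn.Module):
    """Per-token, per-head-pair lambda, kept in (0, 1) by the sigmoid."""
    def __init__(self, d_model: int, n_pairs: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_pairs, bias=False)  # illustrative: project X to one lambda per pair

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model] -> lambda: [batch, n_pairs, seq, 1]
        lam = torch.sigmoid(self.proj(x))          # bounded scale per token and head pair
        return lam.transpose(1, 2).unsqueeze(-1)   # shaped to broadcast over attention maps
```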
Conceptual code
A simplified skeleton of the two versions helps you see the difference:
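(What follows is my own PyTorch-flavored reconstruction; shapes, the GQA expansion and the lambda parameterization are simplified, and names like diff_attn_v1 / diff_attn_v2 are illustrative.)

```python
import torch

def diff_attn_v1(q, k, v, lam, head_rmsnorm, scale):
    # DIFF V1, simplified: q, k: [b, 2h, n, d]; v: [b, h, n, 2d] (doubled value dim,
    # hence a bigger KV cache and a wider W_O); lam: a learned scalar per head.
    a1 = torch.softmax(q[:, 0::2] @ k[:, 0::2].transpose(-1, -2) * scale, dim=-1)
    a2 = torch.softmax(q[:, 1::2] @ k[:, 1::2].transpose(-1, -2) * scale, dim=-1)
    ctx = (a1 - lam * a2) @ v          # [b, h, n, 2d]
    return head_rmsnorm(ctx)           # per-head RMSNorm: kept in V1, removed in V2

def diff_attn_v2(q, k, v, lam, scale):
    # DIFF V2, simplified: q: [b, 2h, n, d]; k, v: [b, h, n, d] after expanding the h_kv
    # GQA heads, so each interleaved query pair (2i, 2i+1) shares KV head i.
    # lam: [b, h, n, 1] = sigmoid(proj(X)), per token and head pair.
    a1 = torch.softmax(q[:, 0::2] @ k.transpose(-1, -2) * scale, dim=-1)
    a2 = torch.softmax(q[:, 1::2] @ k.transpose(-1, -2) * scale, dim=-1)
    return (a1 - lam * a2) @ v         # [b, h, n, d]: back to h*d before W_O, no extra norm
```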
Notice two points: in V2 the head pairs are interleaved (0::2 and 1::2) so that each pair shares the same KV, and lambda is per token and head, passed through a sigmoid.
Context RMS and numerical stability
In standard softmax attention, if v_j are assumed to have RMS 1 and be uncorrelated, the Context RMS ends up in the range [1/sqrt(n), 1). What does that mean for you? If attention flattens toward uniform distributions over long sequences, the magnitude drops to 1/sqrt(n).
In DIFF V1 they tried to fix this with a per-head RMSNorm, but that forces multiplication by very large factors when n is large (for example sqrt(8192) is about 90), which produces large gradients and numerical blow-ups.
DIFF V2 fixes it differently: by projecting lambda per token and head and removing per-head RMSNorm, the gradient scale becomes comparable to a standard Transformer. In practice this reduces gradient spikes and activation outliers when you train with large learning rates.
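A quick numeric sanity check of those magnitudes (plain NumPy with toy sizes; it only illustrates the scale argument, not the actual training dynamics):

```python
import numpy as np

n, d = 8192, 128
rng = np.random.default_rng(0)
v = rng.standard_normal((n, d))          # values with RMS ~ 1, roughly uncorrelated

uniform = np.full(n, 1.0 / n)            # attention flattened toward uniform
peaked = np.zeros(n); peaked[0] = 1.0    # attention concentrated on a single token

rms = lambda x: float(np.sqrt((x ** 2).mean()))
print(rms(uniform @ v))   # ~ 1/sqrt(8192) ~= 0.011 (lower end of the Context RMS range)
print(rms(peaked @ v))    # ~ 1.0                   (upper end)
print(np.sqrt(n))         # ~ 90.5: the rescaling a per-head RMSNorm applies in the flat case
```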
Empirical results and training behavior
The authors run production-scale pretraining (trillions of tokens, dense models and a 30A3 MoE) with large learning rates (6e-4 to 1e-3). What did they see?
Noticeable reduction in language modeling loss compared to the Transformer: a gap of 0.02 to 0.03.
Fewer gradient spikes and activation outliers, especially with large learning rates where the standard Transformer can become unstable.
Throughput overhead in pretraining is negligible if you use FlashAttention on H-series and B-series GPUs.
They also recommend, for long-sequence prefilling, combining DIFF V2 with techniques like YOCO that reduce prefilling complexity to linear time.
Costs, parameters, and theoretical comparison
If you compare DIFF V2 with a Transformer that naively has 2h real heads, both have the same attention kernel cost, but DIFF V2 requires fewer parameters in the output projection W_O. With current GQA (grouped query attention) setups, you can save roughly 25% of the attention module parameters, because W_Q and W_O dominate memory and parameters.
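Back-of-the-envelope arithmetic behind that figure, with a hypothetical GQA configuration (exact numbers depend on the model; this only shows why the saving lands near a quarter):

```python
d_model, h, d, h_kv = 4096, 32, 128, 8       # hypothetical GQA config, h * d == d_model

w_q  = d_model * (2 * h * d)                 # 2h query heads in both designs
w_kv = 2 * d_model * (h_kv * d)              # K and V projections, small under GQA
w_o_naive = (2 * h * d) * d_model            # naive 2h-head Transformer: W_O sees 2h * d
w_o_diff  = (h * d) * d_model                # DIFF V2: differential op reduces back to h * d

naive = w_q + w_kv + w_o_naive
diff  = w_q + w_kv + w_o_diff
print(1 - diff / naive)                      # ~0.22 here, approaching 25% as the KV share shrinks
```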
Also, DIFF V2 makes practical sense even if your goal is not just minimal loss: better training stability and tighter control of outliers are operational wins in their own right.
Key ablations and common mistakes
The authors report several ablation tests that show what not to do:
Head pairing mistake: splitting attention halves as attn[:, :nh//2] and attn[:, nh//2:] is incorrect. Differential heads must be interleaved as attn[:, 0::2] and attn[:, 1::2] to share the same KV. The wrong implementation causes instability and higher loss (see the snippet after this list).
Omitting lambda in the subtraction: using attn1 - attn2 without scaling leads to a Context RMS that’s too small initially.
Not applying sigmoid to lambda: using the projected lambda without regularization can leave Context RMS unbounded and cause instability.
These three ablations harm training; the first one is the most damaging for stability.
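A minimal illustration of the indexing pitfall with toy tensors (variable names and sizes are mine; attn stands for the stack of 2h softmax maps):

```python
import torch

b, nh, n, d = 1, 8, 16, 32                           # nh = 2h duplicated query heads (toy sizes)
attn = torch.softmax(torch.randn(b, nh, n, n), dim=-1)
v = torch.randn(b, nh // 2, n, d)                    # one value head per differential pair
lam = torch.sigmoid(torch.randn(b, nh // 2, n, 1))   # projected per token/head, through sigmoid

# Correct: interleave, so the paired query heads (2i, 2i + 1) share the same KV head.
out = (attn[:, 0::2] - lam * attn[:, 1::2]) @ v

# Incorrect: a half-split pairs head i with head i + nh//2, crossing KV groups.
out_bad = (attn[:, :nh // 2] - lam * attn[:, nh // 2:]) @ v
```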
Compatibility with sparse attention and practical considerations
DIFF V2 is compatible with sparse attention schemes. The practical challenge is choosing KV blocks when the differentiated heads form a large GQA group. Possible strategies:
Select blocks based on the average logits between the differentiated heads, or
handle the two classes of heads separately during selection.
Conceptually there’s no fundamental barrier—only an adjustment in block-selection heuristics to keep acceleration.
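As a hedged sketch of the first strategy (all names are mine, and a real block-sparse kernel would score pooled key blocks instead of materializing the full logits):

```python
import torch

def select_kv_blocks(q, k, block_size, top_k):
    """Pick top_k KV blocks per differential pair by averaging the paired heads' scores."""
    b, two_h, n, d = q.shape                                  # q, k: [b, 2h, n, d], k expanded per pair
    scores = q @ k.transpose(-1, -2) / d ** 0.5               # [b, 2h, n, n]
    pair_scores = 0.5 * (scores[:, 0::2] + scores[:, 1::2])   # average the two heads of each pair
    n_blocks = n // block_size
    block_scores = pair_scores.view(b, two_h // 2, n, n_blocks, block_size).amax(-1)
    return block_scores.topk(top_k, dim=-1).indices           # one shared block choice per pair
```

The second strategy would instead run the selection once per head class, at the cost of keeping two block masks per pair.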
Recommendations for ML and ML infra teams
If you use modern FlashAttention on H-series or B-series GPUs, DIFF V2 has small overhead in pretraining.
At decoding, DIFF V2 avoids duplicating the KV-cache load, so it keeps latency and memory use comparable to the base Transformer.
If you plan to train with high learning rates or at production scale, DIFF V2 can give gains in stability and smaller outlier magnitudes.
Be careful when implementing head indexing and the projection of lambda (use sigmoid).
DIFF V2 is not just a theoretical curiosity: it’s a practical reformulation of differential attention that prioritizes stability, compatibility with existing kernels, and decoding efficiency.