DeepSeek-V4 arrives with a clear promise: let agents really use context windows of up to 1 million tokens without breaking in the middle of a task.
Sounds like science fiction? Not quite. The trick isn't just opening a giant window; it's lowering the cost per token so each inference pass is actually practical.
What makes the architecture different
The key question is simple: can your agent keep a long chain of actions and outcomes without running out of memory or compute time? DeepSeek-V4 tackles that problem from the ground up.
The main novelty is alternating two attention paths across the layers: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Each has a different goal, and together they cut both FLOPs per token and KV cache size.
CSA compresses the sequence 4x using softmax-driven pooling and a learned positional bias. Over that already-compressed sequence runs a “lightning indexer” (executed in FP4) that selects the top-k blocks per query. There’s also a sliding-window branch for the most recent tokens.
HCA compresses massively, 128x, and then applies dense attention over that compressed stream. Because the compressed sequence is so short, dense attention becomes cheap.
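The two compression paths can be caricatured in a few lines of NumPy. Everything here is an illustrative stand-in, not the real learned modules: the pooling scorer is just block means, the indexer is a dot product, the dimensions are invented, and the toy "HCA" path uses 32x instead of 128x so the example sequence stays non-trivial.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress(seq, rate):
    """Softmax-weighted pooling of consecutive token blocks.
    The scorer here (block means) is a stand-in for a learned module."""
    n, d = seq.shape
    n_blocks = n // rate
    blocks = seq[: n_blocks * rate].reshape(n_blocks, rate, d)
    w = softmax(blocks.mean(axis=-1), axis=-1)[..., None]  # (n_blocks, rate, 1)
    return (w * blocks).sum(axis=1)                        # (n_blocks, d)

def topk_blocks(query, compressed, k):
    """Toy 'lightning indexer': rank compressed blocks against a query
    and keep only the top-k for full attention."""
    scores = compressed @ query
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
seq = rng.standard_normal((64, 8))   # 64 toy tokens, dim 8
csa_stream = compress(seq, 4)        # CSA-style 4x compression -> 16 blocks
hca_stream = compress(seq, 32)       # HCA-style heavy compression (128x in V4)
picked = topk_blocks(rng.standard_normal(8), csa_stream, k=4)
```

The intuition to take away: after heavy compression, even dense attention over the short `hca_stream` is cheap, while the CSA path keeps more detail but only attends to a handful of selected blocks.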
Alternating CSA and HCA avoids wasting capacity. In the 61-layer V4-Pro stack, layers 0 and 1 are HCA, layers 2 to 60 alternate CSA and HCA, and the final block runs only the sliding window.
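Reading that layout literally (and treating the final sliding-window block as appended after the 61 alternating layers, which is one plausible interpretation), the schedule could be sketched as:

```python
def v4_pro_schedule(n_layers=61):
    """Hypothetical reconstruction of the V4-Pro stack described above:
    layers 0-1 are HCA, layers 2-60 alternate CSA/HCA, and a final
    sliding-window-only block closes the stack."""
    layers = ["HCA", "HCA"]
    layers += ["CSA" if (i - 2) % 2 == 0 else "HCA" for i in range(2, n_layers)]
    layers.append("sliding-window")
    return layers

schedule = v4_pro_schedule()
```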
To squeeze more memory, most of the KV is stored in FP8, with BF16 reserved for RoPE dimensions. The indexer runs in FP4. Those quantization choices, together with the compression rates, produce notable reductions in the KV cache.
Practical result: V4-Pro uses about 27% of the FLOPs per token versus V3.2 and only 10% of the KV cache. V4-Flash goes even lower: 10% of FLOPs and 7% of KV cache. Compared to an established architecture like grouped query attention with 8 heads in bfloat16, V4 can require roughly 2% of cache memory.
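A back-of-the-envelope check shows how compression rate and bytes-per-element multiply into those savings. The head counts and latent dimensions below are guesses, not the real V4 layout, so the ratio lands in the right ballpark (single-digit percent) rather than exactly at the reported figures:

```python
def kv_bytes(tokens, kv_heads, head_dim, bytes_per_elem, compression=1):
    """KV-cache size: K and V each store tokens/compression entries of
    kv_heads * head_dim elements."""
    return 2 * (tokens // compression) * kv_heads * head_dim * bytes_per_elem

ctx = 1_000_000
# Baseline: grouped-query attention, 8 KV heads of dim 128, bf16 (2 bytes).
baseline = kv_bytes(ctx, kv_heads=8, head_dim=128, bytes_per_elem=2)

# Guessed V4-style budget: a single 512-dim latent stream in FP8 (1 byte),
# one copy compressed 4x (CSA path) and one compressed 128x (HCA path).
v4_guess = (kv_bytes(ctx, 1, 512, 1, compression=4)
            + kv_bytes(ctx, 1, 512, 1, compression=128))

ratio = v4_guess / baseline   # a single-digit percentage of the baseline
```

The real scheme is more involved (top-k selection, BF16 RoPE dimensions, per-layer differences), but the multiplicative structure is the point: 4x fewer entries times 2x fewer bytes already gets you an order of magnitude.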
Beyond attention, the feed-forward layers use DeepSeekMoE and traditional residual connections are replaced by manifold-constrained hyper-connections (mHC). In short: it’s not just compressed attention—the whole stack is designed for long contexts.
Post-training decisions and infrastructure for agents
Architectural efficiency is necessary but not sufficient for agents. DeepSeek-V4 includes several post-training and infra decisions aimed at tool-driven workflows.
First, reasoning-trace management changes. In V3.2, traces were kept between rounds of tool calls but discarded when a new user message arrived; for long tasks, that forced the model to reconstruct state. V4 preserves reasoning content across user messages as long as tool calls continue. For normal conversations without tools, the old behavior stays: reasoning is cleared at the start of each turn.
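The retention rule can be sketched against a hypothetical OpenAI-style message list (the schema and the `reasoning` role are assumptions for illustration, not DeepSeek's actual wire format):

```python
def on_new_user_message(history):
    """Sketch of the V4-style retention rule: if the conversation so far
    contains tool calls, keep reasoning entries across the new user turn;
    otherwise clear them, matching the old behavior for plain chat."""
    if any(m["role"] == "tool" for m in history):
        return history
    return [m for m in history if m["role"] != "reasoning"]

plain = [
    {"role": "user", "content": "hi"},
    {"role": "reasoning", "content": "scratch work"},
    {"role": "assistant", "content": "hello"},
]
agentic = plain + [{"role": "tool", "content": "ls output"}]
```

With `plain`, the reasoning entry is dropped when the next user message arrives; with `agentic`, the presence of a tool result keeps it alive.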
Second, V4 introduces the special token |DSML| and a tool-call format based on XML. Why? Many tool-calls that embed JSON-as-string suffer escape errors when the model generates nested quotes. The XML scheme cleanly separates raw text parameters from structured parameters, for example string="true" for strings to pass as-is and string="false" for parameters sent as JSON. This reduces errors with numbers, booleans and nested structures.
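Here is what that separation could look like using Python's standard xml.etree module. Only the string="true"/"false" convention comes from the release notes; the tag names and overall shape are invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical tool call in the XML scheme: raw-text parameters pass
# through untouched, structured parameters are JSON-decoded.
call = ET.Element("tool_call", name="write_file")
path = ET.SubElement(call, "param", name="path", string="true")
path.text = "notes/draft.md"
content = ET.SubElement(call, "param", name="content", string="true")
content.text = 'He said "hello" and left.'   # nested quotes: no escaping needed
options = ET.SubElement(call, "param", name="options", string="false")
options.text = json.dumps({"overwrite": True, "retries": 3})

def decode_params(call):
    """Turn <param> elements into Python arguments."""
    args = {}
    for p in call.iter("param"):
        raw = p.text or ""
        args[p.get("name")] = raw if p.get("string") == "true" else json.loads(raw)
    return args

args = decode_params(call)
```

The payoff is that the quote-heavy `content` string never passes through a JSON encoder, while `options` still arrives as real booleans and numbers.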
Third, the agent behavior was fine-tuned with RL in real environments. For that DeepSeek built DSec, a Rust platform that exposes four execution substrates behind a Python SDK: function calls, containers, microVMs (Firecracker) and full VMs (QEMU). Some key DSec features:
Fast image loading with layered storage on 3FS, so RL rollouts don’t wait on container startup.
Safe trajectory replay under preemptions, to continue training without repeating costly tool calls.
Uniform API across substrates, so you can reuse harnesses and switch between lightweight runs and full VM executions.
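A uniform substrate API of that kind might look like the sketch below. The class and method names are invented, and only a local-subprocess substrate is implemented so the example stays runnable; a container, microVM or full-VM backend would implement the same `run` method:

```python
import subprocess
from abc import ABC, abstractmethod

class Sandbox(ABC):
    """One interface, many substrates (hypothetical API)."""
    @abstractmethod
    def run(self, cmd: str) -> str: ...

class LocalSandbox(Sandbox):
    """Lightweight substrate: plain subprocesses on the host."""
    def run(self, cmd: str) -> str:
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return out.stdout

def harness(sandbox: Sandbox) -> str:
    """The agent harness only sees the Sandbox interface, so swapping the
    local substrate for a VM-backed one requires no harness changes."""
    return sandbox.run("echo ok").strip()
```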
Those infra decisions are what make it possible to train agents at scale in realistic scenarios.
Performance: how well does it behave as an agent?
The numbers show V4 isn’t the absolute champion in general reasoning, but it stands out in agentic and tool-usage tasks. Key points:
Terminal Bench 2.0: V4-Pro-Max 67.9, above GLM-5.1 (63.5) and K2.6 (66.7), behind GPT-5.4-xHigh (75.1) and Gemini-3.1-Pro (68.5).
SWE Verified: 80.6 solved, nearly matching Opus-4.6-Max (80.8) and tying Gemini-3.1-Pro (80.6).
MCPAtlas Public: 73.6, second place after Opus-4.6-Max (73.8).
Toolathlon: 51.8, above K2.6 (50.0), GLM-5.1 (40.7) and Gemini-3.1-Pro (48.8).
In an internal R&D benchmark with 30 coding tasks (PyTorch, CUDA, Rust, C++), V4-Pro-Max reaches a 67% pass rate; Sonnet 4.5 scored 47% and Opus 4.5 70%.
In an internal survey of 85 developers who used V4-Pro, 52% considered it ready to replace their primary model and another 39% leaned toward that view.
For long-context retrieval, the MRCR 8-needle metric stays above 0.82 up to 256K tokens and remains at 0.59 at 1M tokens. That matters if your agent needs to look up old facts in very long conversations.
Checkpoints, reasoning modes and usage recommendations
There are four checkpoints available on Hugging Face Hub. The instruct variants use FP4 for MoE expert weights and FP8 for everything else. Base models are FP8 across the board.
The instruct models support three reasoning modes:
Non-think: quick reply, no explicit chain-of-thought.
Think High: explicit reasoning in <think> blocks with moderate effort.
Think Max: maximum reasoning investment, requires at least 384K tokens of context.
Recommended params: temperature=1.0, top_p=1.0 for all modes.
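In practice that could translate into a payload builder like the one below. The checkpoint name and the `reasoning_mode` field are placeholders, not the documented API, so check the model card for the actual switch:

```python
VALID_MODES = ("non-think", "think-high", "think-max")

def build_request(messages, mode="non-think"):
    """Chat payload with the recommended sampling parameters."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "model": "deepseek-v4-pro",   # placeholder checkpoint id
        "messages": messages,
        "temperature": 1.0,           # recommended for all modes
        "top_p": 1.0,
        "reasoning_mode": mode,       # hypothetical field name
    }

payload = build_request([{"role": "user", "content": "hi"}], mode="think-max")
```

Remember the Think Max constraint from above: requests in that mode should be sent to a deployment configured with at least 384K tokens of context.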
What does this mean for your agent project?
If you’re building agents that chain many tool calls, V4 changes the rules because it lowers cost per token and preserves reasoning traces across turns when tools are involved. Practically, you can run terminal sessions with hundreds of commands, long browsing sessions or complex tool routes without the model losing state or running out of memory.
Now, 1M tokens of context is capacity, not a performance guarantee. What matters is how much it costs to process each token at that depth: V4 reduces both FLOPs and KV memory, and that makes the window usable on real hardware.
The open question is integration: will tool ecosystems adopt the |DSML| token and the XML scheme? Will these interleaved thinking gains carry over to agent frameworks outside the lab? If you work with tool pipelines, it’s worth prototyping the XML call format and testing the long window in real scenarios.