DeepSeek-V4 arrives with a clear promise: let agents really use context windows of up to 1 million tokens without breaking in the middle of a task.
Sounds like science fiction? Not quite. The trick isn't just opening a giant window; it's lowering the cost per token so each inference pass is actually practical.
What makes the architecture different
The key question is simple: can your agent keep a long chain of actions and outcomes without running out of memory or compute time? DeepSeek-V4 tackles that problem from the ground up.
The main novelty is alternating two attention paths across the layers: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Each has a different goal, and together they cut both FLOPs per token and KV cache size.
CSA compresses the sequence 4x using softmax-driven pooling and a learned positional bias. Over that already-compressed sequence runs a “lightning indexer” (executed in FP4) that selects the top-k blocks per query. There’s also a sliding-window branch for the most recent tokens.
HCA compresses massively, 128x, and then applies dense attention over that compressed stream. Because the compressed sequence is so short, dense attention becomes cheap.
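The two compression paths can be caricatured in a few lines of NumPy. Everything here is an illustrative stand-in, not the real learned modules: the pooling scorer is just block means, the indexer is a dot product, the dimensions are invented, and the toy "HCA" path uses 32x instead of 128x so the example sequence stays non-trivial.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress(seq, rate):
    """Softmax-weighted pooling of consecutive token blocks.
    The scorer here (block means) is a stand-in for a learned module."""
    n, d = seq.shape
    n_blocks = n // rate
    blocks = seq[: n_blocks * rate].reshape(n_blocks, rate, d)
    w = softmax(blocks.mean(axis=-1), axis=-1)[..., None]  # (n_blocks, rate, 1)
    return (w * blocks).sum(axis=1)                        # (n_blocks, d)

def topk_blocks(query, compressed, k):
    """Toy 'lightning indexer': rank compressed blocks against a query
    and keep only the top-k for full attention."""
    scores = compressed @ query
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
seq = rng.standard_normal((64, 8))   # 64 toy tokens, dim 8
csa_stream = compress(seq, 4)        # CSA-style 4x compression -> 16 blocks
hca_stream = compress(seq, 32)       # HCA-style heavy compression (128x in V4)
picked = topk_blocks(rng.standard_normal(8), csa_stream, k=4)
```

The intuition to take away: after heavy compression, even dense attention over the short `hca_stream` is cheap, while the CSA path keeps more detail but only attends to a handful of selected blocks.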
Alternating CSA and HCA avoids wasting capacity. In the 61-layer V4-Pro stack, layers 0 and 1 are HCA, layers 2 to 60 alternate CSA and HCA, and the final block runs only the sliding window.
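Reading that layout literally (and treating the final sliding-window block as appended after the 61 alternating layers, which is one plausible interpretation), the schedule could be sketched as:

```python
def v4_pro_schedule(n_layers=61):
    """Hypothetical reconstruction of the V4-Pro stack described above:
    layers 0-1 are HCA, layers 2-60 alternate CSA/HCA, and a final
    sliding-window-only block closes the stack."""
    layers = ["HCA", "HCA"]
    layers += ["CSA" if (i - 2) % 2 == 0 else "HCA" for i in range(2, n_layers)]
    layers.append("sliding-window")
    return layers

schedule = v4_pro_schedule()
```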
To squeeze more memory, most of the KV is stored in FP8, with BF16 reserved for RoPE dimensions. The indexer runs in FP4. Those quantization choices, together with the compression rates, produce notable reductions in the KV cache.
Practical result: V4-Pro uses about 27% of the FLOPs per token versus V3.2 and only 10% of the KV cache. V4-Flash goes even lower: 10% of FLOPs and 7% of KV cache. Compared to an established architecture like grouped query attention with 8 heads in bfloat16, V4 can require roughly 2% of cache memory.
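A back-of-the-envelope check shows how compression rate and bytes-per-element multiply into those savings. The head counts and latent dimensions below are guesses, not the real V4 layout, so the ratio lands in the right ballpark (single-digit percent) rather than exactly at the reported figures:

```python
def kv_bytes(tokens, kv_heads, head_dim, bytes_per_elem, compression=1):
    """KV-cache size: K and V each store tokens/compression entries of
    kv_heads * head_dim elements."""
    return 2 * (tokens // compression) * kv_heads * head_dim * bytes_per_elem

ctx = 1_000_000
# Baseline: grouped-query attention, 8 KV heads of dim 128, bf16 (2 bytes).
baseline = kv_bytes(ctx, kv_heads=8, head_dim=128, bytes_per_elem=2)

# Guessed V4-style budget: a single 512-dim latent stream in FP8 (1 byte),
# one copy compressed 4x (CSA path) and one compressed 128x (HCA path).
v4_guess = (kv_bytes(ctx, 1, 512, 1, compression=4)
            + kv_bytes(ctx, 1, 512, 1, compression=128))

ratio = v4_guess / baseline   # a single-digit percentage of the baseline
```

The real scheme is more involved (top-k selection, BF16 RoPE dimensions, per-layer differences), but the multiplicative structure is the point: 4x fewer entries times 2x fewer bytes already gets you an order of magnitude.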
Beyond attention, the feed-forward layers use DeepSeekMoE and traditional residual connections are replaced by manifold-constrained hyper-connections (mHC). In short: it’s not just compressed attention—the whole stack is designed for long contexts.
Post-training decisions and infrastructure for agents
Architectural efficiency is necessary but not sufficient for agents. DeepSeek-V4 includes several post-training and infra decisions aimed at tool-driven workflows.
First, reasoning-trace management changes. In V3.2, traces were kept between rounds of tool calls but discarded when a new user message arrived; for long tasks, that forced the model to reconstruct state. V4 preserves reasoning content across user messages as long as tool calls continue. For normal conversations without tools, the old behavior stays: reasoning is cleared at the start of each turn.
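The retention rule can be sketched against a hypothetical OpenAI-style message list (the schema and the `reasoning` role are assumptions for illustration, not DeepSeek's actual wire format):

```python
def on_new_user_message(history):
    """Sketch of the V4-style retention rule: if the conversation so far
    contains tool calls, keep reasoning entries across the new user turn;
    otherwise clear them, matching the old behavior for plain chat."""
    if any(m["role"] == "tool" for m in history):
        return history
    return [m for m in history if m["role"] != "reasoning"]

plain = [
    {"role": "user", "content": "hi"},
    {"role": "reasoning", "content": "scratch work"},
    {"role": "assistant", "content": "hello"},
]
agentic = plain + [{"role": "tool", "content": "ls output"}]
```

With `plain`, the reasoning entry is dropped when the next user message arrives; with `agentic`, the presence of a tool result keeps it alive.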
Second, V4 introduces the special token |DSML| and a tool-call format based on XML. Why? Many tool-calls that embed JSON-as-string suffer escape errors when the model generates nested quotes. The XML scheme cleanly separates raw text parameters from structured parameters, for example string="true" for strings to pass as-is and string="false" for parameters sent as JSON. This reduces errors with numbers, booleans and nested structures.
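Here is what that separation could look like using Python's standard xml.etree module. Only the string="true"/"false" convention comes from the release notes; the tag names and overall shape are invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical tool call in the XML scheme: raw-text parameters pass
# through untouched, structured parameters are JSON-decoded.
call = ET.Element("tool_call", name="write_file")
path = ET.SubElement(call, "param", name="path", string="true")
path.text = "notes/draft.md"
content = ET.SubElement(call, "param", name="content", string="true")
content.text = 'He said "hello" and left.'   # nested quotes: no escaping needed
options = ET.SubElement(call, "param", name="options", string="false")
options.text = json.dumps({"overwrite": True, "retries": 3})

def decode_params(call):
    """Turn <param> elements into Python arguments."""
    args = {}
    for p in call.iter("param"):
        raw = p.text or ""
        args[p.get("name")] = raw if p.get("string") == "true" else json.loads(raw)
    return args

args = decode_params(call)
```

The payoff is that the quote-heavy `content` string never passes through a JSON encoder, while `options` still arrives as real booleans and numbers.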
Third, the agent behavior was fine-tuned with RL in real environments. For that DeepSeek built DSec, a Rust platform that exposes four execution substrates behind a Python SDK: function calls, containers, microVMs (Firecracker) and full VMs (QEMU). Some key DSec features:
Fast image loading with layered storage on 3FS, so RL rollouts don’t wait on container startup.
Safe trajectory replay under preemptions, to continue training without repeating costly tool calls.
Uniform API across substrates, so you can reuse harnesses and switch between lightweight runs and full VM executions.
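A uniform substrate API of that kind might look like the sketch below. The class and method names are invented, and only a local-subprocess substrate is implemented so the example stays runnable; a container, microVM or full-VM backend would implement the same `run` method:

```python
import subprocess
from abc import ABC, abstractmethod

class Sandbox(ABC):
    """One interface, many substrates (hypothetical API)."""
    @abstractmethod
    def run(self, cmd: str) -> str: ...

class LocalSandbox(Sandbox):
    """Lightweight substrate: plain subprocesses on the host."""
    def run(self, cmd: str) -> str:
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return out.stdout

def harness(sandbox: Sandbox) -> str:
    """The agent harness only sees the Sandbox interface, so swapping the
    local substrate for a VM-backed one requires no harness changes."""
    return sandbox.run("echo ok").strip()
```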
Those infra decisions are what make it possible to train agents at scale in realistic scenarios.
Performance: how well does it behave as an agent?
The numbers show V4 isn’t the absolute champion in general reasoning, but it stands out in agentic and tool-usage tasks. Key points:
Terminal Bench 2.0: V4-Pro-Max 67.9, above GLM-5.1 (63.5) and K2.6 (66.7), behind GPT-5.4-xHigh (75.1) and Gemini-3.1-Pro (68.5).
SWE Verified: 80.6 solved, nearly matching Opus-4.6-Max (80.8) and tying Gemini-3.1-Pro (80.6).
MCPAtlas Public: 73.6, second place after Opus-4.6-Max (73.8).
Toolathlon: 51.8, above K2.6 (50.0), GLM-5.1 (40.7) and Gemini-3.1-Pro (48.8).
In an internal R&D benchmark with 30 coding tasks (PyTorch, CUDA, Rust, C++), V4-Pro-Max reaches a 67% pass rate; Sonnet 4.5 scored 47% and Opus 4.5 70%.
In an internal survey of 85 developers who used V4-Pro, 52% considered it ready to replace their primary model and another 39% leaned toward that view.
For long-context retrieval, the MRCR 8-needle metric stays above 0.82 up to 256K tokens and remains at 0.59 at 1M tokens. That matters if your agent needs to look up old facts in very long conversations.
Checkpoints, reasoning modes and usage recommendations
There are four checkpoints available on Hugging Face Hub. The instruct variants use FP4 for MoE expert weights and FP8 for everything else. Base models are FP8 across the board.
The instruct models support three reasoning modes:
Non-think: quick reply, no explicit chain-of-thought.
Think High: explicit reasoning in <think> blocks with moderate effort.
Think Max: maximum reasoning investment, requires at least 384K tokens of context.
Recommended params: temperature=1.0, top_p=1.0 for all modes.
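In practice that could translate into a payload builder like the one below. The checkpoint name and the `reasoning_mode` field are placeholders, not the documented API, so check the model card for the actual switch:

```python
VALID_MODES = ("non-think", "think-high", "think-max")

def build_request(messages, mode="non-think"):
    """Chat payload with the recommended sampling parameters."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "model": "deepseek-v4-pro",   # placeholder checkpoint id
        "messages": messages,
        "temperature": 1.0,           # recommended for all modes
        "top_p": 1.0,
        "reasoning_mode": mode,       # hypothetical field name
    }

payload = build_request([{"role": "user", "content": "hi"}], mode="think-max")
```

Remember the Think Max constraint from above: requests in that mode should be sent to a deployment configured with at least 384K tokens of context.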
What does this mean for your agent project?
If you’re building agents that chain many tool calls, V4 changes the rules because it lowers cost per token and preserves reasoning traces across turns when tools are involved. Practically, you can run terminal sessions with hundreds of commands, long browsing sessions or complex tool routes without the model losing state or running out of memory.
Now, 1M tokens of context is capacity, not a performance guarantee. What matters is how much it costs to process each token at that depth: V4 reduces both FLOPs and KV memory, and that makes the window usable on real hardware.
The open question is integration: will tool ecosystems adopt the |DSML| token and the XML scheme? Will these interleaved thinking gains carry over to agent frameworks outside the lab? If you work with tool pipelines, it’s worth prototyping the XML call format and testing the long window in real scenarios.