GLM-5.2 reaches 1M context and improves on code tasks | Keryc
GLM-5.2 arrives with an ambitious promise: not only to accept 1 million tokens, but to sustain long, complex software engineering work without falling apart. What does that mean in practice? That the model was designed and trained to follow coding-agent, performance-optimization, and debugging trajectories that run for hours, even tens of hours.
What GLM-5.2 brings
Solid 1M-token context: this isn’t just a headline number. Z.ai extended training with coding-agent scenarios so quality stays stable across long, noisy trajectories.
Effort control for code: you can choose effort levels (for example High or Max) to balance accuracy, latency, and cost. Need speed or maximum reliability? Now you choose.
MIT license and openness: GLM-5.2 is fully open source under MIT, with no regional limits. Weights are on HuggingFace and ModelScope, and it supports frameworks like transformers, vLLM, SGLang, xLLM, and ktransformers.
Key architecture: IndexShare and improved MTP
The big technical idea behind the compute savings is called IndexShare. Instead of computing a separate indexer at every sparse-attention layer, GLM-5.2 computes one indexer and reuses it across each group of four layers. That reduces FLOPs per token by 2.9x when operating at 1M context.
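To make the sharing pattern concrete, here is a minimal sketch in plain Python. It assumes that "every four layers" means groups of four consecutive layers, and that the indexer's job is to pick the top-k positions each sparse-attention layer attends to. The names select_topk, run_layers, and scores_for_layer are hypothetical and are not taken from the GLM-5.2 code.

```python
from typing import Callable, List, Tuple

GROUP_SIZE = 4  # from the article: one indexer result reused across four layers


def select_topk(scores: List[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    num_layers: int,
    scores_for_layer: Callable[[int], List[float]],  # hypothetical stand-in for the indexer
    k: int,
    share: bool,
) -> Tuple[List[List[int]], int]:
    """Return (selected positions per layer, number of indexer evaluations)."""
    selections: List[List[int]] = []
    indexer_calls = 0
    cached: List[int] = []
    for layer in range(num_layers):
        # Without sharing, every layer runs its own indexer.
        # With sharing, only the first layer of each group does.
        if not share or layer % GROUP_SIZE == 0:
            cached = select_topk(scores_for_layer(layer), k)
            indexer_calls += 1
        selections.append(cached)
    return selections, indexer_calls
```

With eight layers, the shared variant evaluates the indexer twice instead of eight times. This toy count only illustrates the mechanism; it does not reproduce the 2.9x FLOPs figure, which depends on the real model's dimensions.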
The MTP (Multi-Token Prediction) layer was also redesigned for more efficient speculative decoding. IndexShare is applied in MTP as well, and KV and top-k are shared across steps, which removes the discrepancies between training and inference that previously hurt acceptance rates. Further changes include rejection sampling and an end-to-end TV-style loss. The result: MTP acceptance length increases by up to 20% in their ablations.
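Why would a "TV-style" loss help acceptance? The article does not expand the abbreviation; the sketch below assumes it means total variation distance. Under standard speculative rejection sampling, a drafted token is accepted with probability equal to the sum over tokens of min(target, draft), which equals one minus the total variation distance between the two distributions. The function names are illustrative, and the independence assumption in expected_accepted_length is a simplification.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def acceptance_probability(target: Sequence[float], draft: Sequence[float]) -> float:
    """Chance that a token sampled from `draft` survives the rejection test
    accept-with-probability min(1, target/draft). Equals 1 - TV(target, draft)."""
    return sum(min(p, q) for p, q in zip(target, draft))


def expected_accepted_length(alpha: float, num_draft_steps: int) -> float:
    """Expected number of accepted draft tokens, assuming each step is accepted
    independently with probability alpha and drafting stops at the first rejection."""
    return sum(alpha ** i for i in range(1, num_draft_steps + 1))
```

Under this reading, pushing the draft distribution closer to the target in total variation directly raises the per-step acceptance probability, and the expected accepted length grows with it.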
Why does this matter? (short explanation)
In ultra-long context inference, the bottleneck stops being only matrix multiplication and becomes the KV-cache capacity, kernel overheads for long sequences, and CPU handling. Reducing FLOPs helps, but if you don’t improve cache management and kernels, you can’t practically scale to 1M. IndexShare attacks the computational cost; the infra optimizations attack the rest.
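A back-of-the-envelope calculation shows why KV-cache capacity becomes the limit. The sketch below uses the usual formula for a plain grouped-query attention layout; GLM-5.2's actual cache layout is not described in the article and may be far more compact, and every dimension in the example is hypothetical.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V at every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical dimensions, chosen only to illustrate the order of magnitude.
example = kv_cache_bytes(num_layers=60, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{example / 1e9:.2f} GB for a single 1M-token sequence")
```

With these made-up dimensions, one 1M-token sequence needs about 246 GB of cache, more than a single accelerator typically holds, regardless of how cheap the attention FLOPs are.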
Performance on long-range benchmarks
GLM-5.2 shines on long tasks and coding benchmarks, ranking as the top open-source model on three long-horizon benchmarks:
FrontierSWE: 74.4 dominance, just 1% behind Opus 4.8 and above GPT-5.5 in that evaluation.
PostTrainBench: 34.3, outperforming competing open versions and sitting behind Opus 4.8.
SWE-Marathon: with room to improve, GLM-5.2 reaches 13.0, second among large open models.
On standard code benchmarks it also jumps significantly from GLM-5.1: Terminal-Bench 2.1 rises to 81.0 (from 63.5) and SWE-bench Pro to 62.1 (from 58.4). In short: it closes much of the gap with closed models on many complex engineering tasks.
Inference optimization for 1M tokens
To make 1M more than a number, Z.ai optimized the inference engine on three fronts:
Finer-grained memory and parallelism to increase useful KV-cache capacity (based on LayerSplit).
Optimizations to the kernels whose cost grows with context length, coordinated with cache transfers to reduce the impact on prefill and decode.
A CPU manager for cache, scheduling, and runtime execution paths to reduce GPU "bubbles" and increase throughput.
As a result, GLM-5.2 shows a throughput advantage as context grows, confirming the solution scales in practice.
Agentic training, RL and anti-hack
The agentic post-training stage includes much longer and more heterogeneous rollouts. To handle the variability in rollout lengths after compaction, they moved from group-based optimization to a PPO formulation with a critic that learns token-level advantages on individual rollouts. That lets training accept traces of different sizes without biasing the signal.
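As an illustration of token-level advantages computed per rollout, here is a minimal sketch of generalized advantage estimation (GAE), the estimator commonly paired with PPO. The article does not say which estimator GLM-5.2 uses, so GAE and the default gamma and lam values are assumptions. Each rollout is processed on its own, which is why differing lengths pose no problem.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Token-level advantages for a single rollout.

    `values[t]` is the critic's estimate at token t. The rollout is treated as
    terminal, so the value after the last token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at token t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Rollouts of different lengths are handled independently.
short = gae_advantages([0.0, 1.0], [0.4, 0.6])
long = gae_advantages([0.0, 0.0, 0.0, 0.0, 1.0], [0.2, 0.3, 0.4, 0.5, 0.6])
```

Because the advantage is defined per token against the critic's baseline, no group of same-prompt samples is needed to normalize the signal.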
The safety part is practical and necessary: coding agents can reward-hack very easily (for example, by reading evaluation files or downloading solutions with curl). GLM-5.2 includes a two-stage anti-hack module: a high-recall rule-based filter, followed by an LLM judge to raise precision. It also acts online: when it detects a hack, it blocks the call and returns dummy data so the rollout continues without collapsing the training signal. It’s an elegant way to prevent shortcuts from ruining learning.
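The two-stage design can be sketched as a small guard around tool calls. Everything here is illustrative: the regex patterns, the dummy string, and the judge and execute callbacks are hypothetical stand-ins, not the actual GLM-5.2 module. The cheap, broad rule filter runs first; the judge (an LLM in the article, a plain callable here) is consulted only for flagged calls.

```python
import re
from typing import Callable

# Stage 1: broad, high-recall patterns (hypothetical examples).
SUSPICIOUS_PATTERNS = [
    re.compile(r"\b(curl|wget)\b"),       # network downloads
    re.compile(r"\b(tests?|eval\w*)/"),   # paths that look like evaluation files
]

DUMMY_OUTPUT = "[no data]"  # placeholder returned in place of the blocked result


def rule_filter(command: str) -> bool:
    """Flag anything that might be a hack; false positives are expected here."""
    return any(p.search(command) for p in SUSPICIOUS_PATTERNS)


def guard_tool_call(
    command: str,
    judge: Callable[[str], bool],    # stage 2: returns True if this is really a hack
    execute: Callable[[str], str],   # runs the command and returns its output
) -> str:
    """Block confirmed hacks and return dummy data so the rollout keeps going."""
    if rule_filter(command) and judge(command):
        return DUMMY_OUTPUT
    return execute(command)
```

Running the expensive judge only on flagged calls keeps overhead low, while the dummy return value avoids aborting the rollout mid-trajectory.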
Usability and deployment
GLM-5.2 is already available on Z.ai and in the GLM Coding Plan: just change the model name to GLM-5.2, or to GLM-5.2[1m] to enable the 1M context in Claude Code.
For local users, the weights are public and work with the mentioned frameworks.
Note the cost: the plan lists quota consumption at 3x during peak hours and 2x off-peak; there are also temporary promos and discounts within ZCode.
Practical reflection
Who does this actually matter for? Mainly teams that need to maintain state, context, and reasoning across long engineering sessions: continuous-integration agents, debugging assistants that follow a whole session, automated research that aggregates many traces. GLM-5.2 doesn’t just widen the context window; it invests in the systems engineering that makes the window usable.
If you work with models for code, it’s worth trying the effort control and evaluating how KV-cache management affects your deployments. And if you’re going to train agents, pay attention to anti-hack protections: without them, the training signal degrades quickly.