The transition that started with the 'DeepSeek Moment' in January 2025 is no longer just about weights or benchmarks: it's about how entire AI systems are designed once openness stops being an option and becomes the floor. In this second technical article of the series we look at why the architectural and hardware choices of China's open community point in practical directions of their own, with lessons for researchers, engineers and policymakers.
Technical summary and why it matters
Are you interested in building systems that work in the real world, not just in papers? Then pay attention: in 2025 the Chinese community prioritized operational sustainability, deployment flexibility and cost-effectiveness over squeezing an extra point on a closed benchmark.
That translated into three concrete, simultaneous trends: massive adoption of MoE (Mixture-of-Experts), proliferation of small models (0.5B–30B) as practical building blocks, and tight alignment between models and domestic inference/hardware stacks. These choices are technical, but above all strategic: they aim to make AI reproducible, trainable and deployable under real conditions.
Mixture-of-Experts (MoE): the practical choice
Why MoE instead of simply ever-larger dense models? Think of MoE as a way of distributing compute: the model keeps a large total pool of parameters, but for each token a learned gating network activates only a small subset of "experts" (a minimal routing sketch follows the list below). That allows:
efficient inference: each request activates only a fraction of the parameters, so not every inference pays for the full model;
adaptation to heterogeneous environments: deployments don't all have to run on identical hardware;
cost-capacity balance: large MoE models set the capability ceiling, while most traffic can be served by activating only a few experts per token.
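To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. It's an illustrative sketch of the general technique, not the layer used by any particular Chinese model; the class name, expert sizes and top_k value are arbitrary choices for the example.

```python
# Minimal top-k Mixture-of-Experts layer (illustrative sketch, not any specific model's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)        # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = self.gate(x)                             # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token: compute scales with k, not n_experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(16, 512)                                  # 16 tokens
print(SimpleMoE()(x).shape)                               # torch.Size([16, 512])
```

The point of the sketch is the cost profile: total parameters grow with n_experts, but per-token compute grows only with top_k, which is exactly the cost-capacity balance described above.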
Technically, MoE introduces real challenges: gating, load balancing across experts, routing overhead and latency variability. In China the priority was solving those operational issues (scheduling, memory capacity, expert quantization) so the models could run in production; a common way of handling the load-balancing part is sketched below.
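One widely used recipe for the load-balancing problem is an auxiliary loss in the style of the Switch Transformer: penalize the router when the fraction of tokens actually dispatched to each expert diverges from a uniform spread. Again, this is a generic sketch, not any specific model's implementation.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    """Switch-style auxiliary loss: encourages tokens to spread evenly across experts.

    router_logits: (tokens, n_experts) raw gate scores
    expert_idx:    (tokens,) index of the expert each token was routed to
    """
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                                     # average routing probability per expert
    dispatch = F.one_hot(expert_idx, n_experts).float().mean(dim=0)   # fraction of tokens sent to each expert
    return n_experts * torch.sum(mean_prob * dispatch)

logits = torch.randn(64, 8)
idx = logits.argmax(dim=-1)            # top-1 routing, just for the example
print(load_balancing_loss(logits, idx, 8))
```

Adding this term to the training loss nudges the router toward even utilization, which is what keeps per-expert memory and latency predictable in serving.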
Also, many organizations used giant MoE models (100B–700B) as "teacher models" and distilled their capabilities into smaller, more manageable models, creating a practical pyramid: a few huge models at the top and many deployable models below.
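The teacher-to-student step is usually plain knowledge distillation. Here is a minimal sketch of the loss, with hypothetical student/teacher logits; real pipelines layer sequence-level distillation and data curation on top of this.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-label KL term against the teacher with the usual cross-entropy on labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: a batch of 4 "tokens" over a 100-symbol vocabulary.
student_logits = torch.randn(4, 100, requires_grad=True)
teacher_logits = torch.randn(4, 100)          # would come from the frozen MoE teacher
labels = torch.randint(0, 100, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```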
Modality and diversification: not just text
Since February 2025, open releases have stopped being text-only. Any-to-Any models, text-to-image, image-to-video, text-to-video, TTS, 3D and agent systems emerged in parallel. What changed? Not only were weights published: reproducible toolchains were shared as well, including distillation datasets, evaluation pipelines, and runtimes for edge and edge-to-cloud coordination.
Relevant examples: StepFun, with high-performance multimodal models across audio, video and image and Step-Audio-R1.1 competing with proprietary systems; and Tencent, which advanced in video and 3D with Hunyuan Video and Hunyuan 3D. The competition clearly extends beyond the textual domain.
Small models: the operational reality
Models in the 0.5B–30B range became the practical unit. Why? Because they're easy to run locally, to fine-tune and to integrate into enterprise systems or agents. Qwen 1.5-0.5B, for example, spawned many derivatives for this very reason: a balance between capability and practicality.
This approach answers real requirements: compute-constrained environments, compliance and privacy. Large organizations still use huge models for research and distillation, but day-to-day production workloads run on small or mid-size models.
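As an illustration of how light the fine-tuning side gets at this scale, here is a sketch that attaches LoRA adapters to a ~0.5B model with Hugging Face transformers and peft. It assumes the hub id Qwen/Qwen1.5-0.5B and the usual q_proj/v_proj target modules; adjust both for whichever small model you actually use.

```python
# Sketch: attach LoRA adapters to a small open model for local fine-tuning.
# Assumes the Hugging Face hub id "Qwen/Qwen1.5-0.5B"; swap in your own small model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen1.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections; names are model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()            # typically well under 1% of total parameters
# From here, any standard Trainer or custom loop fine-tunes only the adapter weights.
```

At this size the whole loop fits on a single consumer GPU or even CPU, which is exactly why the 0.5B–30B band became the default integration point for enterprise and agent systems.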
Domestic hardware and training: the new normal
One of the most notable changes is the entry of domestic hardware not only into inference but into key stages of training. Clear signals:
DeepSeek-V3.2-Exp shipped with day-zero support for Huawei Ascend and Cambricon, with reproducible inference pipelines published alongside the weights.
Ant Group reported that training its Ling models with optimizations on domestic chips reached performance close to NVIDIA's H800, reducing the cost of training 1 trillion tokens by about 20%.
Baidu documented training Qianfan-VL on over 5,000 Kunlun P800 accelerators and published parallelization and efficiency details.
At the start of 2026 Zhipu and China Telecom announced models trained entirely on domestic chips. That move indicates China's compute value chain is maturing: not only inference, but large-scale training too.
Inference infrastructure and deployment
Serving engineering opened up as well. Moonshot AI published Mooncake, which disaggregates stages such as prefill and decoding; Baidu released FastDeploy 2.0, emphasizing extreme quantization and cluster-level optimization; and Alibaba aligned model, framework and cloud to reduce the friction between research and production.
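To see why separating prefill from decoding matters, here is a conceptual sketch of the two phases using the transformers KV cache (this is not Mooncake's or FastDeploy's API, just the underlying idea): prefill is one compute-heavy pass over the prompt that builds the cache, decoding is many small memory-bound steps that reuse it, so the two can be scheduled, batched or even hosted separately.

```python
# Conceptual prefill/decode split via the KV cache (not Mooncake's actual API).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-0.5B"                        # any small causal LM works for the demo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

@torch.no_grad()
def prefill(prompt):
    """Compute-heavy phase: one forward pass over the whole prompt builds the KV cache."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(input_ids=ids, use_cache=True)
    return ids, out.logits[:, -1:], out.past_key_values

@torch.no_grad()
def decode(ids, last_logits, cache, max_new_tokens=20):
    """Memory-bound phase: one token at a time, each step reusing and extending the cache."""
    for _ in range(max_new_tokens):
        next_id = last_logits.argmax(dim=-1)           # greedy decoding, for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
        out = model(input_ids=next_id, past_key_values=cache, use_cache=True)
        last_logits, cache = out.logits[:, -1:], out.past_key_values
    return tok.decode(ids[0], skip_special_tokens=True)

ids, logits, cache = prefill("Mixture-of-Experts models are")
print(decode(ids, logits, cache))
```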
The technical lesson: shipping weights is no longer enough. It's crucial to publish reproducible pipelines, standard quantization formats, edge runtimes and deployment examples on target hardware so others can validate real performance from day one.
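On the quantization side, even the stock PyTorch path gives a feel for what the dedicated engines industrialize. A minimal sketch of generic dynamic int8 quantization follows; production stacks use purpose-built formats (GPTQ, AWQ, engine-specific kernels) with better accuracy/speed tradeoffs, and this toy model is only a stand-in.

```python
# Sketch: dynamic int8 quantization of a model's linear layers for CPU inference.
# Generic PyTorch API, shown only to illustrate the idea behind published quantization formats.
import torch

model = torch.nn.Sequential(                       # stand-in for a real model
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8    # weights stored as int8, activations quantized on the fly
)
x = torch.randn(1, 512)
print(quantized(x).shape)                          # same interface, smaller weights
```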
Licenses and adoption: Apache-2.0 as the practical norm
After DeepSeek R1, the community moved toward permissive licenses. Apache-2.0 became the default choice because it reduces legal and technical friction for companies that want to modify, integrate and deploy models in production. Unfamiliar or very restrictive licenses add barriers and slow adoption.
Tradeoffs and technical risks
Latency and variability in MoE: average efficiency is high, but queue management gets more complex and tail (p99) latency more variable; a quick way to measure this is sketched after this list.
Training cost vs. inference cost: building a capacity ceiling (an MoE teacher) and distilling it is expensive up front but efficient at scale.
Dependence on domestic hardware: it strengthens autonomy, but limited compute availability reported by some players can slow expansion.
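When evaluating an MoE serving stack, measure tail latency directly rather than trusting averages. A tiny sketch, with a hypothetical send_request placeholder you would replace with a real call to your model server:

```python
# Sketch: measure p50/p99 latency of a serving endpoint (send_request is a placeholder you supply).
import time
import numpy as np

def measure_latency(send_request, n=200):
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()                                   # your call to the model server
        samples.append(time.perf_counter() - start)
    p50, p99 = np.percentile(samples, [50, 99])
    return p50, p99

# Example with a fake heavy-tailed workload standing in for real requests:
p50, p99 = measure_latency(lambda: time.sleep(0.01 + 0.05 * (np.random.rand() ** 8)))
print(f"p50={p50*1000:.1f} ms, p99={p99*1000:.1f} ms")
```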
Delivering models, infrastructure and reproducible documentation becomes a technical competitive advantage. It's not just about who posts the best benchmark score, but who makes everything work under real conditions.
What this means for your work as a researcher or engineer
If you work on product, prioritize models you can run and maintain: start with 0.5B–30B models and a distillation strategy from a large teacher if you need higher capabilities.
If you're a researcher, explore MoE but don't ignore systems engineering: routing, balancing, quantization and testing on target hardware are as important as architecture.
If you handle policy or procurement, value stacks that include reproducibility, permissive licenses (Apache-2.0) and day-zero support on target hardware to reduce adoption risk.
Final reflection
The story unfolding in China isn't a technical monologue about raw performance. It's a conversation between architecture, compute economics and real operations. In practice, that means open architectures, modal diversification and integration with domestic hardware are not just local tactics: they're strategies to make AI usable, sustainable and governable in the real world.