Google Cloud, Intel, and Hugging Face published a benchmark that might change how you think about serving large open models. The promise? Better performance and lower cost by running GPT OSS, OpenAI's open-weight Mixture of Experts (MoE) model, on Google Cloud's new C4 instances with Intel Xeon 6 (Granite Rapids) processors. (huggingface.co)
What Intel and Hugging Face published
The article documents controlled tests comparing C4 VMs (Intel Xeon 6, Granite Rapids) against the previous-generation C3 VMs (4th Gen Xeon, Sapphire Rapids), using the unsloth/gpt-oss-120b-BF16 model for text generation in bfloat16 precision. The goal was to measure per-token decoding latency and throughput normalized by vCPU across different batch sizes. (huggingface.co)
Let me make it simple: GPT OSS is a Mixture of Experts (MoE) model that activates only some “experts” per token, which makes it much more CPU-efficient if the framework doesn’t duplicate work. Intel and Hugging Face added optimizations so each expert only processes the tokens assigned to it. (huggingface.co)
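To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of sparse expert routing. It is not the actual GPT OSS or Transformers code, just an illustration of the pattern: each token is routed to its top-k experts, and each expert runs only on the tokens assigned to it, so idle experts cost nothing.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Toy MoE layer: each expert processes only the tokens routed to it."""

    def __init__(self, hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden) -- batch and sequence dims flattened beforehand
        scores = self.router(x).softmax(dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which tokens picked expert `e` in any of their top-k slots?
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                     # unused expert: no wasted compute
            # Run the expert ONLY on its tokens, then scatter results back
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

layer = SparseMoELayer(hidden=64)
tokens = torch.randn(10, 64)     # 10 tokens, flattened (batch * sequence)
print(layer(tokens).shape)       # torch.Size([10, 64])
```

The key point is the gather/scatter pattern: a dense implementation would push every token through every expert, which is exactly the redundant work the optimization avoids.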
Key results and numbers
- TCO (Total Cost of Ownership) improvement of up to 1.7x in favor of C4 over C3. (huggingface.co)
- Throughput per vCPU 1.4x to 1.7x higher, depending on batch size. (huggingface.co)
- At batch size 64, C4 delivers 1.7x the throughput per vCPU and, with near-parity in price per vCPU, that translates into a 1.7x TCO advantage (see the quick arithmetic after this list). (huggingface.co)
- Reproducible setup: 1024-token input, 1024-token output, batch sizes from 1 to 64, a static KV cache, and the SDPA attention backend. (huggingface.co)
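To see how the throughput number turns into the TCO claim, here is a back-of-the-envelope calculation in Python. The per-vCPU prices are placeholders chosen purely to illustrate the price-parity assumption, not real Google Cloud list prices.

```python
# Hypothetical hourly prices per vCPU; the blog's TCO claim assumes near parity.
price_per_vcpu_hour = {"c3": 0.05, "c4": 0.05}

# Normalized decode throughput per vCPU (C3 = 1.0 baseline, C4 = 1.7x at batch 64).
tokens_per_sec_per_vcpu = {"c3": 1.0, "c4": 1.7}

def cost_per_million_tokens(instance: str) -> float:
    """Cost to generate 1M tokens on one vCPU, given price and throughput."""
    tokens_per_hour = tokens_per_sec_per_vcpu[instance] * 3600
    return price_per_vcpu_hour[instance] / tokens_per_hour * 1_000_000

advantage = cost_per_million_tokens("c3") / cost_per_million_tokens("c4")
print(f"TCO advantage of C4 over C3: {advantage:.1f}x")   # -> 1.7x
```

If your region or committed-use discounts break the price-parity assumption, rerun the same arithmetic with your own numbers.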
Why should this matter to you now?
Do you have a product serving generation at scale, or are you evaluating infra for open LLMs? This suggests that:
- Modern CPUs can be a viable production option for MoE, especially if the software avoids redundant computation. (huggingface.co)
- For startups and SMBs, relying less exclusively on GPU accelerators can mean cheaper, simpler deployment paths. Can you imagine cutting your monthly bill without rewriting your model? That is exactly the kind of win this hints at.
It's also useful if your team prefers public cloud infra with large, homogeneous instances, since the gains come from moving to a newer VM generation on Google Cloud. (huggingface.co)
How to reproduce the benchmark quickly
If you want to try it yourself, the blog includes clear steps. In short:
- Create a c4-standard-144 or c3-standard-176 VM, depending on which side of the comparison you want to run. (huggingface.co)
- Clone the repo and use the included Docker recipe:
```bash
git clone https://github.com/huggingface/transformers.git
cd transformers/
git checkout 26b65fb5168f324277b85c558ef8209bfceae1fe
cd docker/transformers-intel-cpu/
sudo docker build . -t <your_image>
```
Inside the container they install the specified versions of transformers and torch for CPU and run the published benchmark script. All commands and benchmark steps are documented in the blog. (huggingface.co)
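For orientation, the sketch below shows roughly what such a CPU decoding benchmark looks like with the Transformers generate API, using the settings the blog reports (bf16, SDPA, static KV cache, 1024 tokens in and out). It is not the published script; the prompt and timing loop are my own simplifications, and running it at batch 64 on a 120B-parameter model requires a machine with very large RAM, such as the VMs above.

```python
# Minimal decode-throughput sketch in the spirit of the published benchmark.
# Only the model name and generation settings come from the blog post; the
# prompt, batch handling, and timing are illustrative assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/gpt-oss-120b-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,      # bf16 weights, as in the benchmark
    attn_implementation="sdpa",      # SDPA attention backend
)
model.eval()

batch_size = 64
# Identical long filler prompts (so no padding is needed), truncated to 1024 tokens.
prompt = "The quick brown fox jumps over the lazy dog. " * 200
inputs = tokenizer(
    [prompt] * batch_size,
    return_tensors="pt",
    truncation=True,
    max_length=1024,
)

with torch.inference_mode():
    start = time.perf_counter()
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        min_new_tokens=1024,             # force a full 1024-token decode
        do_sample=False,
        cache_implementation="static",   # static KV cache, as in the blog
    )
    elapsed = time.perf_counter() - start

generated_tokens = output.shape[0] * 1024
print(f"{generated_tokens / elapsed:.1f} generated tokens/s at batch {batch_size}")
```

This gives a rough end-to-end tokens-per-second figure (prefill included); divide by the VM's vCPU count to compare throughput per vCPU across instance types.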
Limitations and open questions
Not everything is plug-and-play. Keep in mind:
- The improvements are shown for one specific case: the GPT OSS MoE model with a fixed attention backend (SDPA) and static KV-cache settings. Other models or configurations may behave differently. (huggingface.co)
- The TCO argument assumes price parity per vCPU between generations. If billing in your account or region differs, actual savings may vary. (huggingface.co)
- The test focuses on steady-state decoding and per-token throughput. Peak latency, cold starts, and mixed workloads may need extra testing. (huggingface.co)
Final thought
This post is a practical reminder: not every AI gain comes from newer models or more GPUs. Sometimes the easiest win is changing instance type, upgrading CPUs, and tuning software to avoid wasted operations. Is it worth trying C4 if you're already on C3? If you serve MoE models at scale, yes, it probably is.
More details and the step-by-step guide are on the Hugging Face blog. (huggingface.co)