Hugging Face and Google Cloud launch alliance for open AI
Hugging Face and Google Cloud have announced a deep alliance so companies can build their own AI with open models. What does that mean for you as a developer, founder, or technical lead? In short: more options, less friction, and better performance when running open models on Google Cloud infrastructure.
What they announced
The two companies announced several integrations and technical improvements aimed at accelerating the use of open models. The most notable are:
A CDN Gateway that caches models and datasets directly on Google Cloud using Hugging Face's optimized storage (Xet) and Google's network.
Deeper integration with Vertex AI Model Garden, GKE AI/ML, Cloud Run and Compute Engine to deploy models in just a few steps.
Easier access to TPUs (Google's AI accelerators), with native support in Hugging Face libraries.
Improvements to Hugging Face Inference Endpoints with new instance types, better performance and price reductions.
Collaboration on security powered by Google Threat Intelligence and Mandiant to protect models, datasets and Spaces.
Use of Hugging Face on Google Cloud grew 10x in 3 years, and today that means tens of petabytes downloaded per month and billions of requests.
Key technical benefits
If you're the one deciding architecture or implementing inference in production, this brings concrete advantages:
Lower latency and faster time-to-first-token: the CDN Gateway cuts download times and stages models close to where you run inference.
Supply-chain robustness for models: local caching and redundancy reduce failures caused by network latency or issues with the upstream repo.
Simpler deployment and governance: going from a model page on Hugging Face to Vertex Model Garden or a GKE cluster will be more direct, and organizations can use private models with flows similar to the public ones.
Cost and performance: more instance types available and lower prices for Inference Endpoints improve the cost/latency tradeoff in production.
CDN Gateway and storage
The idea behind the CDN Gateway is to keep a cached copy of models and datasets in Google's infrastructure to reduce download friction. Technically this implies:
Origin: repositories on the Hugging Face Hub.
Caching: optimized storage (Xet) combined with Google buckets and networks to serve models from nearby regions.
Result: less cold start time, faster downloads and reduced outgoing traffic from the Hub.
If you've ever waited minutes for a heavyweight LLM to arrive in your pipeline, this should shorten those times noticeably.
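How much of this should change your code? Probably very little. Here is a minimal sketch, assuming the gateway is exposed as a Hub-compatible endpoint; the endpoint URL and model name are placeholders, not published values:

```python
# Minimal sketch, assuming the CDN Gateway is exposed as a Hub-compatible
# endpoint. The URL and repo id below are placeholders.
import os

# Hypothetical gateway endpoint; set it before importing huggingface_hub
# so the library picks it up.
os.environ["HF_ENDPOINT"] = "https://hf-gateway.example.com"

from huggingface_hub import snapshot_download

# Downloads every file in the repo (or reuses the local cache) and returns
# the local path, ready to load with transformers.
local_path = snapshot_download(repo_id="gpt2")
print(local_path)
```

If it works this way, the caching layer sits behind the same API you already use, so the win shows up as faster cold starts rather than new client code.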
Inference and deployment
Hugging Face Inference Endpoints is already one of the simplest ways to go from a model to a REST or gRPC service. With this alliance you'll see:
More instance options (including new GPUs and TPU-ready instances).
Better integration to deploy directly to Vertex, GKE or Cloud Run with a few clicks or commands.
Options to deploy private models securely within an enterprise.
Think of the flow: you pick a model, configure it, and in minutes you have a managed endpoint that auto-scales. That reduces operational complexity for small teams.
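To make that concrete, here is a minimal sketch using the existing huggingface_hub Inference Endpoints API. The vendor, region, and instance values are placeholders; the new Google Cloud instance types would show up as options there, so check what is available to your account:

```python
# Minimal sketch with huggingface_hub's Inference Endpoints API.
# vendor/region/instance values are placeholders for illustration only.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "demo-llm-endpoint",            # endpoint name (your choice)
    repository="gpt2",              # any Hub repo you can access
    framework="pytorch",
    task="text-generation",
    vendor="gcp",                   # assumption: deploy on Google Cloud
    region="us-east1",              # placeholder region
    accelerator="gpu",
    instance_size="x1",             # placeholder size
    instance_type="nvidia-l4",      # placeholder instance type
    type="protected",               # token-protected endpoint
)

endpoint.wait()                     # block until the endpoint is running
print(endpoint.url)

# The endpoint exposes a standard HTTP API; the client wraps it.
print(endpoint.client.text_generation("Hello, world", max_new_tokens=20))
```

The design point is that the endpoint is fully managed: scaling, TLS, and authentication come with it, so a small team does not have to run its own serving stack.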
TPUs and performance
Google's TPUs are in their seventh generation and keep maturing in both hardware and software. Hugging Face will work to make TPUs as easy to use as GPUs, thanks to native support in its libraries. In practice that means:
Less porting work for models that already use transformers and accelerate.
Opportunity for better throughput and lower cost per inferred token on certain workloads.
If your workload is training or inference for LLMs, having accessible, easy-to-use TPUs can change the cost and time equation.
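As a rough sketch of what "less porting work" could look like: the same transformers plus accelerate code path runs unchanged, and accelerate picks the device for you. Whether it actually lands on a TPU depends on the torch_xla / optimum stack installed on your VM; the model name below is just an example.

```python
# Minimal sketch: device-agnostic inference with transformers + accelerate.
# On a TPU VM with torch_xla installed, Accelerator picks up the XLA device;
# on a GPU box it picks up CUDA; otherwise it falls back to CPU.
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()                      # detects TPU/GPU/CPU
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = accelerator.prepare(model)               # moves the model to the device

inputs = tokenizer("Open models on TPUs:", return_tensors="pt")
inputs = {k: v.to(accelerator.device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```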
Security and governance
It's not just about performance. The alliance includes efforts to improve the security of the model ecosystem:
Scanning and protection powered by Google Threat Intelligence and Mandiant.
Stronger controls for models and datasets, applicable to Spaces and private repositories.
Improved traceability and auditing to meet internal policies and regulations.
This matters for regulated sectors or companies that require strict controls over models and data.
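On the access-control side, the pattern for private repositories already has a familiar shape: a scoped token gates who can pull the weights. A minimal sketch, where the repo id is hypothetical and the token comes from your secret manager or environment:

```python
# Minimal sketch: pulling a private model with a scoped access token, so the
# weights stay gated behind the organization's controls. The repo id is
# hypothetical.
import os
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="my-org/private-clinical-model",   # hypothetical private repo
    token=os.environ["HF_TOKEN"],               # fine-grained read token
)
print(local_path)
```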
Practical use cases
Recommendation startup: uses the CDN Gateway to serve a ranking model quickly to its Cloud Run service, reducing latency in the user experience.
Hospital with a private model: packages and hosts a model in Hugging Face Enterprise and consumes it from Vertex in a private VPC, without exposing the model weights.
Media company: deploys a generation-and-moderation pipeline using Inference Endpoints with optimized instances and centralized governance rules.
Does this sound like just a promise? Yes, but there are already concrete signs: downloads at scale and 10x adoption in 3 years.
What it means for people building AI
If you build or lead AI projects, this alliance gives you more control and more paths to optimize cost, latency and security without giving up the flexibility of open models. The idea is that you can choose the infrastructure (Vertex, GKE, Cloud Run or VMs) and that the workflow won't be a technical roadblock.
Technically, this reduces friction in the chain from the Hub to production inference: model distribution, accelerator compatibility, automated deployment and security controls.
Want to try something together? For example, I can suggest an architecture to deploy a Hugging Face LLM on Vertex with caching through the CDN Gateway and a fallback to GKE. Tell me your case and we'll build it.