They did something that sounds like science fiction but is very practical: they taught code agents to write production CUDA kernels, integrated them with transformers and diffusers, and ended up with PyTorch-ready binaries and reproducible benchmarks. Can you imagine delegating to an agent the task of optimizing a kernel for an H100 and getting a fully buildable, measurable end-to-end project? That's exactly what they achieved with an agent skill.
What they did
They built an agent skill that packages domain knowledge about CUDA kernel development: optimization guides per architecture, vectorized templates, integration patterns for transformers and diffusers, and benchmark scripts. Then they pointed Claude and Codex at two real targets: a diffusers pipeline (LTX-Video) and a transformers model (Qwen3-8B). The agents generated working kernels, the C++ bindings for PyTorch, and scripts to measure micro-benchmarks and end-to-end performance.
Why it matters (and why it's hard)
Writing CUDA kernels is already complicated: memory accesses tuned to the architecture, vectorization, warp reductions, and the correct use of shared memory versus registers. Integrating those kernels into transformers or diffusers adds extra traps: normalization conventions, module hierarchies, and how to register ops for torch.compile.
What they did was concentrate that scattered expertise into documentation, templates, and examples that an agent can read and apply automatically. Pretty handy, right?
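To make the torch.compile point concrete: in recent PyTorch versions, a custom kernel is typically registered as a custom op so the compiler can trace through it without graph breaks. A minimal sketch, assuming a hypothetical my_kernels::rmsnorm op with a plain PyTorch body standing in for the real CUDA call:

import torch

# Hypothetical registration of an RMSNorm custom op; torch.compile then treats
# it as an opaque operator instead of breaking the graph around it.
@torch.library.custom_op("my_kernels::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # A real integration would dispatch to the compiled CUDA kernel here;
    # this reference body just keeps the sketch self-contained.
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

# The "fake" implementation tells the compiler the output shape and dtype
# without running the kernel, which tracing requires.
@rmsnorm.register_fake
def _(x, weight, eps):
    return torch.empty_like(x)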
How to use the skill right now
You install the kernels library and add the skill to the agent with one line. Examples:
pip install kernels
kernels skills add cuda-kernels --claude
# For Codex
kernels skills add cuda-kernels --codex
# Custom destination
kernels skills add cuda-kernels --dest ./my-agent/skills/
Then you ask the agent something concrete, for example:
- "Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model in transformers."
- "Build an optimized attention kernel for H100 targeting Qwen3-8B. Benchmark it against the PyTorch baseline and validate improvements end-to-end."
The agent reads the skill, selects architecture parameters (for example the CUDA compute capability for an H100), generates *.cu files, creates the bindings under torch-ext/, assembles build.toml, and prepares benchmark scripts.
What the skill includes (technical summary)
- Specific guides for NVIDIA H100, A100 and T4: compute capabilities, shared memory, bandwidth and block sizing.
- Integration patterns for diffusers and transformers, and common pitfalls.
- Kernel templates with vectorized accesses for BF16, FP16 and FP32.
- Benchmark workflows: micro-benchmarks and end-to-end comparisons (a minimal timing sketch follows this list).
- Integration with the Hugging Face Kernel Hub via get_kernel.
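As a rough idea of what those benchmark workflows look like, here is a minimal timing sketch that measures a baseline RMSNorm with CUDA events; rmsnorm_custom is a placeholder for whatever kernel the agent generates, and torch.nn.functional.rms_norm requires PyTorch 2.4 or newer:

import torch

def benchmark(fn, *args, warmup=10, iters=100):
    # Average CUDA execution time in milliseconds, after a short warmup.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(1, 8192, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.ones(4096, device="cuda", dtype=torch.bfloat16)

baseline_ms = benchmark(torch.nn.functional.rms_norm, x, (4096,), w, 1e-6)
# custom_ms = benchmark(rmsnorm_custom, x, w, 1e-6)  # placeholder for the generated kernel
print(f"PyTorch baseline: {baseline_ms:.3f} ms")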
Skill structure (simplified):
.claude/skills/cuda-kernels/
├── SKILL.md
├── scripts/
└── references/
The main guidance file is only about 550 tokens; the skill bundles the scripts, optimization guides, and ready-made examples alongside it.
Key results and benchmarks
They tested two real targets on an H100 (80 GB HBM3) in BF16.
- LTX-Video (diffusers)
- RMSNorm micro-benchmarks show an average speedup of 1.88x versus PyTorch, with bandwidth efficiency at 34.7% of the H100's theoretical peak (3350 GB/s).
Table (summary example):
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|---|---|---|---|
| [1x1024x2048] | 0.039 | 0.064 | 1.64x |
| [4x4096x3072] | 0.173 | 0.393 | 2.26x |
- End-to-end video generation (with a single optimized kernel type) improved by ~6% when combined with torch.compile.
- Qwen3-8B (transformers)
- Optimized RMSNorm showed speedups that scale with context length: 1.58x at 128 tokens, up to 2.47x at 8192 tokens, for an average of 1.94x and a bandwidth efficiency of 22.3% of the H100's theoretical peak.
Table (summary example):
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|---|---|---|---|
| [1x128x4096] | 0.040 | 0.062 | 1.58x |
| [1x8192x4096] | 0.109 | 0.269 | 2.47x |
Interpretation: for long-context inference, the custom kernel cuts RMSNorm latency roughly in half on average. In LTX-Video, RMSNorm may account for only ~5% of total compute, but optimizing it still chips away at the overall runtime.
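That intuition is just Amdahl's law: if a fraction p of the runtime gets s times faster, the whole pipeline speeds up by 1 / ((1 - p) + p / s). A quick back-of-the-envelope check, treating the ~5% share and the 1.88x micro-benchmark average as rough assumptions:

# Back-of-the-envelope Amdahl's law estimate (assumed numbers, not measurements).
p = 0.05   # assumed fraction of LTX-Video runtime spent in RMSNorm
s = 1.88   # average micro-benchmark speedup of the custom kernel
overall = 1 / ((1 - p) + p / s)
print(f"kernel-only end-to-end speedup ~ {overall:.3f}x")  # ~1.024x, about 2%

Under those assumptions the kernel alone buys roughly 2%, which is consistent with the reported ~6% only showing up once torch.compile is added on top.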
From prototype to the Hub: workflow to publish kernels
The agent already generates a project with the structure required by kernel-builder:
your_kernel/
├── build.toml
├── kernel_src/rmsnorm.cu
└── torch-ext/
├── torch_binding.cpp
└── your_kernels/__init__.py
build.toml includes appropriate cuda-capabilities (for example 9.0 for H100). The publish pipeline uses Nix flakes to build against all supported PyTorch/CUDA combinations, then you upload the binaries to the Hub:
# Simplified run
nix flake update
nix run .#build-and-copy -L
huggingface-cli repo create your-org/your-kernel --type model
huggingface-cli upload your-org/your-kernel ./build
Once published, anyone can load the kernel with a single line, no local compilation required:
from kernels import get_kernel
rmsnorm = get_kernel('your-org/your-kernel')
get_kernel detects Python, PyTorch and CUDA versions and downloads the appropriate precompiled binary.
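If you want code that still runs on machines where no matching binary has been published, a simple guard around the load is enough. A minimal sketch, reusing the hypothetical repo name from above; the fallback path is illustrative, not part of the kernels API:

import torch
from kernels import get_kernel

def load_hub_rmsnorm():
    # Try to fetch the precompiled kernel for this Python/PyTorch/CUDA combo;
    # return None so the caller can fall back to the plain PyTorch path.
    try:
        return get_kernel("your-org/your-kernel")
    except Exception:
        return None

hub_kernel = load_hub_rmsnorm()
use_custom = hub_kernel is not None and torch.cuda.is_available()
print("Using Hub kernel" if use_custom else "Falling back to the PyTorch reference")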
Practical recommendations if you want to try it
- If you've ever compiled CUDA code by hand and hate dependency matrices, this saves you hours: the agent generates the project and the Hub delivers precompiled binaries.
- For critical kernels, validate with the included benchmark scripts: micro-benchmarks and end-to-end tests (a minimal numerical check follows this list).
- Contribute to the skill if you have architecture-specific optimizations; the skill is a starting point, not a perfect black box.
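Beyond raw timings, a quick numerical comparison against a plain PyTorch reference catches most integration mistakes early. A minimal sketch; reference_rmsnorm is a stand-in, and in a real check you would call the generated kernel to produce actual:

import torch

def reference_rmsnorm(x, weight, eps=1e-6):
    # Plain PyTorch RMSNorm, accumulated in FP32 as a stable comparison target.
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

x = torch.randn(4, 4096, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, device="cuda", dtype=torch.bfloat16)

expected = reference_rmsnorm(x, w)
actual = reference_rmsnorm(x, w)  # swap in the custom kernel's entry point here
# BF16 rounding differs between implementations, so loosen the tolerances a bit.
torch.testing.assert_close(actual, expected, rtol=1.6e-2, atol=1e-2)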
Publishing and sharing kernels lowers the friction so small teams and open-source projects can benefit from hardware-specific optimizations without maintaining multiple toolchains.
In short: it's not magic. It's packaging specialized human knowledge into a skill and letting agents like Claude and Codex apply that knowledge reproducibly. What next? More kernels, more architectures, and community contributions so GPU optimization stops being a bottleneck for innovation.
Original source
https://huggingface.co/blog/custom-cuda-kernels-agent-skills
