Custom kernels are the engine behind fast deep learning: they let you run GPU operations tailored to your workload, from tensor transforms to massive matrix multiplications. What’s the catch? Compiling for the right architecture, getting the flags right, and wrestling with compiler errors and ABI issues can turn into a nightmare. This is where Hugging Face steps in with kernel-builder and the kernels library: reproducibility with Nix, multi-backend support (CUDA, ROCm, Metal, XPU), and a tidy way to turn your GPU code into a native PyTorch operator.
What you'll learn and why it matters
This technical guide focuses on ROCm kernels. We use the GEMM kernel from RadeonFlow_Kernels as an example — the Grand Prize winner of the AMD Developer Challenge 2025 — optimized for the AMD Instinct MI300X and working in FP8 e4m3fnuz format.
Why should you care? If you work in research, accelerator development, or ML infra, being able to compile, test, and publish reproducible ROCm kernels saves you hours (and bugs) when others try to use your work in PyTorch.
Technical summary of the GEMM kernel (the essentials)
- Type: FP8 block-wise GEMM (General Matrix Multiplication) for MI300X.
- Format: e4m3fnuz (FP8 with 4 exponent bits and 3 mantissa bits).
- Scaling: applies per-block scaling (`a_scale`, `b_scale`) to keep numerical stability.
- Shapes: assumes precompiled shapes and a transposed layout (adjust the launcher if you need more shapes).
- Native function arguments: `a`, `b`, `a_scale`, `b_scale`, `c` with specific shapes and dtypes (inputs in FP8, output in bf16, scales in fp32). This convention guarantees high throughput at the cost of reduced precision, which is useful for inference and some quantized training workloads.
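To make this contract concrete, here is a minimal sketch of those shapes and dtypes written as plain assertions. It follows the usage example later in this post; the helper name check_gemm_args and the block-size constant QUANT_SIZE = 128 are illustrative, not part of the kernel's API.
import torch

QUANT_SIZE = 128  # per-block quantization size, as in the usage example below

def check_gemm_args(a, b, a_scale, b_scale, c):
    # a: (M, K) FP8, b: (K, N) FP8, c: (M, N) bf16, scales in fp32.
    M, K = a.shape
    N = b.shape[1]
    assert a.dtype == b.dtype == torch.float8_e4m3fnuz
    assert c.dtype == torch.bfloat16 and c.shape == (M, N)
    assert a_scale.dtype == b_scale.dtype == torch.float32
    assert a_scale.shape == (K // QUANT_SIZE, M)
    assert b_scale.shape == (K // QUANT_SIZE, N // QUANT_SIZE)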
Project structure expected by kernel-builder
kernel-builder expects a clear layout. A minimal example:
- `build.toml` - the build manifest
- `gemm/` - HIP sources / headers (kernel and launcher)
- `flake.nix` - for reproducibility with Nix
- `torch-ext/` - C++ bindings that register the operator in PyTorch
Simplified example of build.toml (key sections):
[general]
name = "gemm"
universal = false
[torch]
src = ["torch-ext/torch_binding.cpp", "torch-ext/torch_binding.h"]
[kernel.gemm]
backend = "rocm"
rocm-archs = ["gfx942"]
depends = ["torch"]
src = [
"include/gpu_types.h",
"gemm/gemm_kernel.h",
"gemm/gemm_launcher.hip",
]
include = ["include"]
Important points:
- `backend = "rocm"` and `rocm-archs` point to specific architectures such as `gfx942` for MI300.
- `torch` in `depends` indicates the kernel integrates as a PyTorch extension.
File conventions: .h vs .hip
- Use `.h` for headers with declarations, inlines, or templates that get included.
- Use `.hip` for HIP implementations that need separate compilation (launchers, complex device functions).
Renaming `.cpp` files to `.h` or `.hip` where appropriate helps kernel-builder identify and compile them correctly.
The launcher and the binding: how GPU code connects to PyTorch
The launcher (for example `gemm_launcher.hip`) exposes a C function `run(...)` that receives GPU memory pointers and shape parameters. Internally, it dispatches on the shape and calls the fastest available kernel, or aborts if the shape isn't supported.
The C++ binding (`torch-ext/torch_binding.cpp`) validates tensors, extracts dimensions, and calls `run` using the HIP stream. Finally, it registers the op in PyTorch with `TORCH_LIBRARY_EXPAND` so you can use it through `torch.ops` or a friendlier Python interface.
In Python, `torch-ext/gemm/__init__.py` typically wraps that call in a function `gemm(a, b, a_scale, b_scale, out=None)` that creates the output tensor in bfloat16 and delegates to `ops.gemm`.
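As a rough sketch, assuming the generated `_ops` module that kernel-builder produces for the extension (details may differ in your project), the wrapper could look like this:
from typing import Optional

import torch

from ._ops import ops  # generated by kernel-builder; exposes the registered op


def gemm(
    a: torch.Tensor,
    b: torch.Tensor,
    a_scale: torch.Tensor,
    b_scale: torch.Tensor,
    out: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # The kernel writes bf16 output; allocate it if the caller didn't pass one.
    if out is None:
        out = torch.empty(
            (a.shape[0], b.shape[1]), dtype=torch.bfloat16, device=a.device
        )
    ops.gemm(a, b, a_scale, b_scale, out)
    return out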
Reproducibility with Nix and flake.nix
For reproducible builds kernel-builder uses flake.nix. A simple flake includes the kernel-builder repo as a dependency and generates reproducible outputs:
{
  description = "Flake for GEMM kernel";
  inputs = {
    kernel-builder.url = "github:huggingface/kernel-builder";
  };
  outputs =
    { self, kernel-builder, }:
    kernel-builder.lib.genFlakeOutputs {
      inherit self;
      path = ./.;
    };
}
Want to avoid long rebuilds? Use Hugging Face's cache with cachix:
- `cachix use huggingface` (installs cachix and configures the cache)
- Or temporarily: `nix run nixpkgs#cachix -- use huggingface`
Don’t forget to run `nix flake update` and commit `flake.nix` and `flake.lock` so anyone can reproduce your build.
Build, test and package
Typical local development flow:
- `nix develop` to enter an environment with all dependencies.
- `build2cmake generate-torch build.toml` to generate the CMake projects.
- `cmake -B build-ext` and `cmake --build build-ext` to compile the extension.
To create packages and builds for all supported versions:
- Run `nix build . -L` from the project root. The result will be in `result`.
- Move artifacts to `build/` with: `mkdir -p build && rsync -av --delete --chmod=Du+w,Fu+w result/ build/`
Before publishing, clean dev artifacts with `build2cmake clean build.toml`.
Publish to the Hugging Face Hub
Essential steps:
- `hf repo create gemm` to create the repo in your account or org.
- Initialize git, configure LFS for binaries (`git lfs track "*.so"`), and push only the necessary files:
Example commit:
git add build/ gemm/ include/ src/utils tests/checker torch-ext flake.nix flake.lock build.toml
git commit -m "feat: Created a compliant gemm kernel"
git push -u origin main
Once on the Hub, other developers can load the kernel without installing it locally using the kernels library.
Load and use the kernel from Python
This is the nicest part: the kernel is registered and loaded from the Hub. Example usage:
import torch
from kernels import get_kernel
gemm = get_kernel("kernels-community/gemm")
M, N, K = 1024, 1536, 7168
QUANT_SIZE = 128
device = torch.device("cuda")
A_fp32 = torch.randn(M, K, device=device)
B_fp32 = torch.randn(K, N, device=device)
A_fp8 = A_fp32.to(torch.float8_e4m3fnuz)
B_fp8 = B_fp32.to(torch.float8_e4m3fnuz)
A_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
B_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)
C = torch.zeros(M, N, device=device, dtype=torch.bfloat16)
result = gemm.gemm(A_fp8, B_fp8, A_scale, B_scale, C)
Done: your ROCm kernel is used as a remote library and your PyTorch code invokes it as if it were a local extension.
Best practices and technical tips
- Keep `flake.lock` in the repo for exact reproducibility.
- Use `cachix` to speed up builds and avoid every developer compiling from scratch.
- Document supported shapes and layouts; the launcher usually fails if the input shape wasn’t precompiled.
- Include tests that validate accuracy and performance (for example, a `tests/checker` folder); see the sketch after this list.
- If you need more shapes, modify the launcher to add dispatch cases or implement a slower fallback.
- Control which files you upload: avoid unnecessary dev artifacts on the Hub.
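As a starting point for such a checker, here is a minimal accuracy sketch that compares the FP8 kernel against an fp32 reference. The unit scales and the error threshold are illustrative only; a real checker should quantize with proper per-block scales and use tolerances measured for the kernel.
import torch
from kernels import get_kernel

gemm = get_kernel("kernels-community/gemm")

M, N, K, QUANT_SIZE = 1024, 1536, 7168, 128
device = torch.device("cuda")

a = torch.randn(M, K, device=device)
b = torch.randn(K, N, device=device)

# fp32 reference result.
expected = a @ b

# Kernel path: cast to FP8 with unit scales (illustrative only).
a_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
b_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)
out = torch.zeros(M, N, device=device, dtype=torch.bfloat16)
got = gemm.gemm(a.to(torch.float8_e4m3fnuz), b.to(torch.float8_e4m3fnuz),
                a_scale, b_scale, out)

# FP8 quantization is lossy, so check a relative error bound instead of exact equality.
rel_err = ((got.float() - expected).norm() / expected.norm()).item()
assert rel_err < 0.2, f"relative error too high: {rel_err:.3f}"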
Building ROCm kernels doesn’t have to be a maze of flags and cryptic errors. With kernel-builder, Nix and the Hugging Face platform you can move from prototype to a shareable kernel the community can actually use — with clear, reproducible steps. Ready to have your kernel run on MI300X and be used by thousands?
