Custom kernels are the engine behind fast deep learning: they let you run GPU operations tailored to your workload, from tensor transforms to massive matrix multiplications. What’s the catch? Compiling for the right architecture, getting the flags right, and wrestling with compiler errors and ABI issues can turn into a nightmare. This is where Hugging Face steps in with kernel-builder and the kernels library: reproducibility with Nix, multi-backend support (CUDA, ROCm, Metal, XPU), and a tidy way to turn your GPU code into a native PyTorch operator.
What you'll learn and why it matters
This technical guide focuses on ROCm kernels. We use the GEMM kernel from RadeonFlow_Kernels as an example — the Grand Prize winner of the AMD Developer Challenge 2025 — optimized for the AMD Instinct MI300X and working in FP8 e4m3fnuz format.
Why should you care? If you work in research, accelerator development, or ML infra, being able to compile, test, and publish reproducible ROCm kernels saves you hours (and bugs) when others try to use your work in PyTorch.
Technical summary of the GEMM kernel (the essentials)
- Type: FP8 block-wise GEMM (General Matrix Multiplication) for MI300X.
- Format: e4m3fnuz (FP8 with 4 exponent bits and 3 mantissa bits).
- Scaling: applies per-block scaling (`a_scale`, `b_scale`) to keep numerical stability.
- Shapes: assumes precompiled shapes and a transposed layout (adjust the launcher if you need more shapes).
- Native function arguments: `a`, `b`, `a_scale`, `b_scale`, `c` with specific shapes and dtypes (inputs in FP8, output in bf16, scales in fp32). This convention guarantees high throughput at the cost of reduced precision, which is useful for inference and some quantized training workloads.
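To make this contract concrete, here is a minimal sketch of those shapes and dtypes written as plain assertions. It follows the usage example later in this post; the helper name check_gemm_args and the block-size constant QUANT_SIZE = 128 are illustrative, not part of the kernel's API.
import torch

QUANT_SIZE = 128  # per-block quantization size, as in the usage example below

def check_gemm_args(a, b, a_scale, b_scale, c):
    # a: (M, K) FP8, b: (K, N) FP8, c: (M, N) bf16, scales in fp32.
    M, K = a.shape
    N = b.shape[1]
    assert a.dtype == b.dtype == torch.float8_e4m3fnuz
    assert c.dtype == torch.bfloat16 and c.shape == (M, N)
    assert a_scale.dtype == b_scale.dtype == torch.float32
    assert a_scale.shape == (K // QUANT_SIZE, M)
    assert b_scale.shape == (K // QUANT_SIZE, N // QUANT_SIZE)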
Project structure expected by kernel-builder
kernel-builder expects a clear layout. A minimal example:
- `build.toml` - the build manifest
- `gemm/` - HIP sources / headers (kernel and launcher)
- `flake.nix` - for reproducibility with Nix
- `torch-ext/` - C++ bindings that register the operator in PyTorch
Simplified example of build.toml (key sections):
[general]
name = "gemm"
universal = false
[torch]
src = ["torch-ext/torch_binding.cpp", "torch-ext/torch_binding.h"]
[kernel.gemm]
backend = "rocm"
rocm-archs = ["gfx942"]
depends = ["torch"]
src = [
"include/gpu_types.h",
"gemm/gemm_kernel.h",
"gemm/gemm_launcher.hip",
]
include = ["include"]
Important points:
- `backend = "rocm"` and `rocm-archs` point to specific architectures such as `gfx942` for MI300.
- `torch` in `depends` indicates the kernel integrates as a PyTorch extension.
File conventions: .h vs .hip
- Use `.h` for headers with declarations, inlines, or templates that get included.
- Use `.hip` for HIP implementations that need separate compilation (launchers, complex device functions).
Renaming `.cpp` files to `.h` or `.hip` where appropriate helps kernel-builder identify and compile them correctly.
The launcher and the binding: how GPU code connects to PyTorch
The launcher (for example `gemm_launcher.hip`) exposes a C function `run(...)` that receives GPU memory pointers and shape parameters. Internally, it dispatches on the shape and calls the fastest available kernel, or aborts if the shape isn't supported.
The C++ binding (`torch-ext/torch_binding.cpp`) validates tensors, extracts dimensions, and calls `run` using the HIP stream. Finally, it registers the op in PyTorch with `TORCH_LIBRARY_EXPAND` so you can use it through `torch.ops` or a friendlier Python interface.
In Python, `torch-ext/gemm/__init__.py` typically wraps that call in a function `gemm(a, b, a_scale, b_scale, out=None)` that creates the output tensor in bfloat16 and delegates to `ops.gemm`.
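As a rough sketch, assuming the generated `_ops` module that kernel-builder produces for the extension (details may differ in your project), the wrapper could look like this:
from typing import Optional

import torch

from ._ops import ops  # generated by kernel-builder; exposes the registered op


def gemm(
    a: torch.Tensor,
    b: torch.Tensor,
    a_scale: torch.Tensor,
    b_scale: torch.Tensor,
    out: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # The kernel writes bf16 output; allocate it if the caller didn't pass one.
    if out is None:
        out = torch.empty(
            (a.shape[0], b.shape[1]), dtype=torch.bfloat16, device=a.device
        )
    ops.gemm(a, b, a_scale, b_scale, out)
    return out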
Reproducibility with Nix and flake.nix
For reproducible builds kernel-builder uses flake.nix. A simple flake includes the kernel-builder repo as a dependency and generates reproducible outputs:
{
  description = "Flake for GEMM kernel";
  inputs = {
    kernel-builder.url = "github:huggingface/kernel-builder";
  };
  outputs =
    { self, kernel-builder, }:
    kernel-builder.lib.genFlakeOutputs {
      inherit self;
      path = ./.;
    };
}
Want to avoid long rebuilds? Use Hugging Face's cache with cachix:
- `cachix use huggingface` (installs cachix and configures the cache)
- Or temporarily: `nix run nixpkgs#cachix -- use huggingface`
Don’t forget to run `nix flake update` and commit `flake.nix` and `flake.lock` so anyone can reproduce your build.
Build, test and package
Typical local development flow:
- `nix develop` to enter an environment with all dependencies.
- `build2cmake generate-torch build.toml` to generate the CMake projects.
- `cmake -B build-ext` and `cmake --build build-ext` to compile the extension.
To create packages and builds for all supported versions:
- Run `nix build . -L` from the project root. The result will be in `result`.
- Move artifacts to `build/` with: `mkdir -p build && rsync -av --delete --chmod=Du+w,Fu+w result/ build/`
Before publishing, clean dev artifacts with `build2cmake clean build.toml`.
Publish to the Hugging Face Hub
Essential steps:
- `hf repo create gemm` to create the repo in your account or org.
- Initialize git, configure LFS for binaries (`git lfs track "*.so"`), and push only the necessary files:
Example commit:
git add build/ gemm/ include/ src/utils tests/checker torch-ext flake.nix flake.lock build.toml
git commit -m "feat: Created a compliant gemm kernel"
git push -u origin main
Once on the Hub, other developers can load the kernel without installing it locally using the kernels library.
Load and use the kernel from Python
This is the nicest part: the kernel is registered and loaded from the Hub. Example usage:
import torch
from kernels import get_kernel
gemm = get_kernel("kernels-community/gemm")
M, N, K = 1024, 1536, 7168
QUANT_SIZE = 128
device = torch.device("cuda")
A_fp32 = torch.randn(M, K, device=device)
B_fp32 = torch.randn(K, N, device=device)
A_fp8 = A_fp32.to(torch.float8_e4m3fnuz)
B_fp8 = B_fp32.to(torch.float8_e4m3fnuz)
A_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
B_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)
C = torch.zeros(M, N, device=device, dtype=torch.bfloat16)
result = gemm.gemm(A_fp8, B_fp8, A_scale, B_scale, C)
Done: your ROCm kernel is used as a remote library and your PyTorch code invokes it as if it were a local extension.
Best practices and technical tips
- Keep `flake.lock` in the repo for exact reproducibility.
- Use `cachix` to speed up builds and avoid every developer compiling from scratch.
- Document supported shapes and layouts; the launcher usually fails if the input shape wasn’t precompiled.
- Include tests that validate accuracy and performance (for example, a `tests/checker` folder); see the sketch after this list.
- If you need more shapes, modify the launcher to add dispatch cases or implement a slower fallback.
- Control which files you upload: avoid unnecessary dev artifacts on the Hub.
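As a starting point for such a checker, here is a minimal accuracy sketch that compares the FP8 kernel against an fp32 reference. The unit scales and the error threshold are illustrative only; a real checker should quantize with proper per-block scales and use tolerances measured for the kernel.
import torch
from kernels import get_kernel

gemm = get_kernel("kernels-community/gemm")

M, N, K, QUANT_SIZE = 1024, 1536, 7168, 128
device = torch.device("cuda")

a = torch.randn(M, K, device=device)
b = torch.randn(K, N, device=device)

# fp32 reference result.
expected = a @ b

# Kernel path: cast to FP8 with unit scales (illustrative only).
a_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
b_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)
out = torch.zeros(M, N, device=device, dtype=torch.bfloat16)
got = gemm.gemm(a.to(torch.float8_e4m3fnuz), b.to(torch.float8_e4m3fnuz),
                a_scale, b_scale, out)

# FP8 quantization is lossy, so check a relative error bound instead of exact equality.
rel_err = ((got.float() - expected).norm() / expected.norm()).item()
assert rel_err < 0.2, f"relative error too high: {rel_err:.3f}"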
Building ROCm kernels doesn’t have to be a maze of flags and cryptic errors. With kernel-builder, Nix and the Hugging Face platform you can move from prototype to a shareable kernel the community can actually use — with clear, reproducible steps. Ready to have your kernel run on MI300X and be used by thousands?
