Claude can teach smaller models to solve complex problems. Interesting, right? In this article I explain how the "teacher-student" flow works using the upskill tool, and why this can save you time and money when deploying local models for specialized tasks like writing CUDA kernels.
What upskill is and why it matters
upskill is a tool to generate, validate, and transfer skills (capabilities) from powerful models to cheaper or local ones. The idea is simple: use an expensive, high-quality model as the "teacher" to create a structured guide (a SKILL.md file and test cases), and then test that guide with lighter models as the "students."
Why should you care? Because many specialized tasks need domain knowledge that small models don’t have by default. With upskill you encapsulate that knowledge once and load it whenever needed. It’s like giving your assistant an expert manual they didn’t have before.
How the technical flow works (a practical summary)
- The teacher (for example Claude Opus 4.5) performs the task in a session, generating a trace or record of the work.
- From that trace, a SKILL.md and a set of test cases (skill_meta.json) are produced.
- upskill then runs automatic evaluations: it tests the student's performance with and without the skill and measures the "skill lift" (improvement in accuracy) and the token consumption.
This lets you compare models not only by accuracy, but by token efficiency and cost per task. Ideal for deciding if a local model is enough for production.
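To make that concrete, the test-case file could look roughly like this. This is a purely illustrative sketch: the field names are mine, not upskill's actual schema; what matters is the idea of pairing prompts with grading criteria:
{
  "_note": "illustrative sketch only; not upskill's actual schema",
  "name": "kernel-builder-cuda-kernels",
  "description": "Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.",
  "test_cases": [
    {
      "prompt": "Write a build.toml for a project with attention.cu and layernorm.cu targeting H100.",
      "expected": ["targets compute capability 9.0", "declares the torch-ext bindings"]
    },
    {
      "prompt": "Add a GEGLU kernel and expose it to PyTorch.",
      "expected": ["creates geglu.cu", "registers the op in torch_binding.cpp"]
    }
  ]
}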
A practical example: writing CUDA kernels with kernel-builder
The example in the original post is clear: use Claude to create a skill that teaches how to build optimized CUDA kernels for PyTorch with HuggingFace’s kernel-builder library.
What does the skill include? Everything you need: project structure, build.toml, conventions for GPU architectures (for example H100 -> compute capability 9.0), optimizations (shared memory use, alignment), and PyTorch bindings. That’s about 500 tokens that condense hours of documentation.
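As an idea of what those ~500 tokens condense, the build configuration conventions might be captured as something like the sketch below. The keys are reconstructed from memory of kernel-builder examples and should be treated as an assumption, not as the library's authoritative schema:
[general]
name = "my-kernels"

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.my_kernels]
backend = "cuda"
# assumed key name; H100 corresponds to compute capability 9.0
cuda-capabilities = ["9.0"]
src = [
  "kernel_src/attention.cu",
  "kernel_src/layernorm.cu",
  "kernel_src/geglu.cu",
]
depends = ["torch"]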
Snippet of the structure upskill generates:
./skills/kernel-builder-cuda-kernels/
├── SKILL.md           # main instructions (~520 tokens)
└── skill_meta.json    # metadata and test cases
And a summarized example of SKILL.md:
name: kernel-builder-cuda-kernels
description: Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.
## Project Structure
project/
├── build.toml
├── kernel_src/
│ ├── attention.cu
│ ├── layernorm.cu
│ └── geglu.cu
└── torch-ext/
└── torch_binding.cpp
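To show the kind of kernel the skill steers a model toward, here is my own minimal sketch (not code from the post) of an elementwise GELU written with the usual conventions the skill spells out: a grid-stride loop, aligned float4 accesses, and a build targeting H100 (nvcc -arch=sm_90):
// Illustrative sketch only: a simple elementwise GELU kernel.
// Build for H100 (compute capability 9.0):
//   nvcc -arch=sm_90 gelu_example.cu -o gelu_example
#include <cuda_runtime.h>
#include <cstdio>

__device__ inline float gelu(float x) {
    // tanh approximation of GELU, common in transformer kernels
    const float k = 0.7978845608f;  // sqrt(2/pi)
    return 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
}

__global__ void gelu_kernel(const float4* __restrict__ in,
                            float4* __restrict__ out,
                            int n_vec4) {
    // Grid-stride loop over float4 elements: coalesced, 16-byte-aligned accesses.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_vec4;
         i += blockDim.x * gridDim.x) {
        float4 v = in[i];
        out[i] = make_float4(gelu(v.x), gelu(v.y), gelu(v.z), gelu(v.w));
    }
}

int main() {
    const int n = 1 << 20;        // number of floats, multiple of 4
    const int n_vec4 = n / 4;
    float4 *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));   // placeholder data
    gelu_kernel<<<(n_vec4 + 255) / 256, 256>>>(d_in, d_out, n_vec4);
    cudaDeviceSynchronize();
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
None of this is exotic on its own; the point of the skill is that details like the sm_90 target or the vectorized access pattern become explicit, so a smaller model applies them without trial and error.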
Results and metrics: accuracy vs. tokens
The numbers are convincing. The post shows concrete cases:
- A local model went from a 40% to an 85% pass rate after applying the skill (+45 points).
- Another model (a Sonnet-class one) went from 60% to 95% (+35 points).
But watch out: it’s not always a net win. For large models (e.g., Claude Opus 4.5) the skill sometimes doesn’t improve accuracy and can increase token usage. What does that mean for you? For recurring tasks you want not only accuracy, but efficiency: getting the same result while consuming fewer tokens.
So upskill gives you two dimensions: accuracy improvement and change in token usage. Both matter when choosing your target model.
How to try it yourself (useful commands)
Installing and running is straightforward:
pip install upskill
# or use uvx
uvx upskill --help
Generate a skill from a trace:
upskill generate 'write nvidia kernels' --from ./trace.md
Evaluate models with a skill:
upskill eval ./skills/my-skill/ --model haiku --model sonnet
Example to evaluate locally with an OpenAI-compatible server:
# start a local server (example with llama-server)
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M
# evaluate
upskill eval ./skills/kernel-builder-cuda-kernels/ \
--model 'unsloth/GLM-4.7-Flash-GGUF:Q4_0' \
--base-url http://localhost:8080/v1
Remember to export keys if you’ll use APIs:
export ANTHROPIC_API_KEY=sk-ant-...
export HF_TOKEN=hf_...
upskill uses Claude Opus 4.5 as the generator by default, but it also supports OpenAI and OpenAI-compatible endpoints for local models.
Best practices and recommendations
- Always evaluate with and without the skill. Don’t assume the skill improves everything; measure it.
- Measure tokens in addition to accuracy. For repetitive pipelines, efficiency can be more valuable than a small accuracy gain.
- Iterate: regenerate the skill from an improved trace or add test cases when failures occur.
- Version your skills alongside the code: ./skills/my-skill/ is a shareable asset across teams.
What else is this useful for? Capturing internal knowledge, building libraries of skills for internal tools, and standardizing how agents interact with your stack.
Final thoughts
The idea is powerful but practical: use an expensive model as teacher and cheaper models as executors. It’s not magic, it’s knowledge engineering packaged up. If you work with specialized tasks—from CUDA kernels to internal YAML parsers—packaging that know-how into skills can save you hours of prompting and money in production.
