Local models triage PRs from the OpenClaw repo for free | Keryc
June 2026 will be remembered as the moment it became clear that closed models can disappear. With the removal of Claude Fable 5 still fresh in memory, it’s even more obvious why it matters to own your AI stack and know how to run models locally if your business depends on them.
What they did and why
The team behind OpenClaw built a system to triage issues and PRs using local models (for example gemma-4-26b-a4b and qwen3.6-35b-a3b) inside an agent harness called pi. The goal isn’t to replace a traditional classifier like BERT, but to use agents that can ask for context, inspect the repo and return structured labels.
Why run the agents locally rather than in the cloud? Because a paid cloud account brings rate limits and latency. On local hardware (a GB10 with 128 GB of unified memory in their case) you get near-real-time notifications at very low cost (just electricity and maintenance).
Architecture and technical flow
The design is simple and pragmatic:
openclaw/gitcrawl keeps a local mirror of the repo and normalizes every event (issue/PR) into uniform objects.
localpager stores those objects in SQLite and creates classification jobs.
A worker claims jobs, builds a GitHub context (title, body, labels, selected diffs) and sends it to localpager-agent.
localpager-agent runs the agentic logic: it can think, use a restricted shell (reposhell) for read-only operations and finally call final_json to return the classification in a defined schema.
The output is saved to SQLite and a deterministic component decides whether to notify on Discord according to configured policies.
Important: only the classification uses an LLM. Sending notifications is deterministic to reduce latency and errors.
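The queue step in the flow above can be sketched with nothing but Python's standard `sqlite3` module. This is a minimal illustration, not localpager's actual schema: the `jobs` table, its columns, the `pr:1` reference format, and the function names are all assumptions made for the example. The one design point it tries to show is that a worker claims a job inside a `BEGIN IMMEDIATE` transaction, so two workers cannot pick up the same row.

```python
import sqlite3

def open_db(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are opened and closed explicitly below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " item_ref TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT,"
        " result_json TEXT)"
    )
    return conn

def enqueue(conn, item_ref):
    """Create a classification job for one normalized issue/PR object."""
    cur = conn.execute("INSERT INTO jobs (item_ref) VALUES (?)", (item_ref,))
    return cur.lastrowid

def claim_job(conn, worker):
    """Atomically mark the oldest pending job as running; None if the queue is empty."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, item_ref FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running', worker = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise

def complete_job(conn, job_id, result_json):
    """Store the agent's structured output next to the job."""
    conn.execute(
        "UPDATE jobs SET status = 'done', result_json = ? WHERE id = ?",
        (result_json, job_id),
    )
```

A worker loop would call `claim_job`, build the GitHub context, hand it to the agent, and pass the returned JSON string to `complete_job`.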
Security: reposhell and limits
You don’t give the model a full bash shell. Instead you use reposhell, a limited shell that accepts only read commands (e.g. ls, cat, rg, git show --name-only). That way a malicious PR carrying a prompt injection cannot make the model execute arbitrary commands.
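To make the idea concrete, here is a minimal sketch of an allowlist check in the spirit of reposhell. It is not reposhell's real implementation: the function name, the exact allowlist, and the blocked flags are assumptions for illustration. The command is parsed into an argument vector with `shlex` and never handed to a shell, and a few flags that would break the read-only guarantee (for example ripgrep's `--pre`, which runs an external program, or git's `--output`, which writes a file) are refused.

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "rg"}           # read-only tools named in the article
ALLOWED_GIT_SUBCOMMANDS = {"show"}               # e.g. git show --name-only
BLOCKED_FLAGS = {"--pre", "--pre-glob", "--output"}  # flags that execute or write
SHELL_METACHARS = set(";|&<>`$")

def parse_readonly_command(line):
    """Return the argv list if the command is allowed, else raise ValueError."""
    argv = shlex.split(line)  # raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    for token in argv:
        if SHELL_METACHARS & set(token):
            raise ValueError(f"shell metacharacter in {token!r}")
    if argv[0] == "git":
        if len(argv) < 2 or argv[1] not in ALLOWED_GIT_SUBCOMMANDS:
            raise ValueError("git subcommand not allowed")
    elif argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {argv[0]!r}")
    for arg in argv[1:]:
        if arg.split("=", 1)[0] in BLOCKED_FLAGS:
            raise ValueError(f"flag not allowed: {arg!r}")
        # Keep paths inside the repository checkout.
        if arg.startswith("/") or ".." in arg.split("/"):
            raise ValueError(f"path escapes the repo: {arg!r}")
    return argv
```

The returned list is meant to be passed to something like `subprocess.run(argv, cwd=repo_dir)` without `shell=True`. An allowlist like this is only one layer; it does not replace running the agent with a read-only checkout and no credentials.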
They tested two local models, gemma-4-26b-a4b and qwen3.6-35b-a3b, and used DeepSeek-V4-Flash as a reference. Key optimizations for Gemma on GB10:
vLLM with NVFP4 quantization (hardware-friendly)
prefix caching and FP8 KV cache
CUTLASS MoE backend and language-model-only mode
On an evaluation set of 330 rows (each item labeled by consensus between GPT-5.5 and Opus) they found clear differences:
| Metric | gemma-4-26b-a4b | qwen3.6-35b-a3b | DeepSeek-V4-Flash |
|---|---|---|---|
| Precision | 0.716 ± 0.010 | 0.831 ± 0.007 | 0.938 |
| Recall | 0.905 ± 0.004 | 0.818 ± 0.006 | 0.714 |
| F1 | 0.800 ± 0.008 | 0.824 ± 0.002 | 0.811 |
| Exact match | 0.410 ± 0.014 | 0.540 ± 0.014 | 0.509 |
| False positives | 227.0 ± 10.5 | 105.7 ± 6.4 | 30 |
| False negatives | 60.0 ± 2.6 | 115.3 ± 4.0 | 181 |
| Wall seconds / row | 1.41 ± 0.04 | 13.51 ± 0.79 | 144.14 |
| Output tok/s / worker | 25 | 50 | 13 |
| Concurrency | 16 | 4 | 1 |
| Total parameters | 26B | 35B | 284B |
Quick interpretation: Gemma is faster and catches more cases (high recall) but generates more false positives. Qwen is more conservative (higher precision and exact match) at the cost of more time per row. DeepSeek is very precise but too large and slow for the GB10.
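For readers who want to reproduce this kind of comparison on their own data, the sketch below computes the same four quality metrics for a multi-label task. It assumes micro-averaging over labels (every predicted label on every row counts as one true or false positive), which the article does not state explicitly; the function name and the toy labels are hypothetical.

```python
def multilabel_scores(gold, pred):
    """gold and pred are lists of label sets, one set per evaluated row."""
    tp = fp = fn = exact = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the reference
        fn += len(g - p)   # reference labels the model missed
        exact += g == p    # row counts only if the whole label set matches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": exact / len(gold) if gold else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }

# Toy example with four rows (labels are made up for illustration).
gold = [{"bug"}, {"bug", "security"}, set(), {"docs"}]
pred = [{"bug"}, {"bug"}, {"docs"}, {"docs", "bug"}]
scores = multilabel_scores(gold, pred)
```

Exact match is the strictest of the four, since one extra or missing label fails the whole row. That is why it sits well below F1 for every model in the table.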
Methodology and auditing
To have a solid reference, they labeled 330 items through an adjudication process: each item was labeled multiple times by SOTA models (GPT-5.5 and Opus) and only labels with agreement were accepted. They also run an audit loop: every 2 hours a GPT-5.5 job (cloud) acts as a judge to find false positives and false negatives. This helps calibrate the system before trusting the local model fully.
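The adjudication rule can be sketched in a few lines. The article only says that labels "with agreement" were accepted, so the default here (a label must appear in every labeling pass) is an assumption, and `min_agreement` is a hypothetical knob for looser rules.

```python
from collections import Counter

def adjudicate(votes, min_agreement=None):
    """votes: one label set per labeling pass. Returns the accepted labels."""
    if not votes:
        return set()
    # Default: unanimous agreement across all passes (assumed, not confirmed).
    needed = len(votes) if min_agreement is None else min_agreement
    counts = Counter(label for vote in votes for label in set(vote))
    return {label for label, n in counts.items() if n >= needed}
```

For example, three passes returning `{"bug", "security"}`, `{"bug"}` and `{"bug", "docs"}` yield only `{"bug"}` under the unanimous rule.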
Audit cost: in their setup the batched GPT-5.5 review consumes about 40k tokens every 2 hours, roughly 2–3 cents per run, or about $9/month at 12 runs/day. It’s a calibration cost, not a mandatory part of the final design.
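The arithmetic behind that estimate is easy to check. The sketch below uses only the figures quoted above (one run every 2 hours, 2 to 3 cents per run, about 40k tokens per run) and assumes a 30-day month.

```python
HOURS_BETWEEN_RUNS = 2
RUNS_PER_DAY = 24 // HOURS_BETWEEN_RUNS   # 12 runs per day
TOKENS_PER_RUN = 40_000

def monthly_cost_usd(cents_per_run, days=30):
    """Audit cost per month in dollars, given the per-run cost in cents."""
    return cents_per_run * RUNS_PER_DAY * days / 100

low = monthly_cost_usd(2)      # 7.2
mid = monthly_cost_usd(2.5)    # 9.0
high = monthly_cost_usd(3)     # 10.8
tokens_per_month = TOKENS_PER_RUN * RUNS_PER_DAY * 30  # 14_400_000
```

The quoted $9/month corresponds to the midpoint of the 2 to 3 cent range.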
Use cases and limits of applicability
The pattern, which we might call agentic classification, isn’t exclusive to open-source repos. Natural applications:
Classify news in newsrooms
Filter relevant posts in networks and forums
Triage support tickets
Review moderation appeals
Filter sales leads or arXiv papers
Practical recommendation: for quick prototypes and closed domains, mid-sized local models that classify well in one shot, without fine-tuning, are an excellent first choice. For production at scale, think of a mix: local models for real-time filtering plus a periodic audit with a SOTA cloud model.
Lessons and technical recommendations
Having hardware with unified memory (GB10/Blackwell, 128 GB) makes batching and high concurrency easier.
NVFP4 and vLLM offer large throughput gains versus GGUF Q4_K_M on that topology.
Protect the environment: use limited shells, final_json tools and mandatory schemas to avoid unexpected outputs.
Reserve GPU for tasks that truly require inference; use deterministic rules where you can.
Balance precision vs recall according to your notifier: if false positives are costly (noise in Discord), choose models with higher precision or add post-processing rules.
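Two of the recommendations above, mandatory schemas and deterministic post-processing, can be illustrated together. The sketch below is a stand-in, not the project's code: the label set, the `confidence` field, the threshold, and both function names are hypothetical. It validates the agent's `final_json` payload strictly, then applies a fixed notification rule with no model call involved.

```python
import json

ALLOWED_LABELS = {"bug", "security", "docs", "feature"}  # hypothetical label set
NOTIFY_LABELS = {"bug", "security"}                      # hypothetical policy
MIN_CONFIDENCE = 0.8                                     # hypothetical threshold

def parse_final_json(raw):
    """Validate the model's final answer; raise ValueError on anything unexpected."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    labels = data.get("labels")
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError("labels must be a list of strings")
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    return {"labels": sorted(set(labels)), "confidence": float(conf)}

def should_notify(result):
    """Deterministic rule: notify only on watched labels above the threshold."""
    has_watched_label = bool(NOTIFY_LABELS & set(result["labels"]))
    return has_watched_label and result["confidence"] >= MIN_CONFIDENCE
```

Raising `MIN_CONFIDENCE` or narrowing `NOTIFY_LABELS` is one way to trade recall for precision at the notifier without touching the model.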
The core idea is simple: you can build a real-time triage system with very low marginal cost and competitive results if you combine local agents, security measures and an audit loop with SOTA models.