How2Everything: improve LLMs by evaluating real-world procedures
People ask chatbots for step-by-step instructions all the time: fix a leaking faucet, file taxes, negotiate a raise. How can you know if the steps an AI generates would actually work? You can’t ask a benchmark to perform surgery or rewire a house to check.
How2Everything aims to close that gap. It’s a technical framework to extract real procedures from the web, evaluate them for critical failures, and use those evaluations to improve language models. It includes a collection pipeline, a test benchmark, and an open judge that estimates whether a procedure would fail in practice.
What is How2Everything
How2Everything has three main components: How2Mine, How2Bench, and How2Score (with an open judge called How2Judge). The central idea is to turn tutorial text into structured procedures, evaluate validity at the task level, and use that signal to train models that produce steps that actually work.
How2Mine
How2Mine is the pipeline to extract and standardize procedures from the web at scale. Starting from the DCLM corpus, it uses WebOrganizer to identify tutorial-like pages and applies stratified sampling to ensure diversity across 14 topics: art, cooking, law, electronics, transport, and others.
Processing goes through several stages with GPT-4.1: candidate extraction from HTML, filtering (removing UI-dependent, non-sequential, or nonsensical procedures), heuristic checks (keeping only procedures with 5 to 15 steps), resource extraction, and final validation. The result: 351,162 structured procedures from 980,000 documents, produced with 252,000 API calls at an approximate cost of 5,700 USD.
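To make the heuristic stage concrete, here is a minimal sketch of the kind of step-count filter described above. The function and field names are illustrative assumptions, not the released pipeline code.

```python
def keep_procedure(proc: dict) -> bool:
    """Return True if an extracted candidate passes the basic heuristics."""
    steps = proc.get("steps", [])
    # Keep only procedures with 5 to 15 steps, as in the pipeline description.
    if not (5 <= len(steps) <= 15):
        return False
    # Drop procedures with empty or trivially short steps (likely extraction noise).
    if any(len(s.strip()) < 3 for s in steps):
        return False
    return True


candidates = [
    {"goal": "change a flat tire",
     "steps": ["Loosen the lug nuts", "Jack up the car", "Remove the flat tire",
               "Mount the spare", "Tighten the lug nuts", "Lower the car"]},
    {"goal": "too short", "steps": ["Do it"]},
]
print([p["goal"] for p in candidates if keep_procedure(p)])  # ['change a flat tire']
```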
As a quality check, a validation pass with GPT-4.1 rated 96.6% of the extracted procedures as valid. Still, the authors acknowledge the process isn't perfect; standardization and validation remain key.
How2Bench
How2Bench is the benchmark to test a model’s ability to generate procedures. Each item gives the model an objective (for example, "change a flat tire"), a list of available resources, and the exact number of steps N. The model must generate exactly N sentences, one per step.
This controlled design enables clean comparisons across models and reveals scaling trends by size and training progress. Unlike many benchmarks that saturate quickly, How2Bench keeps useful signal as models improve.
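As an illustration of this controlled setup, the sketch below builds a How2Bench-style prompt from an objective, a resource list, and a step count N, and checks that a response has exactly N steps. The prompt wording and the line-based parsing are assumptions; the released benchmark defines the actual format.

```python
def build_prompt(goal: str, resources: list[str], n_steps: int) -> str:
    return (
        f"Goal: {goal}\n"
        f"Available resources: {', '.join(resources)}\n"
        f"Write exactly {n_steps} steps, one sentence per step, one step per line."
    )


def has_exact_step_count(response: str, n_steps: int) -> bool:
    # Count non-empty lines as steps; a stricter parser could also check numbering.
    steps = [line for line in response.splitlines() if line.strip()]
    return len(steps) == n_steps


prompt = build_prompt("change a flat tire", ["jack", "lug wrench", "spare tire"], 6)
print(prompt)
```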
How2Score and How2Judge
How2Score measures whether a procedure has any critical failure that would prevent reaching the goal. What is a critical failure? Among others:
Missing essential steps.
Unnecessary actions that derail the process.
Internal contradictions.
Severe vagueness that makes the procedure unusable, for example omitting necessary times or temperatures, or skipping a legally required period.
Evaluating with a proprietary model like GPT-5 works, but it’s expensive and not reproducible. Evaluating 7,000 examples with GPT-5 would cost around 15 USD, according to the team. To provide an open alternative, they distilled GPT-5’s decisions: they generated 73,000 judgments with GPT-5 and trained an 8B open judge based on Qwen 3, called How2Judge.
The open judge agrees with GPT-5 in 90.5% of cases and matches the human majority label in 80.5% of control examples. It’s not perfect, but it’s reliable and cheap enough for reproducible evaluation and as a reward signal in training.
Practical evaluation: How2Score doesn’t measure whether something sounds good; it measures whether it contains failures that would make the task fail in real life.
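To show what this looks like in practice, here is a hedged sketch of scoring a procedure with an open judge through Hugging Face transformers. The model id, rubric wording, and verdict parsing are placeholders, not the released setup; consult the How2Judge release for the actual checkpoint and prompt format (a chat-style model may also expect a chat template).

```python
from transformers import pipeline

JUDGE_ID = "org/how2judge-8b"  # placeholder id: substitute the released checkpoint

RUBRIC = (
    "Decide whether the procedure below contains a critical failure: a missing "
    "essential step, an unnecessary action that derails the process, an internal "
    "contradiction, or vagueness severe enough to make it unusable. "
    "Answer with VALID or INVALID."
)

judge = pipeline("text-generation", model=JUDGE_ID)


def procedure_is_valid(goal: str, steps: list[str]) -> bool:
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    prompt = f"{RUBRIC}\n\nGoal: {goal}\nSteps:\n{numbered}\nVerdict:"
    out = judge(prompt, max_new_tokens=8, return_full_text=False)[0]["generated_text"]
    # True means the judge found no critical failure.
    return "INVALID" not in out.upper()
```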
Results: improving models with the critical-failure signal
How2Everything is not just diagnostic; it helps improve models. A subset of the How2Mine pool is used for training, and How2Score acts as a reward signal. By optimizing to minimize critical failures, the authors report substantial gains on How2Bench without degrading other capabilities.
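Conceptually, the reward is just the judge's verdict turned into a scalar. A minimal sketch, assuming a `judge_says_valid` callable that wraps How2Judge (the scoring sketch above is one way to build it):

```python
def reward(goal: str, steps: list[str], judge_says_valid) -> float:
    # Binary signal: 1.0 when the judge finds no critical failure, 0.0 otherwise.
    # During RL-style fine-tuning, each sampled procedure is scored this way and
    # the scalar is fed to the policy update in place of a learned reward model.
    return 1.0 if judge_says_valid(goal, steps) else 0.0
```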
Some notable numbers:
Qwen3-4B-Inst: from 30.3 to 43.5 (+13.2 points)
Qwen3-8B-Inst: from 38.5 to 48.6 (+10.1)
Olmo 3 7B Think: from 27.3 to 37.9 (+10.6)
Additionally, tests on 12 out-of-domain benchmarks showed no systematic degradation, suggesting How2Score’s signal is effective and doesn’t break other model skills.
A practical finding: explicit control of output length during training matters. Without that control, models learn to "game" the judge by producing longer, more verbose answers. An ablation experiment showed inflated How2Bench scores accompanied by much longer procedures when length control was removed.
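One simple way to encode that control is to make the reward non-increasing in response length. The penalty form and budget below are illustrative assumptions, not the paper's exact recipe; the point is that extra verbosity should never inflate the score.

```python
def length_controlled_reward(valid: bool, n_tokens: int, budget: int = 400) -> float:
    base = 1.0 if valid else 0.0
    if n_tokens <= budget:
        return base
    # Linearly discount responses that exceed the token budget so that padding
    # the answer with verbose steps can only hurt the reward.
    overrun = (n_tokens - budget) / budget
    return max(0.0, base - overrun)
```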
What is released and why it matters to you
They release everything needed to reproduce the pipeline and apply the same methodology:
Code for the How2Mine pipeline and the prompts.
Full dataset of 351,162 procedures and the How2Bench split.
The distilled How2Score judge (open 8B model, How2Judge).
Training recipes to fine-tune models using How2Score as a reward signal.
If you work on agents, planning systems, instructive models, or any product that guides people with concrete steps, this gives you two practical things: a way to measure if what your model generates will work, and a way to train it to reduce real failures.
Practical ideas to get started
Use the How2Bench split to evaluate your model with controlled objectives and length. Does your model fail because of missing steps or because of vagueness? (A small tallying sketch follows this list.)
Try How2Judge to create a cheap reward signal before investing in extensive human evaluation.
Watch out for reward hacking: control length and penalize irrelevant steps.
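For the first tip, a small tally over judge outputs can tell you which failure mode dominates. This assumes the judge is prompted to name a failure category rather than only return a VALID/INVALID verdict; the category labels here are illustrative.

```python
from collections import Counter


def failure_profile(judgements: list[str]) -> Counter:
    """judgements: one label per evaluated procedure, 'VALID' when no failure."""
    return Counter(j for j in judgements if j != "VALID")


print(failure_profile(["VALID", "VAGUENESS", "MISSING_STEP", "VAGUENESS"]))
# e.g. Counter({'VAGUENESS': 2, 'MISSING_STEP': 1})
```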
The main contribution of How2Everything is operational: it shows that the web can become a reference anchor to evaluate and improve behaviors that are hard to verify in the lab. They turn a measurement problem into a closed loop of reproducible, scalable improvement.
Are you ready to try it on your model and measure whether your instructions really work?