Five labs, five minds: financial drama with small models

Jun 6, 2026 · Keryc Díaz · 5 minutes

Thousand Token Wood v2 puts you in a forest where you are the Patron: you lend, whisper tips, nudge prices and manipulate alliances while a magistrate tracks you. Five creatures, five voices, and, most importantly for an engineer, each creature thinks with a different small model. What happens when you bring heterogeneous models together in a financial simulation? This is the technical, practical answer.

What is Thousand Token Wood v2

v1 was a nice sandbox: five creatures in a world governed by a fine-tuned 0.5B model. You watched bubbles and crashes, but it was more voyeurism than play. v2 turns it into an operational game. You move levers: you lend at interest, pass along true tips or bait, short, bribe and negotiate, and the creatures remember and respond to you.

The key innovation for us: each creature uses a different model from a different lab instead of one single net with many prompts. That makes the market interesting: agents actually disagree.

Architecture and the serving-layer lesson

The surprise for builders: the biggest friction wasn't training models, it was serving them together.

vLLM 0.22.1 JIT-compiles kernels at load and requires nvcc, the CUDA compiler. A minimal base image doesn't include it, so every model failed with "could not find nvcc" until we switched to a development CUDA image. That one image change fixed all of them.
gpt-oss-20b ran in MXFP4 quantization and fit on a 24 GB L4 with room to spare; you don't need a top-tier GPU. Note: it responds in a channel-based format that opens with an analysis preamble, so the consumer must extract the final channel.
MiniCPM3-4B required trust_remote_code; Nemotron-Mini-4B loaded cleanly. Each model brings its own footguns, usually fixed with a one-line config.
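The final-channel extraction mentioned for gpt-oss-20b could look roughly like the sketch below. It assumes the raw completion reaches the consumer with channel markers such as `<|channel|>final<|message|>` still in the text; the exact marker strings and the helper name `extract_final` are assumptions for illustration, not the game's actual code. Some serving setups strip the markers or return the final text separately, in which case the fallback branch simply passes the text through.

```python
import re

# Assumed marker layout for channel-tagged output. Verify it against the
# text your own serving stack actually returns before relying on it.
_FINAL_RE = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)",
    re.DOTALL,
)


def extract_final(raw: str) -> str:
    """Return only the final-channel text, or the stripped input if no marker is found."""
    matches = _FINAL_RE.findall(raw)
    if matches:
        # If several final segments appear, keep the last one.
        return matches[-1].strip()
    return raw.strip()
```

Falling back to the stripped input keeps the same consumer usable for models that do not emit channels at all.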

The practical lesson: once serving is sorted (images, drivers, flags), adding another voice is a config entry, not a deep refactor.
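As an illustration of "a config entry, not a deep refactor", here is a minimal sketch of a council registry. The model names come from the article; the creature names, the `Voice` fields and the `load_options` helper are hypothetical and only show the shape such a config might take.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    creature: str                     # hypothetical creature identifier
    model: str                        # model name as mentioned in the article
    trust_remote_code: bool = False   # per-model loading quirk
    final_channel_only: bool = False  # consumer must drop the analysis preamble


# Which creature uses which model is an assumption made for this example.
COUNCIL = [
    Voice("creature_a", "gpt-oss-20b", final_channel_only=True),
    Voice("creature_b", "MiniCPM3-4B", trust_remote_code=True),
    Voice("creature_c", "Nemotron-Mini-4B"),
]


def load_options(voice: Voice) -> dict:
    """Translate one registry entry into loader options (illustrative only)."""
    options = {"model": voice.model}
    if voice.trust_remote_code:
        options["trust_remote_code"] = True
    return options
```

With this shape, adding a voice means appending one `Voice(...)` line; the per-model quirks live in the entry rather than in the simulation loop.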

Robustness: the JSON parser that saves the simulation

Small models have different tokenization and formatting habits, so each one can send you output that is malformed in its own way.

The robust solution in v2 was a tolerant parse-and-repair layer for JSON. All outputs pass through it: if something is irreparable it gets discarded, and the simulation doesn't crash.

Build that layer once and adding models becomes trivial. Don't underestimate the value of strong sanitization in the inference flow.

Different tokenizers produce different malformations. A robust parser turns a tangle of formats into a single reliable protocol.
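A tolerant parse-and-repair layer can be sketched in a few lines. This is a minimal version under stated assumptions, not the game's parser: it assumes each model is asked for a single JSON object, and the function name `parse_action` and the specific repairs (dropping surrounding prose, removing trailing commas) are illustrative choices.

```python
import json
import re
from typing import Optional

# Matches a comma that sits directly before a closing brace or bracket.
# Caveat: this naive repair would also touch such a sequence inside a string.
_TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def parse_action(raw: str) -> Optional[dict]:
    """Best-effort parse of one JSON object; returns None if it cannot be repaired."""
    # Drop any prose or code-fence text around the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]

    # Try the text as-is first, then with trailing commas removed.
    for attempt in (candidate, _TRAILING_COMMA_RE.sub(r"\1", candidate)):
        try:
            value = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None
```

Returning `None` instead of raising lets the caller discard an unrepairable turn and keep the simulation running.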

Security: the firewall against information leaks

The dramatic core of the game is the insider tip. You can give a creature a true tip (your edge) or a false one (bait). If the creature acts and wins too much, the magistrate investigates.

For this to be real, the truth of the tip must be hidden from the agents. How do you solve that?

The truth flag lives outside the prompt: in the player's ledger.
You build the public event without the flag; the narrator only summarizes public events.
There's an automated test that scans every full prompt, every turn, looking for forbidden tokens. That test is the guarantee: assume any secret will leak unless the test proves it's impossible.
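The three points above can be sketched as follows. This is a minimal illustration, assuming a ledger of `Tip` records on the player side; the field names, the prompt wording and the `FORBIDDEN_TOKENS` list are hypothetical, and a real scan would cover the full prompt of every agent on every turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tip:
    creature: str
    asset: str
    direction: str  # "up" or "down"
    is_true: bool   # secret: stays in the player's ledger only


# Hypothetical list of strings that must never reach an agent prompt.
FORBIDDEN_TOKENS = ("is_true", "bait", "true tip", "false tip")


def public_event(tip: Tip) -> dict:
    """Build the event agents may see; the truth flag is never copied over."""
    return {"type": "tip", "to": tip.creature, "asset": tip.asset, "direction": tip.direction}


def build_prompt(events: list[dict]) -> str:
    """Render public events into prompt text (wording is illustrative)."""
    return "\n".join(
        f"The Patron whispers to {e['to']}: {e['asset']} will go {e['direction']}."
        for e in events
    )


def find_leaks(prompt: str) -> list[str]:
    """Return every forbidden token found in a prompt, case-insensitively."""
    lowered = prompt.lower()
    return [token for token in FORBIDDEN_TOKENS if token in lowered]
```

A test would then assert `find_leaks(prompt) == []` for every prompt built during a seeded run, so a leak fails the build instead of surfacing in play.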

If part of what you hand an agent must stay secret, treat it as a firewall problem in the data flow, not as an instruction inside the prompt.

Persistent memory and history control

The creatures maintain persistent relationships: a signed sentiment toward the Patron and toward other creatures that updates with events (you shorted me, you paid me, you allied with a rival). That affects behavior: a hostile creature rejects loans; an ally cooperates.

The big risk is prompt inflation. Raw history grows and saturates a small model. The golden rule in v2:

Never drown the prompt with the entire history. Only present a one-line bucketized summary, for example "you feel warm toward Oona, cautious with the Patron".
Keep detailed notes internally, but bound their size and don't show them directly.
Part of the behavioral bias is emergent (the summary steers the model), and part is mechanical and deterministic (e.g., reject a loan if hostility > threshold). That makes the system observable and testable.
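The memory rules above can be sketched like this. It assumes sentiment is stored as a signed score clamped to [-1, 1]; the bucket boundaries, bucket labels other than "warm" and "cautious", and the 0.5 hostility threshold are hypothetical values chosen for the example.

```python
def update(score: float, delta: float) -> float:
    """Apply an event's effect and keep the signed sentiment within [-1, 1]."""
    return max(-1.0, min(1.0, score + delta))


def bucket(score: float) -> str:
    """Map a signed sentiment score to a coarse phrase (boundaries are assumptions)."""
    if score <= -0.5:
        return "hostile toward"
    if score < -0.1:
        return "cautious with"
    if score <= 0.1:
        return "neutral toward"
    if score < 0.5:
        return "warm toward"
    return "devoted to"


def summarize(scores: dict[str, float]) -> str:
    """One bounded line for the prompt; raw history and raw scores stay internal."""
    return "you feel " + ", ".join(f"{bucket(s)} {name}" for name, s in scores.items())


def rejects_loan(sentiment: float, threshold: float = 0.5) -> bool:
    """Deterministic rule: hostility is the negated sentiment."""
    return -sentiment > threshold


print(summarize({"Oona": 0.4, "the Patron": -0.2}))
# prints: you feel warm toward Oona, cautious with the Patron
```

Because `rejects_loan` never consults the model, that part of the behavior can be unit-tested directly, while the summary line is the only memory the small model has to read.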

Representative run and key metrics

Lever	Result
Models in the council	4 labs, all under 32B, served on Modal
Fine-tuned 0.5B reliability	0% self-deal, 100% valid offers (improvement over its 3B teacher)
Truth firewall	0 leaks of the tip flag in all scanned prompts
Insider-tip advantage	A true tip sets up a positive P&L; a false one does not
Heat to investigation	Two suspicious wins crossed the magistrate's line
Ruin	Margin call and default expel a creature; they return in another chapter

A single seeded run exercised patronage, information warfare, relationships and leverage end-to-end.

Practical lessons for engineers

A small model is very good at generating reliable formats and poor at reasoning; you close the gap with structure, prompting and a small fine-tune, not just scale.
A heterogeneous council is richer than a homogeneous one and doesn't cost much once serving is solved.
Secret information is a data-flow problem: put the firewall outside the prompt and validate with automated tests.
Persistent memory, if presented only as a bounded summary, is the cheapest way to make agents feel alive.

Final reflection

Thousand Token Wood v2 is an instructive miniature: it plays with economics, information and trust, but above all it's an engineering lab for integrating disparate small models. If you're building agents, take note: fixing serving and designing secure data flows will give you more practical value than scaling the model. Want distinct voices in your simulation? Solve the infrastructure first, then guard the data flows with tests.

Original source

https://huggingface.co/blog/build-small-hackathon/thousand-token-wood-sim-v2
