Thousand Token Wood: multi-agent economy with a 3B model | Keryc
I built Thousand Token Wood for the Build Small Hackathon. It's a small economy: five forest creatures, each an agent running on Qwen2.5-3B, trade five goods for pebbles amid gossip, hoarding and panic. Tap the wood, and bubbles, bankruptcies and a widening wealth gap emerge on their own.
What it is and why it matters
Thousand Token Wood is an engineering experiment aimed at people building with small models. The stack is simple: the model is served with vLLM on Modal, and the window into the wood is an app in Gradio. There's also an open traces dataset where each row stores the prompt, the raw response, the parsed actions and the agent's private thinking.
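To make the shape of a trace row concrete, here is a minimal sketch of one record serialized as a JSON line. The four content fields follow the description above; the field names themselves, plus `turn` and `agent`, are assumptions for illustration and may not match the published dataset.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TraceRow:
    """One decision by one agent in one turn (field names are illustrative)."""
    turn: int                # assumed bookkeeping field
    agent: str               # assumed bookkeeping field
    prompt: str              # exact text sent to the model
    raw_response: str        # model output before any parsing
    parsed_actions: list = field(default_factory=list)  # actions after parsing
    thinking: str = ""       # the agent's private reasoning


def to_jsonl(row: TraceRow) -> str:
    """Serialize a row as a single JSON line."""
    return json.dumps(asdict(row), ensure_ascii=False)
```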
The practical idea: a living economy needs many agents thinking multiple times per step. Using a frontier model for that is expensive and slow; a 3B lets you run a merchants' council at reasonable speed because all the creatures decide in a single batched GPU call per turn. Pretty handy, right?
Engineering lessons (what actually works)
Reliable format, uncertain reasoning. Qwen2.5-3B produced valid JSON in 100% of calls, but its economic judgment was weak. One agent that produces acorns even placed an order to buy acorns. What's the fix? Not bigger models, but a sharper signal: better prompts and structure.
Designed scarcity. In the naive version production outpaced consumption and nobody had incentives to trade. To generate exchange I introduced three simple mechanics:
Varied diet: a creature can eat only one unit of any given food per meal, so it has to buy what it doesn't grow.
Spoiling: perishables rot if they accumulate, forcing you to sell surpluses.
Winter firewood crisis: every creature must burn firewood each turn, so demand rises, yet only one creature makes firewood. That last rule creates the economic drama.
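The three mechanics can be sketched as small state updates. This is an illustration under stated assumptions, not the project's code: the `Creature` class, the goods in `FOODS` (only acorns, honey and firewood are named in this post), the spoilage cap and the burn rate are all hypothetical.

```python
from dataclasses import dataclass, field

FOODS = {"acorns", "honey", "berries"}   # hypothetical food goods
PERISHABLE_CAP = 6                        # hypothetical: stock above this rots
FIREWOOD_PER_TURN = 1                     # hypothetical winter burn rate


@dataclass
class Creature:
    name: str
    stock: dict = field(default_factory=dict)


def eat_meal(c, portions=2):
    """Varied diet: eat at most one unit of each food per meal."""
    eaten = []
    for food in sorted(FOODS):
        if len(eaten) == portions:
            break
        if c.stock.get(food, 0) > 0:
            c.stock[food] -= 1
            eaten.append(food)
    return eaten


def spoil(c):
    """Spoiling: perishable stock above the cap rots; returns units lost."""
    lost = 0
    for food in FOODS:
        excess = c.stock.get(food, 0) - PERISHABLE_CAP
        if excess > 0:
            c.stock[food] -= excess
            lost += excess
    return lost


def burn_firewood(c):
    """Winter: burn firewood each turn; returns True if the creature stayed warm."""
    if c.stock.get("firewood", 0) >= FIREWOOD_PER_TURN:
        c.stock["firewood"] -= FIREWOOD_PER_TURN
        return True
    return False
```

A creature holding only acorns eats one portion instead of two, loses anything above the cap, and goes cold without firewood, so each rule gives it a reason to trade.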
Pragmatic prompt engineering. Instead of scaling the model, I told each agent what it produces and must never buy, calculated the exact list of goods in deficit and gave a worked example. Decision quality improved drastically.
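A minimal sketch of that prompt construction, assuming per-turn needs are known for each good. The function names, the prompt wording and the JSON action schema in the worked example are hypothetical; only the three ingredients (what the agent produces and must never buy, the computed deficit list, a worked example) come from the description above.

```python
def find_deficits(stock, needs, produces):
    """Goods the agent is short of, excluding the good it produces itself."""
    short = {}
    for good, needed in needs.items():
        have = stock.get(good, 0)
        if good != produces and have < needed:
            short[good] = needed - have
    return short


def build_prompt(name, produces, stock, needs, prices):
    """Assemble a prompt that states the constraints instead of hoping the model infers them."""
    short = find_deficits(stock, needs, produces)
    short_text = ", ".join(
        f"{g} (need {q} more)" for g, q in sorted(short.items())
    ) or "nothing"
    stock_text = ", ".join(f"{g}={q}" for g, q in sorted(stock.items()))
    price_text = ", ".join(f"{g}={p}" for g, p in sorted(prices.items()))
    return "\n".join([
        f"You are {name}. You produce {produces}. Never buy {produces}.",
        f"Stock: {stock_text}",
        f"Reference prices: {price_text}",
        f"You are short of: {short_text}",
        # Worked example using a hypothetical action schema.
        'Example: {"actions": [{"type": "buy", "good": "firewood", "qty": 1, "price": 4}]}',
        "Reply with JSON only.",
    ])
```

Computing the deficit list in code means the model only has to choose quantities and prices for goods that are already known to be worth buying.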
Parsing tolerance. The parse-and-repair layer turns malformed JSON responses into safe no-ops instead of breaking the simulation. That keeps the experience smooth.
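A tolerant parser along these lines might look like the sketch below. The action schema (`actions`, `type`, `good`, `qty`) and the repair heuristic of extracting the outermost brace span are assumptions; the property being illustrated is that no input can raise an exception into the simulation loop.

```python
import json
import re

VALID_TYPES = ("buy", "sell")  # hypothetical action types


def parse_actions(raw):
    """Return (actions, ok). Any failure yields an empty action list, never an exception."""
    # Repair step: pull the outermost {...} span out of surrounding chatter.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return [], False
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], False
    actions = data.get("actions") if isinstance(data, dict) else None
    if not isinstance(actions, list):
        return [], False
    clean = []
    for a in actions:
        if not isinstance(a, dict):
            continue
        qty = a.get("qty")
        if (a.get("type") in VALID_TYPES
                and isinstance(a.get("good"), str)
                and isinstance(qty, int) and not isinstance(qty, bool)
                and qty > 0):
            clean.append(a)  # keep only well-formed actions
    return clean, True
```

An empty list is the no-op: the agent simply sits the turn out, and the `ok` flag lets the caller log the failure for later analysis.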
Wellbeing as a stable state. I originally modeled wellbeing as an accumulator: chronic deficits led to a death spiral that wasn't fun. I rewrote it as a mean-reverting mood: it improves with food and warmth and never hits zero. The stakes should be pebbles, prices and status, not absolute starvation.
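One way to express a mean-reverting mood is to move it a fixed fraction of the way toward a target set by the current turn's conditions, then clamp it above a floor. The constants below (`baseline`, `rate`, `floor`, and the 0.25 bonuses) are hypothetical values chosen for illustration, not the project's tuning.

```python
def update_mood(mood, fed, warm, baseline=0.5, rate=0.3, floor=0.05):
    """Move mood a fraction of the way toward this turn's target, never below the floor."""
    # The target depends only on current conditions, so past deficits do not pile up.
    target = baseline + (0.25 if fed else -0.25) + (0.25 if warm else -0.25)
    mood = mood + rate * (target - mood)
    return min(1.0, max(floor, mood))
```

Unlike an accumulator, a long run of bad turns bottoms out at the floor, and a single good turn starts the recovery.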
Historical shocks as playable events. You can draw a 'Wood Legend': Tulip Mania becomes the Great Acorn Mania, the South Sea Bubble becomes the Hollow Log Company, and the 1929 bank run becomes the Run on Oona's Reserve. Those stories trigger real shocks in the economy, and the agents react without specific scripting.
A concrete example: the Oona run
In a run of the reskinned 1929 legend, the rumor that the owl Oona's vault was empty caused Oona to liquidate honey. Supply flooded the market and the price of honey fell from 10 to 3 in a few rounds. It wasn't scripted: it emerged from local decisions under the scarcity mechanics and price drift.
Why did prices move now and not before? Originally agents replicated the reference price shown in the prompt. The fix was to let the reference drift with the supply-demand residual after each round: unfilled buys push price up, surplus pushes it down. With that, prices trend during scarcity and stay stable when the market is balanced.
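The drift rule can be sketched as a small update applied to each good after every round. This is one possible reading, not the project's formula: the step size `alpha`, the normalization by total residual volume and the price `floor` are assumptions added for illustration.

```python
def drift_price(price, unfilled_buys, unsold_units, alpha=0.05, floor=1.0):
    """Nudge the reference price by the round's supply-demand residual."""
    residual = unfilled_buys - unsold_units   # > 0: excess demand, < 0: excess supply
    volume = unfilled_buys + unsold_units
    if volume == 0:
        return price                           # balanced market: price stays put
    # Normalizing by volume bounds the step to at most alpha of the current price,
    # so one very large order cannot move the price more than a small one.
    step = alpha * residual / volume
    return max(floor, price * (1.0 + step))
```

Because the next prompt shows the drifted reference, agents that anchor on it now follow the market instead of freezing it.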
Representative metrics (15-turn run)
Metric: Result
Valid JSON actions: 100% (75 of 75 calls)
Trades per turn: sustained 3 to 9, never silent
Price of honey: crashed 10 to 3 during the run legend
Price of firewood: rose 4 to 7 when the winter shortage hit
Wealth gap (Gini): widened 0.14 to 0.38
Outcome: the woodcutter ended up wealthier, the hoarder went bankrupt
Those movements are documented in the open traces: every decision, prompt and parse is recorded for reproduction or study.
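For readers who want to recompute the wealth gap from the traces, the Gini coefficient over the creatures' holdings can be calculated with the standard sorted-values formula. The function below is a generic sketch; how the project values each creature's holdings (pebbles only, or pebbles plus goods) is not stated here.

```python
def gini(wealth):
    """Gini coefficient of non-negative holdings: 0.0 means perfect equality."""
    values = sorted(wealth)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

With five agents the maximum possible value is 0.8, reached when one creature holds everything.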
Architecture and design patterns (practical)
Served with vLLM on Modal for manageable latency and cost.
Frontend in Gradio for rapid iteration and accessibility.
Batched inference: all agents are evaluated in a single GPU call per turn. That makes a real-time simulation viable with a small model.
Parse-and-repair: each response goes through a JSON validator. If it fails, the action becomes a no-op and is logged for analysis.
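Price drift: after each round the reference price moves with the supply-demand residual, scaled by a step size alpha. With a small alpha (for example 0.05) prices trend smoothly without being thrown off by a single large order.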
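The batched-inference item in the list above can be sketched as a turn function that collects every agent's prompt and makes exactly one backend call. The `generate_batch` callable is a hypothetical stand-in for whatever wraps the vLLM endpoint; it is injected so the sketch runs without a GPU.

```python
from typing import Callable, Dict, List

# Hypothetical backend signature: takes all prompts and returns one completion
# per prompt, in the same order.
GenerateBatch = Callable[[List[str]], List[str]]


def run_turn(prompts_by_agent: Dict[str, str],
             generate_batch: GenerateBatch) -> Dict[str, str]:
    """Send every agent's prompt in a single batched call and map results back by name."""
    names = list(prompts_by_agent)
    prompts = [prompts_by_agent[n] for n in names]
    completions = generate_batch(prompts)   # one call for the whole council
    if len(completions) != len(prompts):
        raise ValueError("backend must return one completion per prompt")
    return dict(zip(names, completions))
```

Keeping the backend behind a plain callable also makes the turn loop easy to test with a fake model.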
Technical and practical reflections
The takeaway for builders using small models is clear: 3B models are excellent format generators and very practical for real-time multi-agent systems, but their reasoning needs structure. In practice that means:
Design scarcity if you want interaction.
Restrict the decision space with precise prompts and examples.
Accept and handle imperfect responses with tolerant parsing.
Put consequences in prices and status, not irreversible failures.
Small models, big adventures: you don't need a huge LLM to observe convincing market behavior. Sometimes economic history already contains the drama; you just need a system with well-designed limits to play it out.