I asked an agent to build a site with Paris monuments in 3D from images. I didn't open an image generator. I didn't touch a 3D reconstruction tool. The agent called two Hugging Face Spaces and assembled everything: images, reconstructions as Gaussian splats, compression, a viewer and static deployment. Sounds magical? It's block-based engineering, and it's already here.
What agents.md does and why it matters
Until now, the hard part wasn't really training a good image, video, TTS or 3D model. The real problem was integration: SDKs, weights, GPUs, input formats, polling. What if each model were a documented, easily invocable block? Could an agent just glue them together like npm packages?
That's exactly what agents.md delivers in a Gradio Space: the minimal recipe an agent needs to invoke that service. Running curl https://huggingface.co/spaces/VAST-AI/TripoSplat/agents.md returns everything in one go: the schema URL, call and poll templates, how to upload files and the auth hint. With that, an agent can use the Space end-to-end.
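As a minimal sketch, the same fetch can be done from Python with the standard library alone. The helper names agents_md_url and fetch_agents_md are hypothetical; only the URL pattern comes from the curl command above, and sending HF_TOKEN as a Bearer header on this request is an assumption that probably matters only for private Spaces.

```python
import os
import urllib.request

HUB = "https://huggingface.co"

def agents_md_url(space_id: str) -> str:
    """Build the agents.md URL for a Space id such as 'VAST-AI/TripoSplat'."""
    owner, _, name = space_id.partition("/")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name', got {space_id!r}")
    return f"{HUB}/spaces/{owner}/{name}/agents.md"

def fetch_agents_md(space_id: str, timeout: float = 30.0) -> str:
    """Download the agents.md text for a Space (performs a network request)."""
    request = urllib.request.Request(agents_md_url(space_id))
    token = os.environ.get("HF_TOKEN")
    if token:
        # Assumption: the token is only needed for private or gated Spaces.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_agents_md("VAST-AI/TripoSplat") should return the same text as the curl command, which can then be pasted into an agent's context.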
Practical example: the gallery with TripoSplat
The author put an agent to work that chained two Spaces: one to generate images and another to reconstruct 3D from a single view.
- Image generation: an image Space (for example ideogram4) produces isolated views on a black background, ready for reconstruction.
- 3D reconstruction: VAST-AI/TripoSplat takes each image and generates a Gaussian splat in .ply format.
From there the agent did the automatic "glue":
- It detected that outputs were Y-down and rotated them upright.
- Auto-framed each monument and cropped according to composition.
- Compressed the .ply files to .ksplat (about 3x smaller) for fast browser loads.
- Built a viewer with Three.js: scroll to change model, drag to rotate, cinematic transitions.
- Deployed everything as a static Space.
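The first two glue steps are simple geometry. The sketch below is not the agent's actual code: it assumes splat centers are available as (x, y, z) tuples, turns a Y-down cloud upright with a 180 degree rotation about the X axis, and picks a camera distance so the bounding sphere fits a given field of view. The function names, the 45 degree default and the 1.2 margin are illustrative assumptions. A real splat file also stores a per-Gaussian orientation that would need the same rotation.

```python
import math

def rotate_upright(points):
    """Rotate 180 degrees about the X axis: (x, y, z) -> (x, -y, -z)."""
    return [(x, -y, -z) for x, y, z in points]

def frame(points, fov_degrees=45.0, margin=1.2):
    """Return (center, camera_distance) so the bounding sphere fits the view."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # Center of the axis-aligned bounding box.
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Radius of the sphere around that center that contains every point.
    radius = max(math.dist(center, p) for p in points)
    # A sphere of this radius just touches the view cone at this distance.
    distance = margin * radius / math.sin(math.radians(fov_degrees) / 2)
    return center, distance
```

A rotation is used rather than a plain Y flip because a flip alone would mirror the monument.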
The only human decisions were matters of taste: 'more zoom', 'swap the obelisk for another shape', 'shorten the transition'. The rest was automatic iteration: the agent reacted when a glass pyramid splatted badly or when the reconstruction inferred the back from a single view.
Which endpoints and formats does an agent use?
The pattern is simple and repeatable. An agents.md describes things like:
- GET .../gradio_api/info (schema)
- POST .../gradio_api/call/v2/{endpoint} with a body like {param_name: value, ...} for calls
- GET .../gradio_api/call/{endpoint}/{event_id} to poll for results
- POST .../gradio_api/upload -F 'files=@file.ext' to upload files
- Auth: Bearer $HF_TOKEN
You don't need a client library or hardcoded integrations. An agent reads the agents.md, inserts its HF_TOKEN, and can orchestrate flows.
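To make the call-and-poll pattern concrete, here is a minimal standard-library sketch. The paths follow the templates listed above; everything else is an assumption for illustration: the helper names, the base URL and endpoint name used in the usage note, and the idea that the poll endpoint answers with a server-sent event stream whose data lines carry JSON. The agents.md of the Space you are calling is the authority on the actual shapes.

```python
import json
import urllib.request

def build_call_request(base_url, endpoint, params, token=None):
    """Build the POST request that starts a call (path taken from the template above)."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url}/gradio_api/call/v2/{endpoint}",
        data=json.dumps(params).encode("utf-8"),
        headers=headers,
        method="POST",
    )

def poll_url(base_url, endpoint, event_id):
    """URL to poll for the result of a previously started call."""
    return f"{base_url}/gradio_api/call/{endpoint}/{event_id}"

def parse_sse(text):
    """Parse a server-sent event stream into (event, data) pairs.

    Assumption: each event is an 'event:' line plus 'data:' lines, closed by a
    blank line. Data is JSON-decoded when possible, otherwise kept as a string.
    """
    events = []
    event, data_lines = None, []
    for line in text.splitlines() + [""]:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and (event is not None or data_lines):
            raw = "\n".join(data_lines)
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                data = raw
            events.append((event, data))
            event, data_lines = None, []
    return events
```

In use, an agent would pass the request from build_call_request to urllib.request.urlopen, read the event id from the response, then fetch poll_url(...) and feed the body to parse_sse until a completion event arrives.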
Technical and product implications
This is not just a neat trick. It follows the "building block economy" logic Mitchell Hashimoto described: the most effective way to build software today is to orchestrate small, well-documented components, not reinvent polished monoliths.
In multimedia this changes several things:
- Lower technical barrier: assembling image → 3D → streaming pipelines no longer requires installing and adapting each model.
- Faster iteration: an agent can try combinations, detect failures (formats, orientations, artifacts) and fix them without constant human intervention.
- Reuse and composition: a Space's outputs become another Space's inputs with minimal friction.
Technically, this pushes architectures toward modular orchestration: agents that interpret descriptions (agents.md), trigger endpoints, handle polling and transformations (rotate, recompress, convert formats), and deploy results.
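One way to picture that modular orchestration is a list of steps, each with a check that decides whether its output is usable before the next block runs. This is only a sketch under stated assumptions: the Step and run_pipeline names are hypothetical, and in a real pipeline run would wrap a Space call or a local transformation while check would catch things like a wrong format or orientation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]     # transforms the previous step's output
    check: Callable[[Any], bool]  # returns True when the output is usable

def run_pipeline(steps: List[Step], payload: Any, max_attempts: int = 2) -> Any:
    """Run steps in order, retrying a step whose output fails its check."""
    for step in steps:
        for _ in range(max_attempts):
            result = step.run(payload)
            if step.check(result):
                payload = result  # output becomes the next step's input
                break
        else:
            # No attempt passed the check: stop and surface the failing block.
            raise RuntimeError(f"step {step.name!r} failed after {max_attempts} attempts")
    return payload
```

Keeping the check next to each step is what lets an agent localize a failure to one block instead of re-running the whole chain.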
How to try it yourself
- Copy the agents.md from a Space that interests you: curl https://huggingface.co/spaces/ideogram-ai/ideogram4/agents.md.
- Copy the agents.md from TripoSplat: curl https://huggingface.co/spaces/VAST-AI/TripoSplat/agents.md.
- Paste the links into your preferred code agent (Claude Code, etc.), add HF_TOKEN and ask it to assemble a pipeline: image → TripoSplat → web viewer.
The Space repository contains reproducible scripts that show exactly the calls the agent made. It's a great template to experiment with.
In the end, building multimedia software stops being about setting up infra and becomes orchestration design: define the blocks, their transformations and the rules that connect them.
Next time you see an impressive demo on the web, ask yourself: was it a polished monolith or an orchestra of well-documented blocks? I bet on the latter.
