MolmoWeb: open agent that automates web tasks | Keryc
MolmoWeb is an open bet on moving multimodal intelligence beyond just reading screens and into taking actions for you in the browser. Can you imagine an agent that looks at the same page you do, decides the next step, and then clicks, types or scrolls without relying on private APIs? That's what the Allen Institute announces with MolmoWeb, and they're releasing models, data and tools so you can reproduce and improve it.
What is MolmoWeb
MolmoWeb is a visual web agent built on the Molmo 2 family in two sizes: 4B and 8B parameters. It’s designed to be deployed in self-hosted environments, either locally or in the cloud. The flow is simple: it receives a natural language instruction, a browser screenshot and the action history, then generates a short thought in natural language explaining its reasoning and the browser action to execute.
Supported actions include navigating to URLs, clicking at normalized coordinates, typing into fields, scrolling, opening or switching tabs, and returning messages to the user. By operating directly on the browser view, the agent behaves much like a person: it interprets the visual interface and responds to what it sees.
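The observation-to-action step described above can be sketched as a small data structure. These names are illustrative only, not the actual schema of the MolmoWeb release:

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    # One iteration of the agent loop: what it saw, thought, and did.
    instruction: str   # the user's natural-language task
    screenshot: bytes  # raw browser screenshot
    history: list      # prior actions in this episode
    thought: str = ""  # short natural-language reasoning the model emits
    action: dict = field(default_factory=dict)  # e.g. {"type": "click", "x": 0.42, "y": 0.17}

step = AgentStep(
    instruction="Find the cheapest flight to Lisbon",
    screenshot=b"",  # placeholder: a real run captures the live page
    history=[],
)
step.thought = "The search box is visible; I should type the destination."
step.action = {"type": "type", "text": "Lisbon", "x": 0.5, "y": 0.12}
```

Each step appends to the history, so the model always sees the instruction, the current screenshot and everything it has done so far.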
How it works technically
MolmoWeb doesn’t use structured representations like HTML to decide actions. Instead it works with screenshots, which has practical advantages: a screenshot consumes many fewer tokens than a serialized page, visual interfaces are more stable against DOM changes, and it’s easier to interpret and debug the agent’s reasoning.
The model follows a look-decide-do loop: it observes the screen, produces a thought in natural language that explains its reasoning, and emits the next action. Click coordinates are represented as normalized values and converted to pixels when the action is executed.
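The normalized-to-pixel conversion mentioned above is straightforward; a minimal sketch, assuming the viewport size is known at execution time:

```python
def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    # Convert normalized [0, 1] click coordinates to pixel positions
    # for the current viewport, clamping so the click stays in frame.
    px = min(max(round(nx * (width - 1)), 0), width - 1)
    py = min(max(round(ny * (height - 1)), 0), height - 1)
    return px, py

# A click near the center of a 1280x800 viewport:
to_pixels(0.5, 0.5, 1280, 800)
```

Normalized coordinates keep the model's output independent of screen resolution, which is part of why the visual representation is robust across devices.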
Important from a technical standpoint: MolmoWeb was trained without distillation from proprietary agents. The data comes from synthetic trajectories generated by agents that use accessibility trees and from human demonstrations.
MolmoWebMix: the open dataset
One of the central contributions is MolmoWebMix, an open dataset designed to train multimodal web agents. It combines:
Human demonstrations: 30,000 human trajectories captured via a Chrome extension, covering 590,000 subtasks across 1,100 websites. To date, it is the largest public dataset of human web task execution.
Synthetic trajectories: trajectories generated automatically by agents that operate over accessibility trees. These include individual runs filtered by success, multi-agent pipelines that decompose tasks, and deterministic crawls that explore links.
GUI perception data: data for teaching the model to locate elements on screen and answer questions about screenshots. The screenshot-QA portion alone adds up to more than 2.2 million question-answer pairs extracted from about 400 sites.
MolmoWebMix ships with collection and filtering tools and a technical report that details methodology and cleaning criteria.
Results and benchmarks
MolmoWeb was evaluated on four benchmarks that require interaction with live sites: WebVoyager, Online-Mind2Web, DeepShop and WebTailBench. A VLM-based judge determines whether each task was completed.
Key performance highlights:
MolmoWeb 8B reaches 78.2% on WebVoyager, 42.3% on DeepShop and 49.5% on WebTailBench, outperforming leading open models like Fara-7B.
The 4B model, despite its size, beats Fara-7B in several conditions and keeps an edge even with limited step budgets.
In visual grounding, a dedicated 8B model trained with MolmoWeb data outperforms proprietary and open systems on ScreenSpot and ScreenSpot v2.
Test-time scaling: launching multiple independent rollouts and choosing the best result greatly improves reliability. For example, with pass@4 the 8B reaches 94.7% on WebVoyager and 60.5% on Online-Mind2Web, versus 78.2% and 35.3% with a single attempt.
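The pass@k numbers above measure the chance that at least one of k independent rollouts succeeds. A standard unbiased way to estimate it from n rollouts per task, c of which succeeded, is the combinatorial estimator popularized by code-generation evals; the MolmoWeb report may compute it differently, so treat this as a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn without
    # replacement from n rollouts with c successes, is a success.
    if n - c < k:
        return 1.0  # not enough failures to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# A task where 2 of 8 rollouts succeeded, evaluated at k=4:
pass_at_k(8, 2, 4)
```

Averaging this quantity over all tasks gives the benchmark-level pass@k, which is why it can rise sharply above the single-attempt score when failures are uncorrelated.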
These results are notable because they compete with agents that use richer representations or much larger models.
Limitations and safety
MolmoWeb has clear limitations you should know about:
Being purely vision-based, it can fail to read text in screenshots.
Incorrect actions can derail execution — for example, scrolling before content finishes loading.
Ambiguous or highly constrained instructions reduce performance. Complex actions like drag-and-drop or scrolling inside an element remain challenging.
It wasn’t trained to handle logins or financial transactions for safety and privacy reasons.
Security measures in the hosted demo include a whitelist of sites, use of the Google Cloud Natural Language API to filter unsafe queries, field-type checks before typing and blocking actions in password or card fields. Those restrictions apply to the demo, not the model itself.
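The field-type guard is the simplest of those checks to illustrate. A minimal sketch with an invented set of sensitive field types; the demo's real filter also relies on external moderation services:

```python
# Illustrative set of field types the demo refuses to type into.
SENSITIVE_TYPES = {"password", "card-number", "cvc"}

def allow_typing(field_type: str) -> bool:
    # Block typing when the field's declared type suggests
    # credentials or payment data, per the demo's restrictions.
    return field_type.lower() not in SENSITIVE_TYPES

allow_typing("search")    # permitted
allow_typing("password")  # blocked
```

Because this guard lives in the demo harness rather than in the model weights, a self-hosted deployment has to implement its own equivalent checks.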
What you can do with MolmoWeb today
The whole stack is published on Hugging Face and GitHub: weights, the MolmoWebMix dataset, evaluation tools and an inference library to run locally. The training code will arrive soon, allowing you to reproduce the full pipeline.
Practical applications:
Automate repetitive browser tasks on a schedule.
Run templated queries across multiple sites for price monitoring or data collection.
Chain simple steps into longer workflows where each step depends on the browser’s real state.
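Chaining steps like that can be as simple as feeding each instruction to the agent in turn; a toy sketch with a stub agent standing in for a real MolmoWeb endpoint:

```python
def run_workflow(agent, steps):
    # Execute a list of instructions sequentially; each step acts on
    # the browser state left behind by the previous one.
    results = []
    for instruction in steps:
        results.append(agent(instruction))
    return results

# Stub agent that just echoes the task; a real one would drive a browser.
run_workflow(lambda task: f"done: {task}",
             ["open the shop", "search for laptops", "read the first price"])
```

A production version would also check each step's result before proceeding, since, as noted above, a single incorrect action can derail the rest of the run.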
Also, because it's open source, you can fine-tune the model on your own data and adapt the agent to specific use cases. For researchers, an open pipeline means experimenting with new architectures, more data or improved safety mechanisms.
Final thoughts
MolmoWeb isn’t a perfect solution, but it’s an important step: moving from models that only describe screens to models that act on them — and doing so openly and reproducibly. Opening weights, data and tools accelerates research and makes it easier for the community to tackle hard questions about safety, ethics and norms for web use.
Are we ready for agents that browse the web on our behalf? MolmoWeb gives us an open foundation to explore that question with transparency.