When Ai2 launched Molmo, they bet on something concrete: openness. Not just open weights and code, but models you can inspect, adapt, and replicate. That bet has now grown into an ecosystem that not only sees, but also points and acts in the digital and physical world.
MolmoPoint: more efficient cross-modal pointing
Pointing sounds simple, right? But for a vision-and-language model (VLM), doing it well is surprisingly hard. The classic approach turns an (x, y) coordinate into text, an indirect and brittle shortcut that demands very specific training mixes and lots of data.
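To make the contrast concrete, here is a purely illustrative Python sketch of the coordinates-as-text approach: the model has to emit a well-formed coordinate string, and the caller has to parse it back out. The tag format and values below are hypothetical, not any model's actual output schema; the point is that one malformed token breaks the whole pipeline.

```python
import re

# Hypothetical "coordinates as text" output: the model generates a tagged
# string, and downstream code must parse the numbers back out of it.
generated = 'The close button is at <point x="312" y="87">close</point>.'

match = re.search(r'<point x="(\d+)" y="(\d+)">', generated)
if match:
    x, y = int(match.group(1)), int(match.group(2))
    print(f"parsed point: ({x}, {y})")
else:
    # A common failure mode: any malformed or truncated tag means no point at all.
    print("no point found")
```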
MolmoPoint changes the game: instead of generating coordinates as text, the model directly selects from what it already sees. First it picks a coarse region, then it refines down to the exact point. It's a cross-modal solution: the same mechanism can point within images and video sequences, or even at fragments of other inputs such as text or audio.
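Below is a minimal sketch of what coarse-to-fine selection can look like, written in plain PyTorch. The grid sizes, embedding dimension, and dot-product scoring rule are illustrative assumptions, not MolmoPoint's actual architecture; the sketch only shows the two-stage idea of picking a coarse patch first and then refining to a sub-patch within it.

```python
import torch
import torch.nn.functional as F

# Coarse-to-fine pointing by direct selection (illustrative sketch).
# All shapes, names, and the scoring rule are assumptions for the example.
torch.manual_seed(0)

H = W = 32          # coarse grid of visual patch tokens (32x32 patches)
R = 4               # each coarse cell is refined into a 4x4 sub-grid
d = 256             # embedding dimension

patch_emb = torch.randn(H, W, d)   # visual tokens the model already computed
query_emb = torch.randn(d)         # embedding of the pointing request

# Stage 1: score every coarse patch and pick the most likely region.
coarse_scores = patch_emb @ query_emb                      # (H, W)
coarse_probs = F.softmax(coarse_scores.flatten(), dim=0).view(H, W)
cy, cx = divmod(coarse_probs.argmax().item(), W)

# Stage 2: refine within the selected cell using finer-grained features.
fine_emb = torch.randn(R, R, d)    # sub-patch features of cell (cy, cx)
fine_scores = fine_emb @ query_emb                         # (R, R)
fy, fx = divmod(fine_scores.flatten().argmax().item(), R)

# Convert the (coarse, fine) selection into normalized image coordinates.
y = (cy + (fy + 0.5) / R) / H
x = (cx + (fx + 0.5) / R) / W
print(f"predicted point (normalized): x={x:.3f}, y={y:.3f}")
```

Because the prediction is a selection over visual tokens the model has already computed, there is no text decoding step to go wrong, and precision can be increased simply by refining further within the chosen cell.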
What do you get? Big gains in training efficiency and in benchmark results: better pointing accuracy, improved detection of on-screen elements, and stronger object tracking compared to similar open models. It works especially well at high resolution and on interfaces full of tiny buttons. MolmoPoint ships with variants for images, video, and UIs, plus open datasets with thousands of annotated screenshots and human tracks.
