When Ai2 launched Molmo, they bet on something concrete: openness. Not just open weights and code, but models you can inspect, adapt, and replicate. That bet has now grown into an ecosystem that not only sees, but also points and acts in the digital and physical world.
MolmoPoint: more efficient cross-modal pointing
Pointing sounds simple, right? But for a vision-and-language model (VLM), doing it well is surprisingly hard. The classic approach turns an (x, y) coordinate into text, an indirect and brittle shortcut that demands very specific training mixes and lots of data.
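To make the contrast concrete, here is a purely illustrative Python sketch of the coordinates-as-text approach: the model has to emit a well-formed coordinate string, and the caller has to parse it back out. The tag format and values below are hypothetical, not any model's actual output schema; the point is that one malformed token breaks the whole pipeline.

```python
import re

# Hypothetical "coordinates as text" output: the model generates a tagged
# string, and downstream code must parse the numbers back out of it.
generated = 'The close button is at <point x="312" y="87">close</point>.'

match = re.search(r'<point x="(\d+)" y="(\d+)">', generated)
if match:
    x, y = int(match.group(1)), int(match.group(2))
    print(f"parsed point: ({x}, {y})")
else:
    # A common failure mode: any malformed or truncated tag means no point at all.
    print("no point found")
```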
MolmoPoint changes the game: instead of generating coordinates as text, the model directly selects from what it already sees. First it picks a coarse region, then it refines down to the exact point. It's a cross-modal solution: the same mechanism can point within images and video sequences, or even at fragments of other inputs such as text or audio.
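Below is a minimal sketch of what coarse-to-fine selection can look like, written in plain PyTorch. The grid sizes, embedding dimension, and dot-product scoring rule are illustrative assumptions, not MolmoPoint's actual architecture; the sketch only shows the two-stage idea of picking a coarse patch first and then refining to a sub-patch within it.

```python
import torch
import torch.nn.functional as F

# Coarse-to-fine pointing by direct selection (illustrative sketch).
# All shapes, names, and the scoring rule are assumptions for the example.
torch.manual_seed(0)

H = W = 32          # coarse grid of visual patch tokens (32x32 patches)
R = 4               # each coarse cell is refined into a 4x4 sub-grid
d = 256             # embedding dimension

patch_emb = torch.randn(H, W, d)   # visual tokens the model already computed
query_emb = torch.randn(d)         # embedding of the pointing request

# Stage 1: score every coarse patch and pick the most likely region.
coarse_scores = patch_emb @ query_emb                      # (H, W)
coarse_probs = F.softmax(coarse_scores.flatten(), dim=0).view(H, W)
cy, cx = divmod(coarse_probs.argmax().item(), W)

# Stage 2: refine within the selected cell using finer-grained features.
fine_emb = torch.randn(R, R, d)    # sub-patch features of cell (cy, cx)
fine_scores = fine_emb @ query_emb                         # (R, R)
fy, fx = divmod(fine_scores.flatten().argmax().item(), R)

# Convert the (coarse, fine) selection into normalized image coordinates.
y = (cy + (fy + 0.5) / R) / H
x = (cx + (fx + 0.5) / R) / W
print(f"predicted point (normalized): x={x:.3f}, y={y:.3f}")
```

Because the prediction is a selection over visual tokens the model has already computed, there is no text decoding step to go wrong, and precision can be increased simply by refining further within the chosen cell.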
What do you get? Big gains in training efficiency and in benchmark results: better pointing accuracy, improved detection of on-screen elements, and stronger object tracking compared to similar open models. It works especially well at high resolution and on interfaces full of tiny buttons. MolmoPoint ships with variants for images, video, and UIs, plus open datasets with thousands of annotated screenshots and human tracks.
