MolmoPoint improves pointing in vision-language models | Keryc
MolmoPoint proposes changing the way vision-language models indicate places in an image or video. What if, instead of forcing the model to write coordinates as text, you let it point directly on its own visual representation? That's exactly what MolmoPoint does, and it brings clear gains in accuracy, efficiency, and robustness.
What is MolmoPoint and why it changes pointing
Grounding or pointing is key: without it, a model only describes images; with it, it can say exactly where something is. Think of a robot that needs to pick up a mug, an assistant that must point to a button in an app, or a system that counts objects in a video. MolmoPoint doesn't ask for coordinates as text. Instead, it lets the model select parts of its own visual features.
The core idea is simple but powerful: use grounding tokens that act like queries over the model's internal visual representations. That avoids teaching the model an artificial coordinate system, reduces the number of output tokens, and improves stability when resolution changes.
Architecture and technical details
MolmoPoint introduces a coarse-to-fine mechanism built around three special tokens: <PATCH>, <SUBPATCH> and <LOCATION>.
First, the model attends over visual tokens to choose a coarse patch (<PATCH>).
Then it refines that selection to a finer subpatch using lower-level features (<SUBPATCH>).
Finally, it predicts a location within the subpatch with <LOCATION>.
That flow ties the pointing output directly to internal visual embeddings, instead of translating everything into external textual coordinates.
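The three-step flow above can be sketched in toy numpy code. This is my illustrative reconstruction, not AllenAI's implementation: the dimensions, the random stand-ins for learned queries, and the sigmoid location head are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 576 coarse patches, each split into 16 subpatches.
NUM_PATCHES, SUBS_PER_PATCH, DIM = 576, 16, 64

patch_feats = rng.standard_normal((NUM_PATCHES, DIM))                # coarse visual tokens
sub_feats = rng.standard_normal((NUM_PATCHES, SUBS_PER_PATCH, DIM))  # lower-level features

def select(query, keys):
    """Attention-style selection: softmax over query-key scores, return the argmax."""
    scores = keys @ query / np.sqrt(DIM)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(probs.argmax()), probs

# <PATCH>: choose a coarse patch with a query vector (random stand-in for a learned one).
patch_query = rng.standard_normal(DIM)
patch_idx, _ = select(patch_query, patch_feats)

# <SUBPATCH>: refine within the chosen patch using its finer features.
sub_query = rng.standard_normal(DIM)
sub_idx, _ = select(sub_query, sub_feats[patch_idx])

# <LOCATION>: regress a normalized (x, y) offset inside the chosen subpatch.
loc_head = rng.standard_normal((DIM, 2))
offset = 1 / (1 + np.exp(-(sub_feats[patch_idx, sub_idx] @ loc_head)))  # in (0, 1)^2

print(patch_idx, sub_idx, offset)
```

The key design point survives even in this sketch: every decision is a query over the model's own visual embeddings, so no textual coordinate system ever enters the output.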
Rotary embeddings and explicit stop
MolmoPoint uses rotary embeddings to encode the distance between each candidate patch and the previously selected one. This keeps the pointing order consistent and prevents the model from pointing at the same spot twice.
It also incorporates a "no-more-points" class that lets the model signal there are no more relevant points, rather than being forced to pick another patch.
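A minimal numpy sketch of both ideas, rotating candidate keys by their distance to the last selected patch and letting a dedicated stop embedding compete with the patches. This is a conceptual illustration under my own assumptions (1D distances, a tiny dimension, random vectors), not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8  # even, since rotary embeddings rotate pairs of dimensions

def rope(x, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by an angle proportional to pos."""
    half = DIM // 2
    freqs = base ** (-np.arange(half) / half)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Candidate patch keys, tagged with their grid distance to the last selected patch.
keys = rng.standard_normal((5, DIM))
dists = np.array([0, 1, 2, 3, 4])  # 0 = the patch just pointed at
query = rng.standard_normal(DIM)

# Encode relative distance into each key, then score as usual.
scores = np.array([rope(keys[i], dists[i]) @ query for i in range(len(keys))])

# "no-more-points": a stop embedding competes with all patch candidates.
stop_key = rng.standard_normal(DIM)
scores = np.append(scores, stop_key @ query)
choice = int(scores.argmax())
done = choice == len(keys)  # True => the model signals it has finished pointing
```

Because distance 0 leaves a key unrotated, the patch just selected is scored under a systematically different rotation than far-away candidates, which is how the relative encoding can discourage repeated selections.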
Technical advantages
Fewer output tokens per point: drops from 8 tokens to 3 tokens per point.
More robust to resolution changes, because pointing is done over the same visual embeddings used for recognition.
Faster and easier learning: in small setups, MolmoPoint beats the baseline with only 8,192 training examples.
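The first advantage is easy to quantify with the figures quoted above. A back-of-envelope calculation (the per-image point count is taken from the GUISyn average of 54 points per image):

```python
# Output-token cost of pointing, per the figures in the text.
TEXT_TOKENS_PER_POINT = 8       # coordinates written out as text
GROUNDING_TOKENS_PER_POINT = 3  # <PATCH> <SUBPATCH> <LOCATION>

def output_tokens(num_points, per_point):
    return num_points * per_point

points = 54  # average points per image in MolmoPoint-GUISyn
saved = (output_tokens(points, TEXT_TOKENS_PER_POINT)
         - output_tokens(points, GROUNDING_TOKENS_PER_POINT))
print(saved)  # 54 * (8 - 3) = 270 fewer output tokens per dense GUI image
```

For dense pointing tasks, that 62.5% reduction in output length compounds across every image in a batch.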
Models and data
AllenAI releases three main models and two key data resources:
MolmoPoint-8B: general for images and video.
MolmoPoint-GUI-8B: specialized in software interfaces (screens, apps, web).
MolmoPoint-Vid-4B: optimized for video.
New data:
MolmoPoint-GUISyn: a synthetic dataset of ~36,000 high-resolution screenshots with over 2 million annotated points (average 54 points per image). It was created by asking an LLM to produce HTML that simulates real software, rendering with Playwright, and extracting bounding boxes per element.
MolmoPoint-TrackData: an extension of Molmo2-VideoPoint with manually annotated tracks and synthetic tracks including occlusions and complex motion dynamics.
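The last step of the GUISyn pipeline, turning per-element bounding boxes into point annotations, might look like the sketch below. The box fields match what Playwright's `bounding_box()` returns (`x`, `y`, `width`, `height` in pixels); everything else, including the center-point convention and normalization, is my assumption about how such a pipeline could work.

```python
def boxes_to_points(boxes, img_w, img_h):
    """Convert element bounding boxes into center points, normalized to image size."""
    points = []
    for box in boxes:
        cx = box["x"] + box["width"] / 2
        cy = box["y"] + box["height"] / 2
        points.append((round(cx / img_w, 4), round(cy / img_h, 4)))
    return points

# Two illustrative elements from a rendered 1920x1080 screenshot.
boxes = [
    {"x": 100, "y": 40, "width": 200, "height": 20},   # e.g. a menu item
    {"x": 960, "y": 520, "width": 80, "height": 32},   # e.g. a button
]
print(boxes_to_points(boxes, img_w=1920, img_h=1080))
# [(0.1042, 0.0463), (0.5208, 0.4963)]
```

Normalized coordinates keep the annotations meaningful regardless of the rendering resolution, which matters for a dataset meant to train resolution-robust pointing.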
All code, models and data are released open source.
Evaluation and results (technical summary)
MolmoPoint is evaluated against benchmarks for images, GUIs and video. Notable results:
PointBench (pointing and spatial reasoning skills): MolmoPoint-8B reaches 70.7% average accuracy, versus 68.7% for Molmo 2 (8B).
PixMo-Points: 89.2 F1 for MolmoPoint-8B, versus 85.2 for Molmo 2 (8B).
GUI grounding: MolmoPoint-GUI-8B achieves 61.1 on ScreenSpot-Pro and 70.0 on OSWorldG, leading among fully open models.
Video counting/pointing: MolmoPoint-8B wins on counting metrics and is preferred by human evaluators 59.1% of the time (excluding ties). MolmoPoint-Vid-4B achieves 58.7 close-accuracy on Molmo2-VideoCount.
Tracking: MolmoPoint-8B reaches state-of-the-art on MeViS and improves +5.7 J&F on Molmo2-Track compared to Molmo 2 (8B).
Ablation studies show the grounding tokens account for most of the improvement, while the new tracking data expands robustness to more object types and scenes.
Why this matters for real applications
What does this mean outside the lab? Better interactions and less engineering work to integrate multimodal models:
Robots: point to precise parts of an object so the robot can grab it safely.
Agents that automate software: identify and press the exact element on an interface without failing due to resolution differences.
Video analytics: track and count objects more reliably, even with occlusions and complex motion.
Multimodal conversational interfaces: the model can show exactly what it means, without ambiguity.
Also, the idea isn't exclusive to vision. In principle, the same "grounding tokens" scheme could apply to audio or text tokens to point within those domains.
MolmoPoint suggests that treating pointing as selection of internal embeddings is a more natural and practical abstraction than converting everything to textual coordinates. That simplifies learning, reduces output token costs, and improves accuracy.
Final reflection
MolmoPoint doesn't just show better numbers on benchmarks; it changes the convention on how models should connect language and vision. Here's the lesson: sometimes letting the model use its own internal representations to point is more effective than forcing it to speak in an artificial external language. For developers and teams building applications that need precise grounding, MolmoPoint offers an open, simpler, and higher-performing alternative.