MolmoMotion: 3D motion prediction guided by language | Keryc
MolmoMotion proposes something that sounds simple but is powerful: predicting in 3D how points on an object will move from a single image, a language instruction, and a few query points. Why does that matter? Because anticipating is different from perceiving: seeing what already happened is useful, but planning and generation need to know what comes next.
What is MolmoMotion
MolmoMotion predicts 3D trajectories of points attached to an object in a metric world frame. The typical input is:
an RGB observation (one or a few frames),
a text description of the action (for example, “Move and rotate the wooden bowl with fruit on the table”),
and a set of query points with their initial 3D positions.
The model returns the future trajectory of each point in 3D coordinates, ready to feed a robot planner or condition a video generator. To link vision and language, the authors use Molmo 2 as the backbone, combining image tokens, text tokens and 2D feature tokens sampled from the visual encoder.
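To make the input and output concrete, here is a minimal sketch of what a request and a prediction could look like. Everything in it is an assumption for illustration: the `MotionQuery` class, its field names, the `trajectory[t][i]` layout and the `hold_still` stand-in are hypothetical and are not the released API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # metres, world frame (assumed unit)


@dataclass
class MotionQuery:
    """Hypothetical request container; field names are illustrative only."""
    frames: List[str]            # paths to one or a few RGB frames
    instruction: str             # text description of the action
    query_points: List[Point3D]  # initial 3D position of each query point

    def validate(self) -> None:
        if not self.frames:
            raise ValueError("at least one RGB frame is required")
        if not self.instruction.strip():
            raise ValueError("instruction must be non-empty")
        if not self.query_points:
            raise ValueError("at least one query point is required")


# Assumed output layout: trajectory[t][i] is point i at future step t.
Trajectory = List[List[Point3D]]


def hold_still(query: MotionQuery, horizon: int) -> Trajectory:
    """Stand-in for the model: every point stays where it started."""
    query.validate()
    return [list(query.query_points) for _ in range(horizon)]


query = MotionQuery(
    frames=["frame_000.png"],
    instruction="Move and rotate the wooden bowl with fruit on the table",
    query_points=[(0.10, 0.02, 0.55), (0.14, 0.02, 0.57)],
)
trajectory = hold_still(query, horizon=4)
```

A real predictor would replace `hold_still`; the point is only that the output is one 3D position per query point per future step.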
Key idea: represent motion as 3D points “stuck” to the object. It’s compact, class-agnostic and robust to camera changes.
Representation and why they chose 3D points
MolmoMotion uses a deliberate representation: a small set of surface points in the world frame. They wanted three properties:
Class-agnostic: the representation doesn’t rely on object templates (humans, hands or rigid objects),
View-stable: the same physical trajectory stays consistent across different views,
Directly usable: these are 3D trajectories a robot or a video generator can consume.
This representation captures rigid motion, articulated motion and, within limits, deformable motion, without assuming the object type. Because it’s compact and explicit, you avoid the cost of rendering full video when you only care about dynamics.
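The view-stability property is easy to check with a toy example. The sketch below is not from the authors; it assumes a camera pose given as a rotation plus a translation (camera-to-world) and uses plain Python to stay dependency-free. The same world-frame track looks different in each camera's own coordinates, but mapping back to the world frame recovers one shared trajectory.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]


def yaw_rotation(theta: float) -> Mat3:
    """Rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def world_to_camera(p: Vec3, rot: Mat3, trans: Vec3) -> Vec3:
    """Apply R^T (p - t), where (R, t) is the camera-to-world pose."""
    d = [p[k] - trans[k] for k in range(3)]
    return tuple(sum(rot[k][r] * d[k] for k in range(3)) for r in range(3))


def camera_to_world(p: Vec3, rot: Mat3, trans: Vec3) -> Vec3:
    """Apply R p + t."""
    return tuple(sum(rot[r][k] * p[k] for k in range(3)) + trans[r] for r in range(3))


# One physical track: a point sliding 20 cm along the world x axis.
world_track = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]

# Two made-up camera poses looking at the same scene.
cam_a = (yaw_rotation(0.0), (0.0, 0.0, -1.0))
cam_b = (yaw_rotation(math.pi / 2), (1.0, 0.0, 0.0))

seen_a = [world_to_camera(p, *cam_a) for p in world_track]
seen_b = [world_to_camera(p, *cam_b) for p in world_track]

recovered_a = [camera_to_world(p, *cam_a) for p in seen_a]
recovered_b = [camera_to_world(p, *cam_b) for p in seen_b]
```

In camera A the point moves along x, in camera B along y, yet `recovered_a` and `recovered_b` agree with `world_track` up to floating-point error. That agreement is what a world-frame representation gives you.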
Architecture and technical variants
MolmoMotion builds on Molmo 2. The general flow:
Images and text are encoded into tokens.
2D point tokens are extracted from the visual encoder.
The initial 3D coordinates of the query points are injected.
The model predicts future coordinates per point.
They train two variants:
MolmoMotion-AR (autoregressive): writes 3D coordinates as structured text, step by step in temporal order. Advantage: smooth rollouts and higher accuracy when the future is well-defined. It’s similar to coordinate prediction used by some VLMs.
MolmoMotion-FM (flow-matching): models trajectories in continuous space by transforming noise into motion. Better for uncertainty and scenarios with multiple plausible futures.
Technically, AR favors strong sequential conditioning; FM offers a continuous distribution over trajectories and captures multimodality.
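The two output styles can be contrasted with a small sketch. It rests on loud assumptions: the text format in `encode_step` is hypothetical (the article only says “structured text”), and `toy_velocity` is a hand-written field that already knows the target, standing in for a trained network. The Euler loop itself is a standard way to sample from a flow-matching model by integrating a velocity field from noise at t=0 to data at t=1.

```python
import random
import re
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]


# AR style: one time step written as structured text (hypothetical format).
def encode_step(points: List[Point3D]) -> str:
    return "; ".join(f"({x:.3f}, {y:.3f}, {z:.3f})" for x, y, z in points)


def decode_step(text: str) -> List[Point3D]:
    triples = re.findall(r"\(([-\d.]+), ([-\d.]+), ([-\d.]+)\)", text)
    return [(float(x), float(y), float(z)) for x, y, z in triples]


# FM style: integrate a velocity field from noise (t=0) to motion (t=1).
VelocityField = Callable[[List[float], float], List[float]]


def euler_sample(noise: List[float], velocity: VelocityField, steps: int) -> List[float]:
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)  # t stays below 1, so the toy field never divides by zero
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: List[float]) -> VelocityField:
    """Velocity of the straight-line path toward a known target.

    Assumption for illustration only: a trained network would replace this.
    """
    def field(x: List[float], t: float) -> List[float]:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return field


step_text = encode_step([(0.10, 0.02, 0.55), (-0.05, 0.0, 0.6)])
decoded = decode_step(step_text)

rng = random.Random(0)
target = [0.12, 0.02, 0.55]
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(noise, toy_velocity(target), steps=10)
```

With a learned field, different noise draws would land on different plausible trajectories, which is where the multimodality comes from. Here every draw ends at `target` because the toy field is deterministic about its destination.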
How they built MolmoMotion-1M (annotation pipeline)
Training required data that didn’t exist: large-scale videos with 3D point trajectories anchored to objects and paired with action descriptions. The authors built an automatic pipeline that, broadly speaking, runs four steps:
Ground the object from the description and sample query points on its surface.
Track dense 2D points over the object.
Lift those 2D tracks into a shared metric 3D frame (depth/pose estimation).
Filter: remove tracks that don’t move coherently, smooth trajectories and segment clips into windows where the object actually moves.
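The lifting and filtering steps can be sketched under simple assumptions. The code below is not the authors' pipeline: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a per-pixel metric depth and a camera-to-world pose, and it uses total path length as a stand-in for the “moves coherently” test. All function names and the threshold are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) at metric depth into the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def camera_to_world(p: Vec3, rotation: Mat3, translation: Vec3) -> Vec3:
    """Apply R p + t to move a camera-frame point into the shared world frame."""
    return tuple(
        sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
        for r in range(3)
    )


def path_length(track: List[Vec3]) -> float:
    """Sum of distances between consecutive positions, in metres."""
    return sum(math.dist(a, b) for a, b in zip(track, track[1:]))


def keep_moving(tracks: List[List[Vec3]], min_length_m: float) -> List[List[Vec3]]:
    """Drop tracks that barely move (a crude proxy for the filtering step)."""
    return [t for t in tracks if path_length(t) >= min_length_m]


# Made-up intrinsics and pose for the example.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
CAM_POS = (0.0, 0.0, 1.0)

# Two 2D tracks over three frames as (u, v, depth): one slides right, one is static.
moving_2d = [(320.0, 240.0, 2.0), (370.0, 240.0, 2.0), (420.0, 240.0, 2.0)]
static_2d = [(100.0, 100.0, 2.0)] * 3

tracks_3d = [
    [camera_to_world(unproject(u, v, d, **K), IDENTITY, CAM_POS) for u, v, d in track]
    for track in (moving_2d, static_2d)
]
kept = keep_moving(tracks_3d, min_length_m=0.05)
```

Only the sliding track survives the filter. A real pipeline would also have to cope with noisy depth and tracker drift, which this sketch ignores.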
The result: MolmoMotion-1M, extracted from 1.16M videos, with 1.16M annotated clips (the largest collection of 3D trajectories with action descriptions, per the authors), covering 736 motion types and 5.6K distinct objects.
PointMotionBench: evaluation centered on 3D points
To measure performance they created PointMotionBench, a human-validated benchmark with 2.7K clips, 111 object categories and 61 motion types. The protocol is straightforward: the model gets the current observation, the query points and the description; the metric evaluates how close the predicted trajectories are to the ground-truth 3D trajectories.
Key results:
MolmoMotion outperforms existing methods: pixel-space video generators, parametric 3D methods and a constant-velocity baseline.
When used to guide video generation, it improves motion quality across the five motion metrics they report, and beats a much larger image-to-video model on four out of five metrics.
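The evaluation idea and the constant-velocity baseline can be sketched as follows. Two assumptions to flag: the article does not give the exact metric, so the mean per-point L2 distance used here is a common choice and not necessarily the benchmark's definition; and the baseline assumes two observed positions per point are available to estimate a velocity. Function names are hypothetical.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # trajectory[t][i]: point i at future step t


def mean_l2_error(pred: Trajectory, truth: Trajectory) -> float:
    """Average Euclidean distance over all points and time steps (assumed metric)."""
    if len(pred) != len(truth) or any(len(p) != len(g) for p, g in zip(pred, truth)):
        raise ValueError("trajectories must have the same shape")
    dists = [math.dist(p, g) for ps, gs in zip(pred, truth) for p, g in zip(ps, gs)]
    if not dists:
        raise ValueError("trajectories must not be empty")
    return sum(dists) / len(dists)


def constant_velocity(prev: List[Point3D], curr: List[Point3D], horizon: int) -> Trajectory:
    """Extrapolate each point along its last observed displacement."""
    return [
        [tuple(c[k] + step * (c[k] - p[k]) for k in range(3)) for p, c in zip(prev, curr)]
        for step in range(1, horizon + 1)
    ]


# One point moving 10 cm per step along x, then veering 10 cm in y.
prev_obs = [(0.0, 0.0, 0.0)]
curr_obs = [(0.1, 0.0, 0.0)]
truth = [[(0.2, 0.0, 0.0)], [(0.3, 0.1, 0.0)]]

baseline = constant_velocity(prev_obs, curr_obs, horizon=2)
error = mean_l2_error(baseline, truth)
```

The baseline nails the first step and misses the turn in the second, giving a mean error of about 5 cm. Changes of direction like this are what a language-conditioned predictor is meant to anticipate.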
Applications in robotics and video generation
MolmoMotion isn’t just a lab toy: it transfers across environments. After fine-tuning on DROID (a large robotic manipulation dataset), the model predicts sensible trajectories for different objects, views and tasks.
One control experiment shows clear differences:
In simulation, a policy controlled with MolmoMotion reaches 76.3% success in pick-and-place vs 56.0% when the same policy uses Molmo 2.
MolmoMotion learns faster: it reaches 51% success after 10K steps, while Molmo 2 reaches 19%.
On real robots (after fine-tuning), MolmoMotion needs only ~2K steps to reach the L2 error that the baseline reaches after 12K steps.
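To put those numbers side by side, here is the arithmetic spelled out. The inputs are the figures quoted above; the variable names are illustrative, and the “~2K” step count is approximate in the source, so the last ratio is a rough figure.

```python
# Figures quoted in the article.
sim_success_molmomotion = 76.3  # % pick-and-place success in simulation
sim_success_molmo2 = 56.0
early_molmomotion = 51.0        # % success after 10K steps
early_molmo2 = 19.0
baseline_steps = 12_000
molmomotion_steps = 2_000       # the article says "~2K", so treat as approximate

gain_points = round(sim_success_molmomotion - sim_success_molmo2, 1)
final_ratio = round(sim_success_molmomotion / sim_success_molmo2, 2)
early_ratio = round(early_molmomotion / early_molmo2, 2)
step_ratio = baseline_steps / molmomotion_steps

print(f"final success: +{gain_points} points ({final_ratio}x)")
print(f"success at 10K steps: {early_ratio}x")
print(f"steps to match baseline L2 error: about {step_ratio:.0f}x fewer")
```

So the final gap is 20.3 percentage points, while the early-training gap is proportionally larger, and the real-robot result is roughly a sixfold reduction in training steps.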
Also, conditioning a video generator with 3D trajectories yields video that follows instructions more precisely, especially for small or precise motions. Want a cup nudged a few centimeters? MolmoMotion helps you get that right.
Practical limitations
It’s not perfect. During training they use eight query points per object: enough for useful trajectories, but not enough to densely represent surface geometry or complex deformable motions. That limits handling of fine deformations.
There are also annotation-quality challenges: noisy depth and tracking require careful filtering and smoothing, and some motion types remain hard to model with only a few points.
What you can do now
Download weights and data: the team publishes the model weights, the MolmoMotion-1M dataset and PointMotionBench so you can try and compare.
Use it in robotics: if you work on planning, the 3D trajectories are direct inputs and speed up policy learning.
Condition video generation: if you want fine control over motion, using MolmoMotion as a guide improves results compared to text prompts alone.
MolmoMotion is an important piece in the physical anticipation puzzle: it brings motion prediction to a 3D, generic and usable format for real systems. The next steps? Densify surfaces, handle complex deformations and keep closing the gap between predicting and acting.