MolmoMotion: predicting 3D motion guided by language | Keryc
MolmoMotion is a model that anticipates how objects will move in 3D from a single image, a set of points marked on the object's surface, and a natural-language instruction.
What is this for? Think of a robot that has to pick up a cup: it doesn’t just need to see the cup, it needs to imagine how the cup will move when grabbed. MolmoMotion predicts future trajectories of 3D points in the real world that can feed motion planning for robots or models that generate coherent moving video.
What is MolmoMotion
MolmoMotion turns an RGB observation, a list of query points on an object and an instruction (for example, "Move and rotate the wooden bowl with fruit") into future trajectories for those points in metric 3D coordinates.
The core idea is to represent motion compactly and usefully: you're not rendering full video, you're predicting how points anchored to the object will move in a shared world frame. That makes the prediction easy for other systems (robots, simulators, generative video models) to consume directly.
How it works (under the hood)
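To make the input/output contract concrete, here is a minimal sketch of the data involved. The names `MotionQuery`, `MotionForecast` and `constant_velocity_forecast` are hypothetical and not part of any released API, and the constant-velocity predictor is only a stand-in that shows the shapes; it is not how MolmoMotion predicts motion.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in meters, shared world frame


@dataclass
class MotionQuery:
    """Hypothetical container for the inputs described above."""
    image_size: Tuple[int, int]   # (width, height) of the RGB observation
    query_points: List[Point3D]   # initial 3D positions of the object's query points
    instruction: str              # natural-language instruction


@dataclass
class MotionForecast:
    """Hypothetical output: one future trajectory per query point."""
    trajectories: List[List[Point3D]]  # indexed as [point][future step]


def constant_velocity_forecast(query: MotionQuery, velocity: Point3D,
                               steps: int, dt: float) -> MotionForecast:
    """Placeholder predictor (NOT the model): move every point at a fixed velocity."""
    trajectories = [
        [tuple(p[i] + velocity[i] * dt * k for i in range(3))
         for k in range(1, steps + 1)]
        for p in query.query_points
    ]
    return MotionForecast(trajectories)


query = MotionQuery(
    image_size=(640, 480),
    query_points=[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
    instruction="Move and rotate the wooden bowl with fruit",
)
forecast = constant_velocity_forecast(query, velocity=(0.1, 0.0, 0.0), steps=3, dt=0.5)
```

Whatever produces the forecast, a downstream consumer only needs this structure: a list of metric 3D positions per query point, per future step.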
MolmoMotion uses Molmo 2 as the backbone to connect language, vision and 3D points. The overall flow is:
Identify the object referred to by the instruction within the frame.
Locate the query points and their initial 3D positions.
Condition the prediction on a short video history and the language instruction.
Generate the future trajectory of each point in the world frame.
Motion representation
The representation is a set of 3D points tied to the object. It was chosen for three properties:
Class-agnostic: it doesn’t rely on templates for humans or object categories.
View-stable: the same physical trajectory stays consistent under camera changes.
Directly usable: trajectories plug into control systems or generative models without complex conversion.
A sparse set of points can describe rigid objects, articulated parts and, to some degree, deformable ones.
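The view-stability property can be illustrated with a small sketch. It assumes a hypothetical setup in which each camera has a known pose (rotation `R`, translation `t`) in the world frame; the same physical track looks different in each camera's coordinates, but mapping back to the world frame recovers one shared trajectory.

```python
import math


def rot_y(theta):
    """3x3 rotation about the vertical (y) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def world_to_camera(p, R, t):
    """Express world point p in a camera with pose (R, t): p_cam = R^T (p - t)."""
    d = [p[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]


def camera_to_world(p, R, t):
    """Inverse mapping: p_world = R p_cam + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


# A point sliding 10 cm per step along x, in world coordinates (meters).
world_track = [[0.1 * k, 0.0, 1.0] for k in range(4)]

# Two hypothetical camera poses observing the same motion.
cam_a = (rot_y(0.0), [0.0, 0.0, 0.0])
cam_b = (rot_y(math.pi / 2), [1.0, 0.0, 0.0])

track_in_a = [world_to_camera(p, *cam_a) for p in world_track]
track_in_b = [world_to_camera(p, *cam_b) for p in world_track]

# The camera-frame tracks differ, but both map back to the same world-frame track.
recovered = [camera_to_world(p, *cam_b) for p in track_in_b]
```

Predicting directly in the world frame means a consumer never has to undo a camera-specific transform like the one above.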
Model variants
The team trains two variants with different objectives:
MolmoMotion-AR (autoregressive): writes coordinates in a structured format similar to how VLMs generate coordinate text. Predicting step by step encourages smooth rollouts and gives the best accuracy when the future is fairly deterministic.
MolmoMotion-FM (flow-matching): transforms noise into continuous 3D motion, which suits cases where an instruction admits multiple plausible futures and the model has to capture that uncertainty.
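The flow-matching idea can be sketched with a toy example. This is not MolmoMotion-FM: a trained network is replaced here by a closed-form velocity field for a dataset that contains just two equally likely futures, and `marginal_velocity` and `sample_trajectory` are hypothetical names. Sampling starts from Gaussian noise and integrates the velocity field from t = 0 to t = 1 with Euler steps.

```python
import math
import random


def marginal_velocity(x, t, modes):
    """Flow-matching velocity for data that is an equal mixture of `modes`.

    Uses the straight-line path x_t = (1 - t) * noise + t * target, so each
    target contributes (target - x) / (1 - t), weighted by its posterior.
    """
    sigma = 1.0 - t
    logits = [
        -sum((xi - t * mi) ** 2 for xi, mi in zip(x, mode)) / (2.0 * sigma ** 2)
        for mode in modes
    ]
    top = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * (mode[d] - x[d]) for w, mode in zip(weights, modes)) / sigma
        for d in range(len(x))
    ]


def sample_trajectory(modes, steps=50, rng=random):
    """Draw noise, then integrate the velocity field with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in modes[0]]
    dt = 1.0 / steps
    for k in range(steps):
        v = marginal_velocity(x, k * dt, modes)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Two plausible futures for one point's x-coordinate over three steps (meters).
slide_left = [-0.1, -0.2, -0.3]
slide_right = [0.1, 0.2, 0.3]

rng = random.Random(0)
samples = [sample_trajectory([slide_left, slide_right], rng=rng) for _ in range(20)]
```

Each sample should settle on one complete future rather than an average of the two, and different noise draws can land on different futures. That is the behavior that makes this family of models attractive when the instruction is ambiguous.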
Data: MolmoMotion-1M and PointMotionBench
To train the model, the team created MolmoMotion-1M, a dataset of 1.16 million videos with 3D point tracks aligned to action descriptions. Building it required an automated pipeline that:
Grounds the referred object and samples query points.
Tracks 2D points and lifts them to a metric 3D frame.
Filters unstable trajectories, smooths them and trims each clip to the interval where the object actually moves.
The team also releases PointMotionBench, a human-validated benchmark with 2.7K clips for evaluating the accuracy of object-centered 3D motion forecasting.
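Two of these steps can be sketched in a few lines. The sketch assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a metric depth value per tracked pixel; the function names and the 5 mm motion threshold are illustrative choices, not details of the actual pipeline. The lifted points are in camera coordinates, and reaching a world frame would additionally need the camera pose.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at metric depth to camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def trim_to_motion(track, min_step=0.005):
    """Keep the span between the first and last frame-to-frame step above min_step meters."""
    steps = [math.dist(a, b) for a, b in zip(track, track[1:])]
    moving = [i for i, s in enumerate(steps) if s > min_step]
    if not moving:
        return []  # the point never moves enough: drop the track
    return track[moving[0]: moving[-1] + 2]


# One 2D track with per-frame depth: static for two frames, moving, then static again.
pixels = [(320.0, 240.0), (320.0, 240.0), (345.0, 240.0), (370.0, 240.0), (370.0, 240.0)]
depths = [2.0, 2.0, 2.0, 2.0, 2.0]

track_3d = [unproject(u, v, z, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
            for (u, v), z in zip(pixels, depths)]
moving_part = trim_to_motion(track_3d)
```

In this example the first and last frames are dropped, leaving the three frames that span the actual motion.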
Results: benchmarking and downstream tasks
On PointMotionBench, MolmoMotion outperforms existing methods in 3D trajectory prediction. For reference, MolmoMotion-AR with 3 input frames reaches a mean displacement error of 0.109 m, versus 0.129 m for the closest competitors on some splits.
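For readers unfamiliar with the metric, a mean displacement error is commonly computed as the average Euclidean distance between predicted and ground-truth positions. The sketch below assumes the average runs over all query points and all future steps; the benchmark's exact protocol (horizon, alignment, per-clip averaging) may differ.

```python
import math


def mean_displacement_error(pred, target):
    """Average Euclidean distance in meters over all points and future steps.

    Both arguments are indexed as [point][step] -> (x, y, z).
    """
    dists = [
        math.dist(p, t)
        for pred_track, target_track in zip(pred, target)
        for p, t in zip(pred_track, target_track)
    ]
    return sum(dists) / len(dists)


# One query point, two future steps: errors of 0.1 m and 0.3 m.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 0.1), (1.0, 0.0, 0.3)]]
error = mean_displacement_error(pred, target)
```

Under this definition, the gap between 0.109 m and 0.129 m corresponds to an average of about 2 cm less positional error per predicted point.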
Why does this matter in practice? Because better prediction improves real tasks:
Robotics: after fine-tuning on manipulation data (DROID), a control policy initialized with MolmoMotion achieves 76.3% success on pick-and-place tasks versus 56.0% when initialized with Molmo 2. It also learns much faster: 51% success at 10K steps versus 19% for the Molmo 2 initialization.
Video generation: using MolmoMotion trajectories to guide a generator improves motion quality and temporal coherence. In tests, combining DaS + MolmoMotion boosts metrics like temporal consistency and subject consistency compared to much larger image-to-video models.
These results indicate the physical motion knowledge generalizes across domains: from internet videos to robotic control and conditioned video generation.
Limitations and next steps
MolmoMotion uses eight query points per object during training. That's enough to forecast useful trajectories, but not to reconstruct dense surface geometry. In practice this limits how well it handles complex deformations.
Also, like any learned method, it depends on the coverage and quality of the data: very rare scenes or actions not represented in training can degrade predictions.
Future directions include increasing point density, better modeling of deformables and combining multimodal uncertainty with safe control.
MolmoMotion isn’t just another benchmark number. It’s a practical building block so machines stop being mere observers and start anticipating. That changes how you design robots, control simulations and add consistent motion to generated video. And the best part: Ai2 releases the weights, the dataset and the benchmark so the community can improve and use them in production.