MolmoAct 2 arrives as a complete package: model, data, and tools so the community can study and improve how robots act in the physical world. Why does this matter? Writing emails or debugging code is already routine for AI, but getting a robot to load a dishwasher or prepare lab samples reliably for hours remains one of the most practical and urgent open challenges.
What MolmoAct 2 is and why it matters
MolmoAct 2 is the open evolution of the first Action Reasoning Model. It's not just a model with better numbers: it's designed to reason in 3D before acting and to run fast enough for real-time control, and it ships with open resources (weights, datasets, and an improved VLA pipeline) so you or your team can reproduce, research, and adapt the system.
MolmoAct 2 offers inference up to 37× faster than its predecessor and ships the largest open-source bimanual dataset published to date.
If you work in robotics, lab automation, or you're just curious to see AI in the physical world, this changes the conversation: it's not a closed prototype but a foundation you can study and extend.
Key architecture and technical novelties
MolmoAct 2 isn't a minor tweak of its predecessor. It builds on Molmo 2-ER, a variant specialized in embodied reasoning that was trained on ~3 million examples of visual-spatial reasoning: pointing in images, object detection, multi-image spatial reasoning, and spatial questions over video. That backbone makes the model much better at tasks that require understanding geometry and correspondence between views.
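A concrete way to see what that backbone provides: Molmo-family models express pointing as lightweight markup in their text output, which downstream code can parse into pixel coordinates. The tag format below is an assumption based on the original Molmo's output style; Molmo 2-ER's exact schema isn't documented in this post.

```python
import re

# Molmo-family models emit pointing results as inline markup such as:
#   <point x="23.4" y="51.2" alt="mug">mug</point>
# where x/y are percentages of image width/height. This schema is an
# assumption based on the original Molmo, not a documented Molmo 2-ER API.
POINT_RE = re.compile(
    r'<point\s+x="(?P<x>[\d.]+)"\s+y="(?P<y>[\d.]+)"[^>]*>(?P<label>[^<]*)</point>'
)

def parse_points(text: str, width: int, height: int) -> list[dict]:
    """Convert Molmo-style point markup into pixel coordinates."""
    points = []
    for m in POINT_RE.finditer(text):
        points.append({
            "label": m.group("label"),
            "x": float(m.group("x")) / 100.0 * width,
            "y": float(m.group("y")) / 100.0 * height,
        })
    return points

reply = 'The mug is here: <point x="23.4" y="51.2" alt="mug">mug</point>'
print(parse_points(reply, width=640, height=480))
# [{'label': 'mug', 'x': 149.76, 'y': 245.76}]
```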
The system combines that reasoning VLM with an "action expert" that generates actions via flow matching, connected by a KV-cache-style bridge that keeps perception and control coherent. They also publish a fully open action tokenizer, the MolmoAct 2-FAST Tokenizer, a reimplementation of FAST trained on their data.
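The action expert's code isn't shown in the post, but at inference time flow matching reduces to integrating a learned velocity field from Gaussian noise to an action chunk. A minimal sketch, assuming a PyTorch velocity network, Euler integration, and illustrative dimensions (the real expert's architecture and the bridge interface are not public here):

```python
import torch
import torch.nn as nn

ACTION_DIM, CHUNK = 14, 16  # e.g. bimanual joint targets over a short horizon (illustrative)

class VelocityNet(nn.Module):
    """Stand-in for the action expert: predicts the flow velocity v(a_t, t | context)."""
    def __init__(self, ctx_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ACTION_DIM * CHUNK + 1 + ctx_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, ACTION_DIM * CHUNK),
        )

    def forward(self, a, t, ctx):
        x = torch.cat([a.flatten(1), t[:, None], ctx], dim=-1)
        return self.mlp(x).view_as(a)

@torch.no_grad()
def sample_actions(net, ctx, steps: int = 10):
    """Euler-integrate da/dt = v(a, t) from Gaussian noise (t=0) to actions (t=1)."""
    a = torch.randn(ctx.shape[0], CHUNK, ACTION_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.shape[0],), i * dt)
        a = a + dt * net(a, t, ctx)
    return a  # denoised action chunk, conditioned on the VLM's cached context

ctx = torch.randn(1, 512)    # stands in for features passed over the KV-cache bridge
actions = sample_actions(VelocityNet(), ctx)
print(actions.shape)         # torch.Size([1, 16, 14])
```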
MolmoAct 2 also introduces an adapter architecture in the VLA pipeline to make integration and study easier, plus a mechanism called adaptive-depth reasoning that decides when to predict depth tokens so you don't waste compute.
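The post doesn't detail the adapter design, so take this as an assumption: a common pattern for making a VLA pipeline easy to extend is a residual bottleneck adapter inserted between frozen backbone layers, roughly like this sketch (dimensions illustrative):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project / up-project block, a common adapter pattern.
    The actual MolmoAct 2 adapter design isn't specified in the post."""
    def __init__(self, dim: int = 2048, bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity so the frozen model is unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

h = torch.randn(1, 64, 2048)         # hidden states from a frozen VLM layer
print(BottleneckAdapter()(h).shape)  # torch.Size([1, 64, 2048])
```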
MolmoAct 2-Think and selective depth
For tasks that need explicit 3D, MolmoAct 2-Think adds depth tokens. But instead of predicting depth for every image patch, the system focuses prediction on regions where things are moving or changing. The result: a 17% speedup versus predicting depth across all patches, while keeping 3D reasoning quality where it matters.
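The post doesn't describe the exact gating mechanism, but the idea can be sketched with patch-level frame differencing as the change signal; the patch size and threshold below are illustrative:

```python
import torch

def changed_patches(prev: torch.Tensor, curr: torch.Tensor,
                    patch: int = 14, thresh: float = 0.05) -> torch.Tensor:
    """Boolean mask over image patches whose content changed between frames.
    prev/curr: (3, H, W) images in [0, 1], with H and W divisible by `patch`."""
    diff = (curr - prev).abs().mean(dim=0)                         # (H, W) per-pixel change
    h, w = diff.shape[0] // patch, diff.shape[1] // patch
    per_patch = diff.reshape(h, patch, w, patch).mean(dim=(1, 3))  # (h, w) per-patch change
    return per_patch > thresh

prev, curr = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
mask = changed_patches(prev, curr)
print(f"depth tokens for {int(mask.sum())}/{mask.numel()} patches")
# Depth tokens are only generated where mask is True;
# static background patches skip depth prediction entirely.
```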
Data: MolmoAct 2-Bimanual YAM and dataset mixing
The team releases MolmoAct 2-Bimanual YAM, with over 720 hours of coordinated bimanual demonstrations: folding towels, scanning groceries, charging a phone, and cleaning tables, among others. It's the largest open-source bimanual dataset to date and represents 30× more robotic data than used in the original MolmoAct.
They also mixed this dataset with blends of SO-100/SO-101 (open-source arms), filtered DROID Franka, Google Robot BC-Z and Fractal data, Bridge WidowX, and MolmoAct's own domestic set. They improved language labels too, re-annotating demonstrations with an open VLM and expanding unique labels from ~71K to ~146K.
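The exact mixture ratios aren't given. A typical way to combine heterogeneous robot datasets is weighted sampling per episode, along these lines (the source keys and weights below are illustrative, not the real MolmoAct 2 recipe):

```python
import itertools
import random

# Illustrative sources and weights; the real MolmoAct 2 mixture ratios aren't public here.
MIXTURE = {
    "molmoact2_bimanual_yam": 0.40,
    "so100_so101": 0.15,
    "droid_franka_filtered": 0.15,
    "bcz_fractal": 0.10,
    "bridge_widowx": 0.10,
    "molmoact_household": 0.10,
}

def mixed_samples(loaders: dict, weights: dict, n: int):
    """Yield n episodes, each drawn from a dataset chosen by its mixture weight."""
    names = list(weights)
    probs = [weights[name] for name in names]
    for _ in range(n):
        source = random.choices(names, weights=probs, k=1)[0]
        yield source, next(loaders[source])

# Toy loaders standing in for real episode iterators:
loaders = {name: itertools.cycle([f"episode from {name}"]) for name in MIXTURE}
for source, episode in mixed_samples(loaders, MIXTURE, n=5):
    print(source, episode)
```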
Performance: simulation, zero-shot and adaptation
On embodied reasoning benchmarks, Molmo 2-ER averages 63.8/100 across 13 tests (pointing, ego-exo correspondence, spatial reasoning in video), outperforming systems like GPT-5 and Gemini 2.5 Pro on those tasks.
In household simulation (MolmoBot), MolmoAct 2 reaches 20.6% success, roughly double π0.5. On RoboEval, which focuses on continuous bimanual manipulation, it scores 0.443 versus 0.405 for π0.5.
Zero-shot on a Franka arm, the numbers are tangible: tasks like placing an apple on a plate reach 100% success, and precise tasks like inserting a pipette hit 86.7%. On average, MolmoAct 2 achieves 87.1% success, compared to 48.4% for the original MolmoAct and 45.2% for π0.5.
After post-training on the LIBERO suite, MolmoAct 2 reaches 97.2% and MolmoAct 2-Think 98.1% average success, improvements of about 10–11 points over the prior version.
An independent benchmark by Cortex AI evaluated five bimanual policies and placed MolmoAct 2 first with 0.51, beating alternatives like OpenVLA-OFT and π0.5 and winning 7 of 8 tasks in the set.
Latency and responsiveness
Speed changes how robots feel in real life: an action call takes ~180 ms on the base model and ~790 ms when adaptive-depth reasoning is active. For comparison, the original version required ~6700 ms in a benchmark environment on an NVIDIA H100. That separates a robot that seems to pause between moves from one that reacts almost in real time.
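Those numbers also let you sanity-check the headline speedup and translate latency into replanning rates:

```python
# Per-action-call latencies reported in the post (NVIDIA H100):
original_ms = 6700   # original MolmoAct
base_ms = 180        # MolmoAct 2, base model
think_ms = 790       # MolmoAct 2 with adaptive-depth reasoning

print(f"speedup vs. v1: {original_ms / base_ms:.1f}x")         # ~37.2x, the headline claim
print(f"replanning rate, base:  {1000 / base_ms:.1f} Hz")      # ~5.6 Hz
print(f"replanning rate, think: {1000 / think_ms:.1f} Hz")     # ~1.3 Hz
print(f"replanning rate, v1:    {1000 / original_ms:.2f} Hz")  # ~0.15 Hz
```

One caveat: VLA policies typically emit a chunk of actions per call, so the low-level control rate can be higher than these replanning rates.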
Deployment and real-world pilots
To make adoption easier, they publish a reference setup: two YAM arms, an Intel RealSense D435 top camera, two D405s for close views, an extendable mount, and a simple table. That helps reproduce tabletop experiments and bimanual work without starting from scratch.
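If you want to replicate the rig, opening the three RealSense color streams with pyrealsense2 looks roughly like this; the serial numbers are placeholders for your own cameras, and the stream resolution is illustrative:

```python
import pyrealsense2 as rs

# Placeholder serial numbers: replace with your own (rs-enumerate-devices lists them).
CAMERAS = {
    "top_d435": "000000000001",
    "close_d405_a": "000000000002",
    "close_d405_b": "000000000003",
}

pipelines = {}
for name, serial in CAMERAS.items():
    cfg = rs.config()
    cfg.enable_device(serial)
    cfg.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
    pipe = rs.pipeline()
    pipe.start(cfg)
    pipelines[name] = pipe

# Grab one frame set per camera to sanity-check the rig:
for name, pipe in pipelines.items():
    frames = pipe.wait_for_frames()
    color = frames.get_color_frame()
    print(name, color.get_width(), color.get_height())
```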
MolmoAct 2 has already been tested in pilots with academic partners. In Stanford's Cong Lab, the model helps with repetitive steps in CRISPR experiments, moving samples and operating bench equipment. It doesn't replace human expertise, but it can automate routine operations and speed up time-consuming workflows.
They also ran internal robustness tests: rewritten instructions, moved objects, distractors, and object substitutions. Those tests show how well the model follows human intent when the scene changes.
Limitations and research opportunities
MolmoAct 2 is powerful, but not perfect. It still fails when the gripper blocks the camera, when the control loop demands responses faster than the model can deliver, or in extremely fine manipulations, and 2D operator traces can introduce errors along the depth axis.
These limitations are exactly why the system is open: models, data, and soon training code are tools for the community to research practical problems like occlusion, latency, and fine control.
MolmoAct 2 is meant to be studied, replicated, and improved. If you work with robots, lab automation, or physical-digital interfaces, having access to weights, datasets, and an open pipeline accelerates research and reduces friction for real deployments.
Think of it this way: we're no longer only looking at models that "work in the lab"; we have an open foundation to start closing the gap between controlled research and robots that are useful in real-world settings.