MolmoSpaces: an open platform for embodied AI and robotics
MolmoSpaces is a large-scale bet on helping the next generation of AI act in the physical world with real generality. Why does that matter to you? Training robots only in tidy, controlled labs makes them fragile when they hit real homes, offices, hospitals, or crowded museums.
Think about it: a robot that learned to pick up a cup on a perfect white table might fail when the same cup sits on a cluttered kitchen counter. MolmoSpaces aims to close that gap by giving researchers and practitioners diverse, realistic training grounds.
What is MolmoSpaces?
MolmoSpaces is an open ecosystem to study embodied learning at scale. It unifies over 230,000 interior scenes and 130,000 object models, together with more than 42 million 6-DoF grasp poses, plus tools to convert, validate, and evaluate everything across multiple simulators.
Base format: MJCF, with conversion to USD for portability.
Compatibility: MuJoCo, ManiSkill and NVIDIA Isaac Lab/Sim.
Asset origins: a curated mix of Objaverse and assets from the THOR family.
MolmoSpaces is designed to be reproducible and extensible: you can inspect the MJCF, regenerate grasps, add robots, and compare results across different physics engines.
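To get a feel for how inspectable the assets are, here is a minimal sketch of loading one MJCF asset with the MuJoCo Python bindings and stepping its passive dynamics. The file path is a hypothetical example; the platform's own loaders may wrap this differently.

```python
# Minimal sketch: load an MJCF asset with the MuJoCo Python bindings and
# let it settle for one simulated second. The path is a hypothetical example.
import mujoco

model = mujoco.MjModel.from_xml_path("assets/mug_0001/mug_0001.xml")  # hypothetical path
data = mujoco.MjData(model)

# Step the passive dynamics for ~1 s to check the asset is stable on load.
for _ in range(int(1.0 / model.opt.timestep)):
    mujoco.mj_step(model, data)

print("bodies:", model.nbody, "geoms:", model.ngeom)
print("total mass (kg):", sum(model.body_mass))
```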
Physics and realistic simulation as a foundation
Unlike simulators that use “magic grasps” (where an object attaches to the gripper as soon as it enters a trigger sphere), MolmoSpaces prioritizes real physics engines (for example MuJoCo) and validated parameters.
For rigid objects we verify mass and density by comparing to estimates annotated by LLMs and adjust densities when needed.
For articulated objects we use a teleoperation suite and a simulated robot (Franka FR3) tuned with system identification from real trajectories.
Collision geometry and mesh preparation are annotated manually: CoACD colliders, primitive shapes for furniture receptacles, and convex decomposition for fine manipulable objects.
These steps reduce artifacts like interpenetrations, drifting and unrealistic grasps that make sim-to-real transfer harder.
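As a rough illustration of the rigid-object mass/density check mentioned above, here is a sketch of comparing the mass implied by a mesh volume against an LLM-annotated density estimate and falling back to the estimate when they diverge. Field names and the tolerance are assumptions, not the platform's actual pipeline.

```python
# Rough sketch of a density sanity check, assuming each asset ships with a
# mesh volume (m^3), a simulator mass (kg), and an LLM-annotated density
# estimate (kg/m^3). Names and tolerance are illustrative, not the platform's.
def adjust_mass(mesh_volume_m3: float,
                sim_mass_kg: float,
                llm_density_kg_m3: float,
                tolerance: float = 0.5) -> float:
    """Return a corrected mass when the simulated density drifts too far
    from the annotated estimate."""
    sim_density = sim_mass_kg / mesh_volume_m3
    relative_error = abs(sim_density - llm_density_kg_m3) / llm_density_kg_m3
    if relative_error > tolerance:
        # Fall back to the annotated density.
        return llm_density_kg_m3 * mesh_volume_m3
    return sim_mass_kg

# Example: a 0.3 L ceramic mug annotated at ~2000 kg/m^3 but simulated at 2.5 kg.
print(adjust_mass(mesh_volume_m3=3e-4, sim_mass_kg=2.5, llm_density_kg_m3=2000.0))  # 0.6
```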
MolmoSpaces-Bench: evaluating policies under controlled variations
MolmoSpaces-Bench is a benchmark designed to evaluate generalist policies under controlled variations. Instead of a single aggregated score, it proposes distributional analyses along several axes.
Task complexity: from atomic steps to hierarchical tasks.
Sensing conditions: lighting, viewpoints.
Physical dynamics: friction, mass.
Task semantics: variations in instruction wording.
Included tasks: atomic skills (pick, place, open, close), compositions and goals that integrate navigation. This lets you study, for example, how robust a grasp is to mass changes or how fragile prompts are to small semantic tweaks.
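Here is a sketch of the kind of single-factor sweep this enables: hold the task fixed, vary one physical parameter, and report a success distribution instead of one aggregated number. The environment factory, episode runner and policy are hypothetical stand-ins, not the benchmark's API.

```python
# Sketch of a single-factor sweep in the spirit of MolmoSpaces-Bench:
# hold the task fixed, vary one parameter (e.g. object mass or friction),
# and report a success rate per setting rather than one aggregated score.
# The environment factory, episode runner and policy are hypothetical.
import statistics

def sweep_factor(make_env, run_episode, policy, values, episodes_per_setting=20):
    """Return {factor value: success rate} for a policy under one varied factor."""
    results = {}
    for value in values:
        outcomes = [
            run_episode(make_env(value, seed=s), policy)  # True/False per episode
            for s in range(episodes_per_setting)
        ]
        results[value] = statistics.mean(outcomes)
    return results

# Example: how quickly does a pick policy degrade as the cup gets heavier?
# sweep_factor(make_pick_env, run_pick_episode, policy, values=[0.1, 0.3, 0.5, 1.0])
```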
Assets and scenes at scale
The pipeline starts from 625,000 Objaverse assets and applies strict filters: complete metadata, single-object validation, scale normalization, texture quality (score >= 4), cross-renderer fidelity (CLIP similarity >= 0.6), efficient geometry (< 1.5 MB) and receptacle validation.
Result: ~129,000 curated assets (about 3,000 synsets), split into train/val/test. From the THOR family, 1,600+ rigid objects across 134 categories were extracted and converted, and many articulated objects (doors, fridges, microwaves, etc.) were added with explicit annotations of joint type, axis and range.
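A sketch of what the Objaverse curation thresholds could look like as a filter over per-asset metadata; the field names are illustrative assumptions, and only the numeric thresholds come from the description above.

```python
# Sketch of the curation thresholds applied to hypothetical per-asset metadata.
# Field names are illustrative; the real pipeline's schema may differ.
MAX_MESH_BYTES = int(1.5 * 1024 * 1024)  # "< 1.5 MB" geometry budget

def passes_curation(asset: dict) -> bool:
    return (
        asset.get("metadata_complete", False)
        and asset.get("is_single_object", False)
        and asset.get("texture_score", 0) >= 4
        and asset.get("clip_cross_renderer_similarity", 0.0) >= 0.6
        and asset.get("mesh_bytes", MAX_MESH_BYTES) < MAX_MESH_BYTES
        and asset.get("receptacle_valid", True)  # only relevant for receptacles
    )

example = {
    "metadata_complete": True,
    "is_single_object": True,
    "texture_score": 5,
    "clip_cross_renderer_similarity": 0.72,
    "mesh_bytes": 900_000,
    "receptacle_valid": True,
}
print(passes_curation(example))  # True
```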
Scenes come from iTHOR-120, ProcTHOR-10K, ProcTHOR-Objaverse and Holodeck, and combine:
Hand-made environments and manually reproduced digital twins.
Heuristic procedural generation and LLM-assisted generation for diversity.
Automatic validations include motion tests (rigid objects must move more than 2 cm under small forces), articulation tests (minimum 60% of range), and collision/drift detection. Over 95% of scenes pass these checks. Occupancy maps are also generated to place collision-free starts.
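As an illustration of the motion test, here is a sketch using the MuJoCo Python bindings: apply a small external force to a body, step the simulation, and check that it moved more than 2 cm. The body name, force magnitude and duration are assumptions, not the platform's exact settings.

```python
# Sketch of a "rigid objects must move more than 2 cm under small forces" check
# using the MuJoCo Python bindings. Force magnitude and duration are illustrative.
import numpy as np
import mujoco

def passes_motion_test(model, data, body_name: str,
                       force_newtons: float = 2.0,
                       duration_s: float = 0.5,
                       min_displacement_m: float = 0.02) -> bool:
    body_id = model.body(body_name).id
    start = np.array(data.body(body_name).xpos)  # copy the initial position
    steps = int(duration_s / model.opt.timestep)
    for _ in range(steps):
        # Apply a small horizontal force to the body's centre of mass.
        data.xfrc_applied[body_id, :3] = [force_newtons, 0.0, 0.0]
        mujoco.mj_step(model, data)
    data.xfrc_applied[body_id] = 0.0  # clear the external force
    displacement = np.linalg.norm(data.body(body_name).xpos - start)
    return displacement > min_displacement_m
```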
Grasps: 42 million poses and robust verification
MolmoSpaces includes more than 42M 6-DoF grasp poses over 48,000+ objects (up to ~1,000 poses per object). The key methodology:
Sampling directly from MJCF geometry using the Robotiq-2F85 gripper model.
For articulated objects sampling focuses on leaf components (handles) and discards grasps that collide with non-handle geometry.
Diverse selection: clustering in 6-DoF space and uniform selection across clusters, with contact-point preferences based on object type (sketched after this list).
Robustness tests: linear and rotational perturbations for rigid items; for articulated parts, we require the grasp to actuate the joint through at least 70% of its range while maintaining contact.
Final verification with a floating gripper that attempts to lift and operate the object.
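Here is a sketch of the diverse-selection idea from the list above: cluster 6-DoF poses and keep one representative per cluster. The pose encoding (position plus rotation vector) and the rotation weighting are illustrative choices, not necessarily the platform's.

```python
# Sketch: pick a diverse subset of 6-DoF grasp poses by k-means clustering and
# keeping the member closest to each cluster centre. Pose encoding and the
# rotation weighting are illustrative, not the platform's exact procedure.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_grasps(poses: np.ndarray, k: int = 64,
                          rotation_weight: float = 0.1) -> np.ndarray:
    """poses: (N, 6) array of [x, y, z, rx, ry, rz]; returns k representative rows."""
    features = poses.copy()
    features[:, 3:] *= rotation_weight  # trade off metres against radians
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    picks = []
    for c in range(k):
        members = np.flatnonzero(kmeans.labels_ == c)
        dists = np.linalg.norm(features[members] - kmeans.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])  # member closest to its centre
    return poses[np.array(picks)]

# Example with random poses standing in for sampled grasps:
# select_diverse_grasps(np.random.rand(1000, 6), k=16).shape  # -> (16, 6)
```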
Poses can be injected into scenes via a grasp loader and the platform includes a trajectory-generation pipeline to create reproducible demonstrations and imitation datasets.
Tools, compatibility and data collection
Everything is modular and open: MJCF, grasps, physics parameters, materials and metadata. Included tools:
Loaders and utilities to load assets across simulators.
Conversion script to USD for Isaac Lab/Sim compatibility (see the sketch after this list).
ManiSkill loader support.
Teleoperation interface to collect demonstrations with mobile platforms like Teledex (even from your phone).
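As a purely illustrative example of how the USD conversion might be driven in batch, here is a sketch that walks an asset tree and hands each MJCF file to a converter; convert_mjcf_to_usd is a hypothetical stand-in for the platform's conversion script, and only the directory-walking logic is generic.

```python
# Illustrative batch driver for MJCF -> USD conversion. The converter callable
# is a hypothetical stand-in for the platform's conversion script.
from pathlib import Path

def batch_convert(asset_root: str, out_root: str, convert_mjcf_to_usd) -> None:
    for mjcf_path in Path(asset_root).rglob("*.xml"):
        usd_path = Path(out_root) / mjcf_path.relative_to(asset_root).with_suffix(".usd")
        usd_path.parent.mkdir(parents=True, exist_ok=True)
        convert_mjcf_to_usd(str(mjcf_path), str(usd_path))  # hypothetical converter
```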
The infrastructure supports different embodiments (single-arm, dual-arm) and controllers, making comparisons between setups easier.
What does this mean for research and products?
MolmoSpaces gives the community what many have been asking for: data, scenes and tools to vary factors one at a time at scale. What do you get from that? More reproducible research, finer failure diagnostics for out-of-distribution cases, and better sim-to-real transfer studies.
For entrepreneurs and applied teams, the platform lowers prototyping costs and lets you test algorithms in situations closer to real-world complexity. For researchers, it enables controlled experiments across physical, sensory and semantic dimensions.
At the end of the day, controlling training diversity and measuring generalization systematically gets us closer to robots that don’t just learn tasks, but learn to adapt.
MolmoSpaces is already available: assets, scenes, grasps, tools and pipelines to start experimenting.