What makes MolmoMotion's approach to representing motion different from other methods?

MolmoMotion represents motion as a sparse set of 3D points attached to an object's surface. This representation was chosen because it is class-agnostic (it requires no category-specific templates for humans, hands, or other object types), view-stable (the 3D trajectory remains consistent regardless of camera movement), and directly usable by downstream applications such as robot planners, which do not need to interpret it from pixels or rendered meshes.
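
To make the view-stability property concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not MolmoMotion's actual data format: the (T, N, 3) array layout, the function names, and the camera poses are all hypothetical. It shows that the same motion looks different in two camera frames, while the underlying 3D trajectory is identical once both views are expressed in a shared world frame.

```python
import numpy as np

# Hypothetical layout (an assumption, not MolmoMotion's format):
# T timesteps, N tracked surface points, 3 coordinates (x, y, z).
T, N = 4, 3

# Three points on an object that slides 0.1 units per step along the world x-axis.
start = np.array([[0.0, 0.0, 1.0],
                  [0.1, 0.0, 1.0],
                  [0.0, 0.1, 1.0]])
steps = np.arange(T)[:, None, None] * np.array([0.1, 0.0, 0.0])
world_traj = start[None, :, :] + steps  # shape (T, N, 3)


def world_to_camera(points, rotation, translation):
    """Express world-frame points in a camera frame: p_cam = R @ p_world + t."""
    return points @ rotation.T + translation


def camera_to_world(points, rotation, translation):
    """Invert world_to_camera: p_world = R^T @ (p_cam - t)."""
    return (points - translation) @ rotation


def yaw(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Two hypothetical camera poses observing the same motion.
cam_a = (yaw(0.0), np.zeros(3))
cam_b = (yaw(0.5), np.array([0.3, -0.2, 0.1]))

seen_a = world_to_camera(world_traj, *cam_a)
seen_b = world_to_camera(world_traj, *cam_b)

# The camera-frame coordinates differ between the two views ...
views_differ = not np.allclose(seen_a, seen_b)

# ... but mapping each view back to the world frame recovers the same trajectory.
recovered_a = camera_to_world(seen_a, *cam_a)
recovered_b = camera_to_world(seen_b, *cam_b)
views_agree_in_world = np.allclose(recovered_a, recovered_b)
```

Nothing in the array refers to an object category, which is the sense in which a point-based representation is class-agnostic: a planner can consume the array directly, whatever the object is.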

MolmoMotion: Language-guided 3D motion forecasting

AI2 Releases MolmoMotion for Language-Guided 3D Motion Forecasting

The Allen Institute for AI (AI2) has released MolmoMotion, a new model designed to forecast an object's future 3D motion from a single video frame and a natural language instruction. The model addresses a fundamental challenge in AI: moving from retrospective perception (what has happened) to prospective prediction (what will happen next). This capability is critical for developing more sophisticated systems in robotics, where anticipating an object's path is necessary for manipulation, and in video generation, where physical plausibility depends on realistic motion.

MolmoMotion operates by representing motion as a sparse set of 3D points on an object, a method that is both efficient and generalizable across different object types, including rigid, articulated, and deformable bodies. The system is built on the Molmo 2 vision-language model, allowing it to ground text commands like “roll a lint roller” to specific objects and then predict their trajectories. Alongside the model, AI2 is releasing a substantial set of resources to support further research:

MolmoMotion-1M: A new dataset containing 1.16 million videos with annotated, object-grounded 3D point trajectories and corresponding action descriptions.
PointMotionBench: A human-validated benchmark of 2.7K video clips designed to quantitatively measure the accuracy of 3D motion forecasting.
Two Model Variants: An autoregressive version (MolmoMotion-AR) for high-accuracy predictions and a flow-matching version (MolmoMotion-FM) better suited for handling uncertainty.
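
The announcement does not say which metric PointMotionBench uses. As a hedged illustration of how forecast accuracy for 3D point trajectories can be scored, the sketch below computes average and final displacement error, two measures widely used in trajectory forecasting. The function name, the array layout, and the toy data are assumptions made for this example and are not taken from the benchmark.

```python
import numpy as np


def displacement_errors(pred, target):
    """Return (average, final) Euclidean error between two (T, N, 3) trajectories."""
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Per-timestep, per-point Euclidean distance, shape (T, N).
    dist = np.linalg.norm(pred - target, axis=-1)
    # Average over all timesteps and points, and over points at the last timestep.
    return float(dist.mean()), float(dist[-1].mean())


# Toy data: 3 timesteps, 2 points; the prediction drifts along x by 0, 0.1, 0.2.
target = np.zeros((3, 2, 3))
pred = target.copy()
pred[1, :, 0] = 0.1
pred[2, :, 0] = 0.2

ade, fde = displacement_errors(pred, target)
# ade is approximately 0.1 and fde is approximately 0.2 for this toy input.
```

The average error rewards a forecast that stays close over the whole horizon, while the final error isolates how far the prediction has drifted by the last step.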

The impact of this approach was demonstrated in several downstream tasks. In simulated robotics, a control policy guided by MolmoMotion raised pick-and-place success from 56.0% for the baseline to 76.3%. When used to steer a video generation model, its trajectory predictions produced videos that followed instructions more precisely and scored higher on motion quality metrics than much larger, unguided models. By open-sourcing the models, dataset, and benchmark, AI2 provides the community with a comprehensive toolkit for building systems that can reason about and anticipate physical motion.

AI2's core contribution is not just the model itself but also the automated pipeline that extracts 3D motion data from unconstrained internet video, a scalable solution to the data bottleneck that has historically limited the development of general-purpose motion forecasting systems.
