Ai2 said its MolmoAct model is safe, interpretable, adaptable, and truly open. | Source: Ai2, Adobe Stock
The Allen Institute for AI, also known as Ai2, yesterday announced the release of MolmoAct 7B, an embodied AI model that it said brings the capabilities of state-of-the-art artificial intelligence into the physical world.
Instead of reasoning through language and converting that into movement, Ai2 said MolmoAct actually sees its surroundings; understands the relationships between space, movement, and time; and plans its movements accordingly. The model generates visual reasoning tokens that transform 2D image inputs into 3D spatial plans, enabling robots to navigate the physical world with greater intelligence and control.
“Embodied AI needs a new foundation that prioritizes reasoning, transparency, and openness,” stated Ali Farhadi, CEO of Ai2. “With MolmoAct, we’re not just releasing a model; we’re laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world. It’s a step toward AI that can reason and navigate the world in ways that are more aligned with how humans do — and collaborate with us safely and effectively.”
Ai2 is a Seattle-based nonprofit AI research institute with the mission of building AI to solve the world’s biggest problems. Founded in 2014 by late Microsoft co-founder Paul G. Allen, Ai2 said it develops foundational AI research and new applications through large-scale open models, open data, robotics, conservation platforms, and more.
Ai2 claims MolmoAct is the first ‘action reasoning model’
While spatial reasoning isn’t new, most modern systems rely on closed, end-to-end architectures trained on massive proprietary datasets. These models are difficult to reproduce, expensive to scale, and often operate as opaque black boxes, according to Ai2.
The institute claimed that MolmoAct offers a fundamentally different approach. The model is trained entirely on open data, is designed for transparency, and is built for real-world generalization. Its step-by-step visual reasoning traces enable users to preview what a robot plans to do and steer its behavior in real time as conditions change, Ai2 said.
Ai2 called MolmoAct an “action reasoning model” (ARM) to indicate that it can interpret high-level natural language instructions and reason through a sequence of physical actions to carry them out in the real world.
Traditional end-to-end robotics models treat tasks as a single, opaque step, said the institute. Instead, ARMs interpret high-level instructions and break them down into a transparent chain of spatially grounded decisions:
- 3D-aware perception: grounding the robot’s understanding of its environment using depth and spatial context
- Visual waypoint planning: outlining a step-by-step task trajectory in image space
- Action decoding: converting the plan into precise, robot-specific control commands
This layered reasoning enables MolmoAct to interpret commands like “Sort this trash pile” not as a single step, but as a structured series of sub-tasks. The model recognizes the scene, groups objects by type, grasps them one by one, and repeats.
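As an illustration only, the three-stage decomposition described above can be sketched in code. Everything here — function names, the toy scene format, and the command strings — is hypothetical and does not reflect Ai2's actual API; it simply shows how a high-level instruction like "sort this trash pile" could break down into perception, waypoint planning, and action decoding:

```python
from dataclasses import dataclass

# Hypothetical types illustrating the ARM decomposition; not Ai2's API.

@dataclass
class SceneObject:
    name: str
    category: str    # e.g. "plastic", "paper"
    position: tuple  # (x, y) waypoint in image space

def perceive(scene):
    """3D-aware perception: ground the instruction in detected objects."""
    return [SceneObject(*o) for o in scene]

def plan_waypoints(objects):
    """Visual waypoint planning: group objects by category, then order
    them into a step-by-step trajectory through image space."""
    groups = {}
    for obj in objects:
        groups.setdefault(obj.category, []).append(obj)
    plan = []
    for category, items in sorted(groups.items()):
        for obj in items:
            plan.append((obj.position, f"place {obj.name} in {category} bin"))
    return plan

def decode_actions(plan):
    """Action decoding: turn each waypoint into a robot-specific command."""
    return [f"move_to{pos}; {step}" for pos, step in plan]

scene = [("bottle", "plastic", (3, 1)),
         ("napkin", "paper", (1, 2)),
         ("cup", "plastic", (2, 4))]
commands = decode_actions(plan_waypoints(perceive(scene)))
for cmd in commands:
    print(cmd)
```

The point of the sketch is the structure, not the logic inside each stage: each step produces an inspectable intermediate (objects, then waypoints, then commands), which is what makes the chain of decisions previewable and correctable.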
Ai2 builds MolmoAct to scale rapidly
MolmoAct 7B, the first in its model family, was trained on a curated dataset of about 12,000 “robot episodes” from real-world environments, such as kitchens and bedrooms. Ai2 transformed these demonstrations into robot-reasoning sequences that expose how complex instructions map to grounded, goal-directed actions.
Along with the model, the institute is releasing the MolmoAct post-training dataset, which contains about 12,000 distinct “robot episodes.” Ai2 researchers spent months curating videos of robots performing actions in diverse household settings, from arranging pillows on a living room couch to putting away laundry in a bedroom.
Despite its strong performance, Ai2 said it trained MolmoAct efficiently. Training required just 18 million samples: pretraining ran on 256 NVIDIA H100 graphics processing units (GPUs) for about 24 hours, and fine-tuning took only about two more hours on 64 GPUs.
In contrast, many commercial models require hundreds of millions of samples and far more compute. Yet MolmoAct outperformed many of these systems on key benchmarks, including a 71.9% success rate on SimPLER. This demonstrated that high-quality data and thoughtful design can outperform models trained with far more data and compute, said Ai2.
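For context, the reported figures translate into a rough compute budget. This back-of-the-envelope calculation uses only the numbers quoted above:

```python
# Rough GPU-hour budget implied by the reported training figures.
pretrain_gpu_hours = 256 * 24  # 256 H100 GPUs for ~24 hours
finetune_gpu_hours = 64 * 2    # 64 GPUs for ~2 hours
total_gpu_hours = pretrain_gpu_hours + finetune_gpu_hours

print(pretrain_gpu_hours)  # 6144
print(finetune_gpu_hours)  # 128
print(total_gpu_hours)     # 6272
```

By this estimate, the entire run fits in roughly 6,300 H100 GPU-hours, a small budget by the standards of the commercial models Ai2 is drawing the contrast with.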
Ai2 keeps MolmoAct open and transparent
Ai2 said it built MolmoAct for transparency. Users can preview the model’s planned movements before execution, with motion trajectories overlaid on camera images.
In addition, users can adjust these plans using natural language or quick sketching corrections on a touchscreen—providing fine-grained control and enhancing safety in real-world environments like homes, hospitals, and warehouses.
In accordance with Ai2’s mission, MolmoAct is fully open-source and reproducible. The institute is releasing everything needed to build, run, and extend the model: training pipelines, pre- and post-training datasets, model checkpoints, and evaluation benchmarks.
The model and model artifacts – including training checkpoints and evals – are available from Ai2’s Hugging Face repository.
Learn about the latest in AI at RoboBusiness
This year’s RoboBusiness, which will be held Oct. 15 and 16 in Santa Clara, Calif., will feature the Physical AI Forum. The track will include talks on safety and AI, simulation-to-reality reinforcement learning, data curation, deploying AI-powered robots, and more.
Attendees can hear from experts from Dexterity, ABB Robotics, UC Berkeley, Roboto, GrayMatter Robotics, Diligent Robotics, and Dexman AI. In addition, the show will open with a keynote from Deepu Talla, vice president of robotics and edge AI at NVIDIA, on how physical AI is ushering in a new era of robotics.
RoboBusiness is the premier event for developers and suppliers of commercial robots. The event is produced by WTWH Media, which also produces The Robot Report, Automated Warehouse, and the Robotics Summit & Expo.
This year’s conference will include more than 60 speakers, a startup workshop, the annual Pitchfire competition, and numerous networking opportunities. Over 100 exhibitors on the show floor will showcase their latest enabling technologies, products, and services to help solve your robotics development challenges.
Registration is now open for RoboBusiness 2025.