A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on the decoder's strong generative prior. Our representation is efficient and interpretable, and it integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from a compact State representation encoded from static images, challenging the prevalent reliance of latent-action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
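To make the pipeline concrete, below is a minimal PyTorch sketch of the two-token state encoding and the emergent latent action. The encoder architecture, token width, and all names here are illustrative assumptions rather than the released implementation, and the pre-trained DiT decoder is omitted since its interface is not specified in this page.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of StaMo-style two-token state encoding.
# Architecture and dimensions are assumptions, not the paper's code.
TOKEN_DIM = 768   # assumed latent token width
NUM_TOKENS = 2    # the highly compressed two-token state

class LightweightEncoder(nn.Module):
    """Maps an RGB observation to NUM_TOKENS latent state tokens."""
    def __init__(self, token_dim: int = TOKEN_DIM):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 256, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_tokens = nn.Linear(256, NUM_TOKENS * token_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)  # (B, 256)
        return self.to_tokens(feats).view(-1, NUM_TOKENS, TOKEN_DIM)

encoder = LightweightEncoder()
obs_t = torch.randn(1, 3, 224, 224)   # observation at time t
obs_t1 = torch.randn(1, 3, 224, 224)  # observation at time t+1
z_t, z_t1 = encoder(obs_t), encoder(obs_t1)  # state tokens

# The latent action emerges as the difference between consecutive
# state tokens; a small head could decode it into robot actions.
latent_action = z_t1 - z_t  # (B, NUM_TOKENS, TOKEN_DIM)
```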
An overview of our StaMo framework. Our method efficiently compresses and encodes robotic visual representations, enabling the learning of a compact state representation. Motion naturally emerges as the difference between these states in the highly compressed token space. This approach facilitates efficient world modeling, demonstrates strong generalization, and has the potential to scale further with more data.
Reconstruction, Motion Interpolation and Motion Transfer.
Visual reconstruction results from the same episode. The first and last frames are reconstructed from ground-truth images using the StaMo encoder, while the intermediate frames are generated by linearly interpolating between the latent state tokens of the two endpoints. The transitions show that both the robotic arm and the objects move continuously and smoothly.
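The interpolation in this figure amounts to a convex combination of the endpoint tokens. A minimal sketch, assuming the same two-token layout as in the earlier example; the decoding step is elided because the DiT decoder's interface is not specified here.

```python
import torch

def interpolate_states(z_start: torch.Tensor, z_end: torch.Tensor,
                       num_steps: int = 8):
    """Linearly blend two latent states, endpoints included."""
    for i in range(num_steps):
        alpha = i / (num_steps - 1)
        yield (1.0 - alpha) * z_start + alpha * z_end

# Random stand-ins for the encoded first and last frames
# (batch, 2 tokens, token dim) -- shapes are assumptions.
z_first, z_last = torch.randn(1, 2, 768), torch.randn(1, 2, 768)
blends = list(interpolate_states(z_first, z_last))
# Decoding each blend with the pre-trained DiT decoder would yield
# the intermediate frames shown above (decoder call omitted).
```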
Transfer linear interpolation experiment with the StaMo encoder. The left and right panels illustrate different task scenarios, where reconstructions are obtained as tokens(3) + tokens(2) - tokens(1), demonstrating the linear, compositional structure of the latent representations under transfer.
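The transfer in this caption can be written directly as arithmetic on the tokens: subtracting a start state from an end state isolates the motion, which is then added to a new scene's start state. A minimal sketch with random stand-ins for encoder outputs; shapes and names are assumptions.

```python
import torch

# Hypothetical latent-arithmetic transfer mirroring the caption:
# tokens(3) + tokens(2) - tokens(1).
z1 = torch.randn(1, 2, 768)  # source scene, start state  (tokens(1))
z2 = torch.randn(1, 2, 768)  # source scene, end state    (tokens(2))
z3 = torch.randn(1, 2, 768)  # target scene, start state  (tokens(3))

# Adding the source-state difference to the target start state
# transfers the motion; decoding z_transferred with the DiT decoder
# would produce the reconstructions shown in the figure.
z_transferred = z3 + (z2 - z1)
```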
@article{liu2025stamo,
  title={StaMo: Unsupervised Learning of Generalizable Robotic Motions from Static Images},
  author={Liu, Mingyu and Shu, Jiuhe and Chen, Hui and Li, Zeju and Zhao, Canyu and Yang, Jiange and Gao, Shenyuan and Chen, Hao and Shen, Chunhua},
  journal={arXiv preprint arXiv:2510.05057},
  year={2025}
}