metadata
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
Mantis
This is the official checkpoint of Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight.
Highlights
- Disentangled Visual Foresight augments action learning without overburdening the backbone.
- Progressive Training preserves the understanding capabilities of the backbone.
- Adaptive Temporal Ensemble reduces inference cost while maintaining stable control.
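To make the last highlight more concrete, here is a minimal sketch of plain temporal ensembling over overlapping action chunks, in the style of the exponential weighting commonly used with action chunking. It is not the Mantis implementation: the function name `temporal_ensemble`, the `chunks` layout, and the `decay` parameter are all hypothetical, and the adaptive part (deciding when ensembling is worth its cost) is not described in this card and is not modeled here.

```python
import math


def temporal_ensemble(chunks, t, decay=0.01):
    """Blend every prediction that was made for timestep t.

    chunks: dict mapping the timestep at which a chunk was predicted to a
            list of actions (each action is a list of floats); chunk[k] is
            the action intended for timestep start + k.
    decay:  hypothetical smoothing factor; the i-th oldest prediction gets
            weight exp(-decay * i), with i = 0 being the oldest.
    """
    candidates = []
    for start in sorted(chunks):  # oldest chunk first
        offset = t - start
        if 0 <= offset < len(chunks[start]):
            candidates.append(chunks[start][offset])
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")

    weights = [math.exp(-decay * i) for i in range(len(candidates))]
    total = sum(weights)
    dim = len(candidates[0])
    # Weighted average, computed independently for each action dimension.
    return [
        sum(w * a[d] for w, a in zip(weights, candidates)) / total
        for d in range(dim)
    ]


# Two overlapping 3-step chunks of 2-D actions (made-up numbers).
chunks = {
    0: [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    1: [[3.0, 3.0], [4.0, 4.0], [5.0, 5.0]],
}
action = temporal_ensemble(chunks, t=1, decay=0.0)
```

With `decay=0.0` all overlapping predictions are weighted equally, so the result is their plain mean. A positive `decay` favors older predictions, which smooths control at the price of slower reaction to new observations.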
How to use
This is the base Mantis model. For detailed usage, please refer to our repository.
Citation
If you find our code or models useful in your work, please cite our paper:
@article{yang2025mantis,
  title={Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight},
  author={Yang, Yi and Li, Xueqi and Chen, Yiyang and Song, Jin and Wang, Yihan and Xiao, Zipeng and Su, Jiadi and Qiaoben, You and Liu, Pengfei and Deng, Zhijie},
  journal={arXiv preprint arXiv:2511.16175},
  year={2025}
}