---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# Mantis

> This is the official checkpoint of **Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight**

- **Paper:** https://arxiv.org/pdf/2511.16175
- **Code:** https://github.com/zhijie-group/Mantis

### 🔥 Highlights
- **Disentangled Visual Foresight** augments action learning without overburdening the backbone.
- **Progressive Training** preserves the understanding capabilities of the backbone.
- **Adaptive Temporal Ensemble** reduces inference cost while maintaining stable control.

### How to use
This is the base Mantis model. For detailed usage, please refer to [our repository](https://github.com/zhijie-group/Mantis).
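
As a minimal sketch, assuming the checkpoint is hosted as a regular Hugging Face Hub model repository, the files could be fetched locally with `snapshot_download` from the `huggingface_hub` package. The `REPO_ID` placeholder and the `fetch_checkpoint` helper are hypothetical names introduced here for illustration. Loading the model and running inference are handled by the code in the repository and are not shown.

```python
from pathlib import Path
from typing import Optional

# ASSUMPTION: placeholder only, not a real repo ID.
# Replace it with the ID shown at the top of this model page.
REPO_ID = "<org>/<model-name>"


def fetch_checkpoint(repo_id: str, local_dir: Optional[str] = None) -> Path:
    """Download every file of a Hub model repo and return the local folder.

    Hypothetical helper, not part of the Mantis codebase.
    """
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the path of the downloaded snapshot as a string.
    return Path(snapshot_download(repo_id=repo_id, local_dir=local_dir))
```

Calling `fetch_checkpoint(REPO_ID)` with a real repo ID would return the folder holding the checkpoint files, which can then be passed to the scripts in the repository.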

### Citation
If you find our code or models useful in your work, please cite [our paper](https://arxiv.org/pdf/2511.16175):
```
@article{yang2025mantis,
  title={Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight},
  author={Yang, Yi and Li, Xueqi and Chen, Yiyang and Song, Jin and Wang, Yihan and Xiao, Zipeng and Su, Jiadi and Qiaoben, You and Liu, Pengfei and Deng, Zhijie},
  journal={arXiv preprint arXiv:2511.16175},
  year={2025}
}
```