MM-ACT: Learn from Multimodal Parallel Generation to Act



(Figure: MM-ACT architecture overview)

This repository contains MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and generates across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and a one-step parallel decoding strategy for action generation to improve efficiency.
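To make the decoding strategies concrete, here is a minimal, self-contained sketch of confidence-based re-mask parallel decoding (in the MaskGIT style). The scoring function `toy_logits` is a random stand-in for the real network, and all names and the schedule are illustrative assumptions, not MM-ACT's actual implementation:

```python
import numpy as np

MASK = -1  # sentinel id for a still-masked position

def toy_logits(length, vocab_size, rng):
    """Stand-in for the model's per-position logits (hypothetical)."""
    return rng.standard_normal((length, vocab_size))

def remask_parallel_decode(seq_len=8, vocab_size=16, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    per_step = int(np.ceil(seq_len / steps))  # positions committed per pass
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_logits(seq_len, vocab_size, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)  # predict every position in parallel
        conf = probs.max(-1)
        # Commit the most confident masked positions; the rest are
        # re-masked and refined on the next pass.
        keep = masked[np.argsort(-conf[masked])[:per_step]]
        tokens[keep] = pred[keep]
    return tokens

print(remask_parallel_decode())
```

One-step parallel decoding for actions corresponds to the degenerate schedule `steps=1`: every masked position is predicted and committed in a single forward pass, which is what makes action generation fast.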

The model was presented in the paper MM-ACT: Learn from Multimodal Parallel Generation to Act.

Code: https://github.com/HHYHRHY/MM-ACT

Usage

For detailed usage, including training and deployment scripts, please refer to the official GitHub repository.

1. Clone Repo and Environment Setup

git clone https://github.com/HHYHRHY/MM-ACT.git
cd MM-ACT

# Create environment
conda create -n mmact python=3.13
conda activate mmact

# Install requirements
pip install -r requirement.txt

2. Dataset Preparation

  • LIBERO

    We use the LIBERO datasets from Huggingface_LeRobot and load robot data with LeRobot. Please download LIBERO-Object, LIBERO-Spatial, LIBERO-Goal, and LIBERO-10. For LIBERO-10, we also provide our task-planning datasets in LIBERO-10-task.

  • RoboTwin

    For RoboTwin, we use a dataset-sampling pipeline that includes task-planning generation. You can download our datasets, or collect your own with our pipeline in the Robotwin_subtask branch. This branch updates the original RoboTwin data-collection pipeline to support our subtask text annotations; collection usage is otherwise identical to the main branch. Please report any bugs or questions about the text annotations in MM-ACT's issue tracker.

3. Model Weight Preparation

Download the base model weights from MMaDA (MMaDA-8B-Base) and expand the original model's vocabulary with the action codebook (we use a codebook size of 2048):

python model_utils/resize_model_vocab.py --model ${origin_model_path} --out ${output_model_path} --num_new ${action_codebook_size}
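Conceptually, this step appends `num_new` rows (the action codebook) to the model's embedding matrix while leaving the original text/image rows untouched. The sketch below illustrates the idea on a plain NumPy matrix; the repository's `resize_model_vocab.py` operates on the full MMaDA checkpoint, so the function and initialization here are illustrative assumptions:

```python
import numpy as np

def expand_vocab(weight: np.ndarray, num_new: int, rng=None) -> np.ndarray:
    """Append num_new freshly initialized rows (the action codebook)
    to an embedding matrix, keeping the original rows unchanged."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Small-scale random init for the new action-token embeddings.
    new_rows = rng.standard_normal((num_new, weight.shape[1])) * 0.02
    return np.concatenate([weight, new_rows], axis=0)

orig = np.zeros((32, 8))                      # toy embedding table
expanded = expand_vocab(orig, num_new=2048)   # add the action codebook
print(expanded.shape)  # (2080, 8)
```

The output head is typically resized the same way so the model can both embed and predict the new action tokens.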

Acknowledgments

This work builds on MMaDA, RoboTwin, LIBERO, LeRobot, and OpenVLA. Thanks to the authors of these great works.
