InterRVOS: Interaction-aware Referring Video Object Segmentation
[GitHub] [Paper] [Quick Start]
This repository contains the model ReVIOSa-4B for the paper InterRVOS: Interaction-aware Referring Video Object Segmentation.
We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. Please refer to the project page for detailed visualization results.
We provide example code for running ReVIOSa with transformers.
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os
# load the model and tokenizer
path = "wooj0216/ReVIOSa-4B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
video_folder = "/PATH/TO/VIDEO_FOLDER"
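# collect the frame files; sorting keeps them in temporal order (assumes zero-padded frame names)
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]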
text_prompts = "<image>Please segment the child reaching out to man."
input_dict = {
'video': images_paths,
'text': text_prompts,
'past_text': '',
'mask_prompts': None,
'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]
masks = return_dict['prediction_masks']
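To inspect the predicted masks, it can help to blend them onto the input frames. The following is a minimal sketch, not part of the released code: `overlay_mask` is a hypothetical helper, and the layout of `masks` (a list in which each entry is an array of shape `(num_frames, H, W)`, aligned with `images_paths`) is an assumption carried over from Sa2VA that should be checked against the actual output.

```python
import numpy as np


def overlay_mask(frame, mask, color=(255, 0, 0), alpha=0.5):
    """Blend `color` into `frame` wherever `mask` is set.

    frame: (H, W, 3) uint8 RGB image.
    mask:  (H, W) boolean or 0/1 array.
    Returns a new (H, W, 3) uint8 array; the input frame is not modified.
    """
    mask = np.asarray(mask).astype(bool)
    if mask.shape != frame.shape[:2]:
        raise ValueError(
            f"mask shape {mask.shape} does not match frame shape {frame.shape[:2]}"
        )
    out = frame.astype(np.float32)  # astype returns a copy
    tint = np.asarray(color, dtype=np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * tint
    return out.round().astype(np.uint8)


# Hypothetical usage with the outputs of the example above, assuming
# masks[0] has shape (num_frames, H, W) and follows the order of images_paths:
#
# from PIL import Image
# os.makedirs("overlays", exist_ok=True)
# for path, frame_mask in zip(images_paths, masks[0]):
#     frame = np.array(Image.open(path).convert("RGB"))
#     blended = overlay_mask(frame, frame_mask)
#     Image.fromarray(blended).save(os.path.join("overlays", os.path.basename(path)))
```

If the model returns more than one mask sequence (for example, one for the actor and one for the target), calling the helper with a different `color` for each makes the two roles easy to tell apart.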
This project is based on Sa2VA. Many thanks to the authors for their great work!
If you find this repository useful, please consider citing the following paper:
@misc{jin2025interrvosinteractionawarereferringvideo,
title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
author={Woojeong Jin and Seongchan Kim and Jaeho Lee and Seungryong Kim},
year={2025},
eprint={2506.02356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.02356},
}