---
license: apache-2.0
datasets:
- internlm/Spatial-SSRL-81k
language:
- en
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
tags:
- multimodal
- spatial
- spatial understanding
- self-supervised learning
library_name: transformers
---


# Spatial-SSRL-Qwen3VL-4B

📖 <a href="https://arxiv.org/abs/2510.27606">Paper</a> | 🏠 <a href="https://github.com/InternLM/Spatial-SSRL">GitHub</a> | 🤗 <a href="https://huggingface.co/internlm/Spatial-SSRL-7B">Spatial-SSRL-7B Model</a> |
🤗 <a href="https://huggingface.co/internlm/Spatial-SSRL-Qwen3VL-4B">Spatial-SSRL-Qwen3VL-4B Model</a> |
🤗 <a href="https://huggingface.co/datasets/internlm/Spatial-SSRL-81k">Spatial-SSRL-81k Dataset</a> | 📰 <a href="https://huggingface.co/papers/2510.27606">Daily Paper</a>

Spatial-SSRL-Qwen3VL-4B is a large vision-language model for spatial understanding, built on Qwen3-VL-4B-Instruct. It is optimized with Spatial-SSRL, a lightweight self-supervised reinforcement learning
paradigm that scales RLVR efficiently. The model demonstrates strong spatial intelligence while preserving the general visual capabilities of the base model.

## 📢 News
- 🚀 [2025/11/24] We have released the [🤗Spatial-SSRL-Qwen3VL-4B Model](https://huggingface.co/internlm/Spatial-SSRL-Qwen3VL-4B), initialized from Qwen3-VL-4B-Instruct.
- 🚀 [2025/11/03] Now you can try out Spatial-SSRL-7B on [🤗Spatial-SSRL Space](https://huggingface.co/spaces/yuhangzang/Spatial-SSRL).
- 🚀 [2025/11/03] We have released the [🤗Spatial-SSRL-7B Model](https://huggingface.co/internlm/Spatial-SSRL-7B), and [🤗Spatial-SSRL-81k Dataset](https://huggingface.co/datasets/internlm/Spatial-SSRL-81k).
- 🚀 [2025/11/02] We have released the [🏠Spatial-SSRL Repository](https://github.com/InternLM/Spatial-SSRL).

## 🌈 Overview
We are thrilled to introduce <strong>Spatial-SSRL</strong>, a novel self-supervised RL paradigm for enhancing the spatial understanding of LVLMs.
By optimizing Qwen2.5-VL-7B with Spatial-SSRL, the model exhibits stronger spatial intelligence across seven spatial understanding benchmarks in both image and video settings.
<p style="text-align: center;"> 
  <img src="assets/teaser_1029final.png" alt="Teaser" width="100%"> 
</p>
Spatial-SSRL is a <strong>lightweight</strong>, tool-free framework that is naturally compatible with the RLVR training paradigm and easy to extend to a multitude of pretext tasks.
Five tasks are currently formulated in the framework, requiring only ordinary RGB and RGB-D images. <strong>We welcome you to contribute effective pretext tasks to Spatial-SSRL and further strengthen the capabilities of LVLMs!</strong>

<p style="text-align: center;"> 
  <img src="assets/pipeline_1029final.png" alt="Pipeline" width="100%"> 
</p>
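To make the idea concrete, here is a minimal, hypothetical sketch of how a verifiable pretext sample could be derived from a single ordinary RGB image with no external tools or labels. The patch-shuffling task and the `make_patch_order_sample` helper below are illustrative assumptions only; the five tasks actually used by Spatial-SSRL are defined in the paper and repository.

```python
# Hypothetical illustration: build a self-verifiable pretext sample from one RGB image.
# The real Spatial-SSRL pretext tasks are defined in the paper/repository.
import random
from PIL import Image

def make_patch_order_sample(img_path: str, grid: int = 2, seed: int = 0):
    """Cut the image into a grid of patches, shuffle them, and keep the true
    permutation as a free, automatically verifiable label."""
    img = Image.open(img_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid
    patches = [
        img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
        for r in range(grid) for c in range(grid)
    ]
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)

    shuffled = Image.new("RGB", (pw * grid, ph * grid))
    for dst, src in enumerate(order):
        r, c = divmod(dst, grid)
        shuffled.paste(patches[src], (c * pw, r * ph))

    question = (
        f"The image was cut into a {grid}x{grid} grid and the patches were shuffled. "
        "List the original index of each patch in reading order."
    )
    answer = " ".join(str(i) for i in order)  # ground truth comes from the shuffle itself
    return shuffled, question, answer
```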

## 💡 Highlights
- 🔥 **Highly Scalable:** Spatial-SSRL uses ordinary raw RGB and RGB-D images instead of richly-annotated public datasets or manual labels for data curation, making it highly scalable.
- 🔥 **Cost-effective:** The entire pipeline requires no human labels and no API calls to general LVLMs, which makes Spatial-SSRL cost-effective.
- 🔥 **Lightweight:** Prior approaches to spatial understanding rely heavily on annotations from external tools, which introduce errors into the training data and add cost. In contrast, Spatial-SSRL is completely tool-free and can easily be extended to more self-supervised tasks.
- 🔥 **Naturally Verifiable:** Intrinsic supervisory signals determined by the pretext objectives are naturally verifiable, aligning Spatial-SSRL well with the RLVR paradigm (a minimal verifier sketch follows the comparison figure below).
<p style="text-align: center;"> 
  <img src="assets/comparison_v2.png" alt="Teaser" width="100%"> 
</p>
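Because the label is produced by the pretext transformation itself, reward verification reduces to comparing the model's final answer against that label. Below is a minimal sketch of such a verifier, assuming the `\boxed{}` answer format shown in the Usage section; it is not the exact reward implementation used in training.

```python
# Minimal sketch of a verifiable reward, assuming the \boxed{} answer format;
# not the exact reward function used for Spatial-SSRL training.
import re

def extract_boxed_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def pretext_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the boxed answer matches the label
    generated by the pretext transformation, else 0.0."""
    pred = extract_boxed_answer(response)
    return 1.0 if pred is not None and pred.lower() == ground_truth.strip().lower() else 0.0
```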

## 📊 Results
We train Qwen3-VL-4B-Instruct with the Spatial-SSRL paradigm; average results on spatial understanding and general VQA benchmarks are shown below.
<p style="text-align: center;"> 
  <img src="assets/exp_result_new3.png" alt="Pipeline" width="100%"> 
</p>

## 🛠️ Usage
Here is a code snippet for a quick trial of <strong>Spatial-SSRL-Qwen3VL-4B</strong> on your own device. Download the model from 🤗<a href="https://huggingface.co/internlm/Spatial-SSRL-Qwen3VL-4B">Spatial-SSRL-Qwen3VL-4B Model</a> before running it.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText  # transformers==4.57.1
from qwen_vl_utils import process_vision_info  # qwen_vl_utils==0.0.14
import torch

model_path = "internlm/Spatial-SSRL-Qwen3VL-4B"  # change to your local path if the model is already downloaded

# Change the path to your input image
img_path = "assets/eg1.jpg"

# Change your question here
question = "Question: Consider the real-world 3D locations and orientations of the objects. If I stand at the man's position facing where it is facing, is the menu on the left or right of me?\nOptions:\nA. on the left\nB. on the right\n"

question += "Please select the correct answer from the options above. \n"
# We recommend using the format prompt below to keep inference consistent with training
format_prompt = "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}."

model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto', attn_implementation='flash_attention_2'
    )
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img_path,
            },
            {"type": "text", "text": question + format_prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Model Response:", output_text[0])
```
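Since the format prompt asks for reasoning inside `<think> </think>` and the final answer inside `\boxed{}`, you may want to post-process the raw response. The helper below is an optional sketch of one way to consume the output, not part of the official usage.

```python
# Optional post-processing helper for the response produced by the snippet above.
import re

def parse_response(response: str):
    """Split a Spatial-SSRL-style response into its reasoning trace and final boxed answer."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    reasoning = think.group(1).strip() if think else None
    answer = boxed[-1].strip() if boxed else None
    return reasoning, answer

# Example: reasoning, answer = parse_response(output_text[0])
```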

## Cases
<p style="text-align: center;"> 
  <img src="assets/case-qwen3vl.jpg" alt="Teaser" width="100%"> 
</p>


## ✒️Citation
If you find our model useful, please cite:
```
@article{liu2025spatial,
  title={Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning},
  author={Liu, Yuhong and Zhang, Beichen and Zang, Yuhang and Cao, Yuhang and Xing, Long and Dong, Xiaoyi and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.27606},
  year={2025}
}
```

## 📄 License
![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg) 

**Usage and License Notices**: The data and code are intended and licensed for research use only.