GRiP-7B: Guiding the Inner Eye
Overview
This repository contains the official model checkpoints of GRiP (Guided Reasoning and Perception), a novel visual grounded reasoning model developed by the Basic Algorithm Center, Platform and Content Group, Tencent.
Models capable of "thinking with images" represent a major leap in multimodal AI. GRiP is designed to cultivate robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. Initialized from Qwen2.5-VL-7B-Instruct, GRiP employs a two-stage training framework:
- Bootstrapping: Structured instruction tuning to teach the syntax of grounded reasoning.
- Policy Refinement: A cognitive-enhanced Reinforcement Learning (RL) stage featuring novel reward mechanisms.
GRiP achieves state-of-the-art results among open-source models on challenging benchmarks like TreeBench, V* Bench, and HR-Bench, demonstrating superior capability in complex visual reasoning.
Methodology
The core of GRiP lies in its Policy Refinement stage, which addresses the "Coarse Reward Problem" in existing RL methods. We introduce a multi-faceted reward architecture with two key components:
Salience-Weighted IoU Reward ($R_{\text{sw-IoU}}$): Incentivizes the model to prioritize mission-critical objects over trivial distractors by weighting the recall component of the IoU by each object's salience score $s_k$.
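As a concrete illustration, a salience-weighted IoU reward of this kind can be computed as below. This is a minimal sketch under our own assumptions (each ground-truth object's best-matching predicted box, weighted by its salience); the exact formula in the paper may differ.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def salience_weighted_iou_reward(pred_boxes, gt_boxes, salience):
    """Recall-style IoU reward: each ground-truth object contributes its
    best match against the predictions, weighted by its salience score s_k
    (illustrative reconstruction, not the paper's exact definition)."""
    if not gt_boxes:
        return 0.0
    num = sum(s * max((iou(p, g) for p in pred_boxes), default=0.0)
              for g, s in zip(gt_boxes, salience))
    return num / sum(salience)
```

Under this weighting, missing a high-salience object costs far more reward than missing a trivial distractor.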
Multi-Heuristic Reward ($R_{\text{MHR}}$): Encourages cognitive flexibility by rewarding diverse valid reasoning pathways (e.g., Bottom-Up, Top-Down, Deductive Verification). The model is rewarded based on its similarity to the best-matching reference trajectory.
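The best-matching idea can be sketched as follows. `difflib.SequenceMatcher` is our stand-in similarity metric; the actual metric and the set of reference trajectories are not specified in this card.

```python
from difflib import SequenceMatcher

def multi_heuristic_reward(trajectory, reference_trajectories):
    """Score a sampled reasoning trajectory by its similarity to the
    closest reference pathway (e.g. Bottom-Up, Top-Down, Deductive
    Verification), so any valid reasoning style can earn full reward."""
    if not reference_trajectories:
        return 0.0
    return max(SequenceMatcher(None, trajectory, ref).ratio()
               for ref in reference_trajectories)
```

Taking the maximum over references is what makes the reward pathway-agnostic: the model is not penalized for choosing a different, but still valid, line of reasoning.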
Performance
TreeBench Evaluation
TreeBench is a highly challenging benchmark for fine-grained perception and multi-step reasoning. GRiP significantly outperforms its base model and other open-source competitors.
| Method | Base Model | Overall | mIoU | Perception | Reasoning |
|---|---|---|---|---|---|
| GPT-4o-1120 | - | 46.9 | - | - | - |
| o3-0416 | - | 54.8 | - | - | - |
| LLaVA-OneVision-72B | LLaMA-3 | 40.5 | - | 62.1 | 53.7 |
| InternVL3-78B | InternViT | 46.4 | - | 62.1 | 61.0 |
| Qwen2.5-VL-7B | Qwen2.5 | 37.0 | - | 55.2 | 39.0 |
| DeepEyes-7B | Qwen2-VL | 37.5 | 30.0 | 62.1 | 36.6 |
| Pixel-Reasoner-7B | Qwen2-VL | 39.0 | 35.7 | 58.6 | 39.0 |
| GRiP (Ours) | Qwen2.5-VL-7B | 51.3 | 45.0 | 69.1 | 58.7 |
Generalization on V* Bench and HR-Bench
GRiP demonstrates strong generalization capabilities on attribute recognition, spatial understanding, and high-resolution reasoning.
| Method | V* Bench (Overall) | HR-Bench-4K (Overall) | HR-Bench-8K (Overall) |
|---|---|---|---|
| GPT-4o-1120 | 66.0 | - | - |
| o3-0416 | 95.7 | - | - |
| Qwen2.5-VL-7B | 74.3 | 72.1 | 68.8 |
| Qwen2.5-VL-72B | 84.8 | 79.4 | 76.3 |
| DeepEyes-7B | 90.0 | 75.1 | 72.6 |
| GRiP (Ours) | 91.9 | 78.6 | 75.0 |
Train and Inference
Please refer to our Hugging Face repository for training and inference code.
Training Details
- Hardware: 8 $\times$ NVIDIA H20 (96GB) GPUs.
- Frameworks: LLaMA-Factory for SFT, EasyRL for RL training.
- Optimization: AdamW optimizer, GRPO algorithm for Policy Refinement.
Acknowledgements
Our work is built upon the excellent Qwen2.5-VL. We also thank the developers of LLaMA-Factory and EasyRL for their efficient training frameworks.
Citation
If you find our work helpful, please cite:
```bibtex
@article{wei2025grip,
  title={Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning},
  author={Wei, Zhaoyang and Ding, Wenchao and Hao, Yanchao and Chen, Xi},
  journal={arXiv preprint},
  year={2025}
}
```