GRiP-7B: Guiding the Inner Eye

Arxiv | Huggingface

Overview

This repository contains the official model checkpoints of GRiP (Guided Reasoning and Perception), a novel visual grounded reasoning model developed by Basic Algorithm Center, Platform and Content Group, Tencent.

Models capable of "thinking with images" represent a major leap in multimodal AI. GRiP is designed to cultivate robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. Initialized from Qwen2.5-VL-7B-Instruct, GRiP employs a two-stage training framework:

  1. Bootstrapping: Structured instruction tuning to teach the syntax of grounded reasoning.
  2. Policy Refinement: A cognitive-enhanced Reinforcement Learning (RL) stage featuring novel reward mechanisms.

GRiP achieves state-of-the-art results among open-source models on challenging benchmarks like TreeBench, V* Bench, and HR-Bench, demonstrating superior capability in complex visual reasoning.

Methodology

The core of GRiP lies in its Policy Refinement stage, which addresses the "Coarse Reward Problem" in existing RL methods. We introduce a multi-faceted reward architecture:

Rtotal=Racc+Rfmt+Rsw-IoU+RMHR R_{\text{total}} = R_{\text{acc}} + R_{\text{fmt}} + R_{\text{sw-IoU}} + R_{\text{MHR}}

Where:

  • Salience-Weighted IoU Reward ($R_{\text{sw-IoU}}$): Incentivizes the model to prioritize mission-critical objects over trivial distractors. It weights the recall component by an object's salience score $s_k$: Rrecall=1โˆ‘skโˆ‘k=1Mskโ‹…maxโกiIoU(pi,gk) R_{\text{recall}} = \frac{1}{\sum s_k} \sum_{k=1}^{M} s_k \cdot \max_{i} \text{IoU}(p_i, g_k)

  • Multi-Heuristic Reward ($R_{\text{MHR}}$): Encourages cognitive flexibility by rewarding diverse valid reasoning pathways (e.g., Bottom-Up, Top-Down, Deductive Verification). The model is rewarded based on similarity to the best-matching reference trajectory: RMHR=maxโกjโˆˆ{1,2,3}sim(ฯ„gen,ฯ„refj) R_{\text{MHR}} = \max_{j \in \{1,2,3\}} \text{sim}(\tau_{\text{gen}}, \tau_{\text{ref}}^j)

image

Performance

TreeBench Evaluation

TreeBench is a highly challenging benchmark for fine-grained perception and multi-step reasoning. GRiP significantly outperforms its base model and other open-source competitors.

Method Base Model Overall mIoU Perception Reasoning
GPT-4o-1120 - 46.9 - - -
o3-0416 - 54.8 - - -
LLaVA-OneVision-72B LLaMA-3 40.5 - 62.1 53.7
InternVL3-78B InternViT 46.4 - 62.1 61.0
Qwen2.5-VL-7B Qwen2.5 37.0 - 55.2 39.0
DeepEyes-7B Qwen2-VL 37.5 30.0 62.1 36.6
Pixel-Reasoner-7B Qwen2-VL 39.0 35.7 58.6 39.0
GRiP (Ours) Qwen2.5-VL-7B 51.3 45.0 69.1 58.7

Generalization on V* Bench and HR-Bench

GRiP demonstrates strong generalization capabilities on attribute recognition, spatial understanding, and high-resolution reasoning.

Method V* Bench (Overall) HR-Bench-4K (Overall) HR-Bench-8K (Overall)
GPT-4o-1120 66.0 - -
o3-0416 95.7 - -
Qwen2.5-VL-7B 74.3 72.1 68.8
Qwen2.5-VL-72B 84.8 79.4 76.3
DeepEyes-7B 90.0 75.1 72.6
GRiP (Ours) 91.9 78.6 75.0

Train and Inference

Please refer to our Huggingface Repository for training and inference codes.

Training Details

  • Hardware: 8 $\times$ NVIDIA H20 (96GB) GPUs.
  • Frameworks: LLaMA-Factory for SFT, EasyRL for RL training.
  • Optimization: AdamW optimizer, GRPO algorithm for Policy Refinement.

Acknowledgements

Our work is built upon the excellent Qwen2.5-VL. We also thank the developers of LLaMA-Factory and EasyRL for their efficient training frameworks.

Citation

If you find our work helpful, please cite:

@article{wei2025grip,
  title={Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning},
  author={Wei, Zhaoyang and Ding, Wenchao and Hao, Yanchao and Chen, Xi},
  journal={arXiv preprint},
  year={2025}
}
Downloads last month
197
Safetensors
Model size
8B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for TencentBAC/GRiP

Finetuned
(906)
this model
Quantizations
2 models

Collection including TencentBAC/GRiP