BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions Paper • 2308.09936 • Published Aug 19, 2023 • 1
Matryoshka Query Transformer for Large Vision-Language Models Paper • 2405.19315 • Published May 29, 2024 • 1
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models Paper • 2410.08182 • Published Oct 10, 2024
Verbalized Representation Learning for Interpretable Few-Shot Generalization Paper • 2411.18651 • Published Nov 27, 2024
Interleaving Reasoning for Better Text-to-Image Generation Paper • 2509.06945 • Published Sep 8, 2025 • 15
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models Paper • 2509.25143 • Published Sep 29, 2025
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping Paper • 2510.08457 • Published Oct 9, 2025 • 13
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence Paper • 2512.10863 • Published Dec 11, 2025 • 22
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks Paper • 2604.08539 • Published 3 days ago • 38
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds Paper • 2604.08544 • Published 3 days ago • 10
InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation Paper • 2601.02456 • Published Jan 5, 2026 • 7
EgoSim: Egocentric World Simulator for Embodied Interaction Generation Paper • 2604.01001 • Published 11 days ago • 36