Geometrically-Constrained Agent for Spatial Reasoning Paper • 2511.22659 • Published 10 days ago • 38
RegionE: Adaptive Region-Aware Generation for Efficient Image Editing Paper • 2510.25590 • Published Oct 29 • 27
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26 • 136
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis Paper • 2509.10441 • Published Sep 12 • 30
A Survey of Context Engineering for Large Language Models Paper • 2507.13334 • Published Jul 17 • 259
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data Paper • 2507.07095 • Published Jul 9 • 54
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model Paper • 2410.13925 • Published Oct 17, 2024 • 24
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 12, 2024 • 41
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18, 2024 • 17
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model Paper • 2401.16420 • Published Jan 29, 2024 • 55
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities Paper • 2401.15071 • Published Jan 26, 2024 • 37