SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Paper • 2410.03859 • Published Oct 4, 2024 • 1
VideoGameBench: Can Vision-Language Models complete popular video games? Paper • 2505.18134 • Published May 23, 2025 • 6
Unlimiformer: Long-Range Transformers with Unlimited Length Input Paper • 2305.01625 • Published May 2, 2023 • 6
LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs Paper • 2307.10168 • Published Jul 19, 2023 • 10