WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Paper • 2606.09426 • Published 19 days ago • 102
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills Paper • 2605.23899 • Published May 22 • 29 • 2
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges Paper • 2604.13602 • Published Apr 15 • 32