Abstract
Qworld generates question-specific evaluation criteria through recursive expansion trees to better assess large language model capabilities on health-related questions.
Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria, and 79% of the criteria it generates are novel and validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
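The expansion procedure described above can be sketched in a few lines. The following Python sketch is illustrative only: `ask_llm`, the level names, and the branching prompt are assumptions standing in for Qworld's actual prompts and model calls, which are not given here.

```python
# Minimal sketch of a Qworld-style recursive expansion tree (illustrative only;
# the paper's actual prompts, level names, and model client are not given here).
# `ask_llm` is a hypothetical placeholder for an LLM call returning a list of strings.
from dataclasses import dataclass, field

LEVELS = ["scenario", "perspective", "criterion"]  # hierarchical depth, per the abstract


@dataclass
class Node:
    level: str
    text: str
    children: list["Node"] = field(default_factory=list)


def ask_llm(prompt: str) -> list[str]:
    """Placeholder: call your LLM of choice and return one expansion per line."""
    raise NotImplementedError


def expand(node: Node, depth: int = 0, branching: int = 4) -> Node:
    """Hierarchical expansion: each node spawns children one level deeper.
    Horizontal expansion is approximated by asking for `branching` siblings at once."""
    if depth >= len(LEVELS):  # leaves are fine-grained binary criteria; stop here
        return node
    child_level = LEVELS[depth]
    prompt = (
        f"Given the {node.level}: '{node.text}', list {branching} distinct items "
        f"at the '{child_level}' level that a high-quality answer must address."
    )
    for text in ask_llm(prompt):
        child = Node(level=child_level, text=text)
        expand(child, depth + 1, branching)
        node.children.append(child)
    return node


def collect_criteria(root: Node) -> list[str]:
    """Flatten the tree into the final list of binary criteria (the leaves)."""
    if root.level == "criterion":
        return [root.text]
    return [c for child in root.children for c in collect_criteria(child)]


# tree = expand(Node(level="question", text="How should I handle chest pain at home?"))
# criteria = collect_criteria(tree)  # question-specific yes/no checks for grading answers
```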
Community
Are we even measuring the right things when we evaluate LLMs?
We introduce QWorld, a framework where every question generates its own evaluation world through a recursive expansion tree. One question becomes 45+ fine-grained criteria. On HealthBench alone: 200k+ criteria across 530+ dimensions.
79% of QWorld's criteria are entirely novel. No expert had ever written them down, yet human judges validate they matter.
It surfaces blind spots in every frontier model: sustainability, equity, emergency recognition. Dimensions standard benchmarks don't even have.
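To make that dimension-level comparison concrete, here is a minimal sketch assuming each binary criterion carries a dimension tag; the tags and the aggregation are assumptions for illustration, not QWorld's published procedure.

```python
# Minimal sketch (assumed aggregation, not QWorld's published code): roll up
# per-criterion pass/fail judgments into per-dimension scores that expose blind spots.
from collections import defaultdict


def dimension_scores(judgments: list[tuple[str, bool]]) -> dict[str, float]:
    """judgments: (dimension, passed) pairs, one per fine-grained binary criterion."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # dim -> [passed, total]
    for dim, passed in judgments:
        totals[dim][0] += int(passed)
        totals[dim][1] += 1
    return {dim: passed / total for dim, (passed, total) in totals.items()}


# A model that nails dosing advice but ignores equity shows up immediately:
# dimension_scores([("equity", False), ("equity", False), ("dosing", True)])
# -> {"equity": 0.0, "dosing": 1.0}
```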
Librarian Bot (automated): I found the following papers similar to this paper, recommended by the Semantic Scholar API.
- Learning to Judge: LLMs Designing and Applying Evaluation Rubrics (2026)
- JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks (2026)
- SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation (2026)
- RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation (2026)
- ClinAlign: Scaling Healthcare Alignment from Clinician Preference (2026)
- PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses (2026)
- UEval: A Benchmark for Unified Multimodal Generation (2026)