arxiv:2512.17083

When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

Published on Dec 18

Authors:

Abstract

Evaluation of dialogue topic segmentation focuses on boundary density, segment coherence, and window-tolerant F1, revealing that reported performance differences are often due to annotation granularity mismatches rather than model improvements.

AI-generated summary

Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.17083 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2512.17083 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.17083 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.