arXiv:2603.28301

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Published on Mar 30 · Submitted by Chanyoung Kim on Apr 7
Abstract

Vision-Language-Action models show significant performance drops when handling paraphrased instructions due to surface-level matching rather than semantic understanding, highlighting the need for better linguistic generalization metrics.

AI-generated summary

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para

Community

Paper author Paper submitter

We introduce LIBERO-Para, a controlled benchmark that evaluates paraphrase robustness in VLA models by independently varying action expressions and object references. Dataset: https://huggingface.co/datasets/HAI-Lab/LIBERO-Para

Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/libero-para-a-diagnostic-benchmark-and-metrics-for-paraphrase-robustness-in-vla-models-5196-f79d33f2
Covers the executive summary, detailed methodology, and practical applications.


This paper caught my eye because of the benchmark angle. I found a helpful summary at https://arxivexplained.com/paper/libero-para-a-diagnostic-benchmark-and-metrics-for-paraphrase-robustness-in-vla-models that helped me get through it more quickly.

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

LIBERO-Para exposes a critical fragility in Vision-Language-Action (VLA) models: they break when given paraphrased instructions that describe the exact same task. A robot that successfully executes "pick up the red cup" may fail completely when told to "grab the crimson mug," despite both instructions requiring identical behavior. The benchmark provides a systematic diagnostic framework and dedicated metrics to measure and quantify this paraphrase brittleness across VLA architectures.

Key Idea

VLA models take camera images and language instructions as input and produce robot actions as output. LIBERO-Para demonstrates that these models are surprisingly sensitive to surface-level instruction wording rather than underlying task semantics. The pipeline exposes how the same visual scene and same desired action can produce dramatically different outcomes depending on how the instruction is phrased.
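The input/output contract described above can be sketched as a minimal interface. Everything here is illustrative: the class names, the 7-dimensional action, and the stub policy are assumptions for exposition, not the paper's actual API.

```python
# Minimal sketch of the VLA interface: images and an instruction string in,
# a low-level action vector out. Names and shapes are illustrative.

from dataclasses import dataclass
from typing import Sequence


@dataclass
class Observation:
    image: Sequence[float]   # flattened camera pixels (placeholder)
    instruction: str         # natural-language command


def vla_policy(obs: Observation) -> list[float]:
    """Stub policy: a real VLA model maps (image, instruction) -> action."""
    # A paraphrase-robust model should return the same action for
    # "pick up the red cup" and "grab the crimson mug" in the same scene.
    return [0.0] * 7         # e.g. a 7-DoF end-effector action


action = vla_policy(Observation(image=[0.0] * 16, instruction="pick up the red cup"))
```

The benchmark's core question is whether the output of such a policy stays stable when only `instruction` changes while `image` and the intended task stay fixed.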

[Figure: VLA pipeline]

Method / Approach

The benchmark systematically pairs original task instructions with semantically equivalent paraphrases and measures the performance gap. By holding the task, scene, and desired outcome constant while varying only the instruction wording, LIBERO-Para isolates paraphrase sensitivity as an independent failure mode. The results are stark: models that perform well on original instructions degrade by 22-52 percentage points on paraphrased versions.
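The gap measurement described above reduces to comparing success rates over matched episode sets. A minimal sketch, with hypothetical episode outcomes standing in for actual policy rollouts:

```python
# Sketch: the success-rate gap between original and paraphrased instructions,
# holding task and scene fixed. Outcomes below are hypothetical placeholders,
# not results from the paper.

def success_rate(outcomes: list[bool]) -> float:
    """Fraction of successful episodes."""
    return sum(outcomes) / len(outcomes)


def paraphrase_gap_pp(original: list[bool], paraphrased: list[bool]) -> float:
    """Performance drop in percentage points when switching to paraphrases."""
    return 100.0 * (success_rate(original) - success_rate(paraphrased))


# Example: 9/10 successes on original wording, 5/10 on paraphrases -> 40 pp drop
orig = [True] * 9 + [False]
para = [True] * 5 + [False] * 5
print(paraphrase_gap_pp(orig, para))  # 40.0
```

Reporting the gap in percentage points (as the paper does) rather than relative percent keeps drops comparable across models with different baseline success rates.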

[Figure: Paraphrase-induced performance breakdown]

Results

Across multiple VLA architectures, LIBERO-Para documents consistent and significant performance degradation when switching from original to paraphrased instructions. The dedicated robustness metrics quantify this gap per model, revealing that no current VLA architecture handles paraphrases gracefully. The benchmark provides a clear diagnostic signal for researchers working on more robust language grounding in robotic systems.
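The abstract notes that binary success rate treats all paraphrases equally, which PRIDE addresses by weighting for difficulty. The paper derives difficulty from semantic and syntactic factors not reproduced here, so the sketch below uses purely hypothetical weights to illustrate the idea of a difficulty-weighted score:

```python
# Illustrative difficulty-weighted robustness score, in the spirit of a
# PRIDE-style metric. Difficulty weights here are hypothetical stand-ins.

def weighted_success(results: list[tuple[bool, float]]) -> float:
    """results: (success, difficulty) pairs with difficulty in (0, 1].

    Harder paraphrases carry more weight, so a model that only succeeds
    on easy rewordings scores lower than its plain success rate.
    """
    total = sum(d for _, d in results)
    return sum(d for ok, d in results if ok) / total


# A model succeeding only on easy paraphrases (low difficulty):
results = [(True, 0.2), (True, 0.3), (False, 0.9), (False, 1.0)]
plain = sum(ok for ok, _ in results) / len(results)  # 0.5
weighted = weighted_success(results)                 # ~0.21, exposing the gap
```

The contrast between `plain` and `weighted` is the diagnostic signal: a large difference indicates the model relies on easier paraphrases rather than performing consistently across difficulty levels.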

[Figure: Robustness metric]


Get this paper in your agent:

hf papers read 2603.28301
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1