ToolRM-Qwen3-4B-Thinking-2507
[Paper] | [Dataset] | [Benchmark] | [Code]
Highlights
ToolRM is a family of lightweight generative reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench-BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series outperform several substantially larger LLMs in pairwise reward judgments. Beyond its training objective, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction.
Quickstart
The following code snippet illustrates how to use ToolRM to conduct a pairwise critique on given inputs:
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

def extract_choice_from_text(text: str):
    answer_pattern = r'<choice>\n(.*?)\n</choice>'
    match = re.search(answer_pattern, text, re.DOTALL)
    if not match:
        return None
    answer_str = match.group(1).strip()
    if answer_str in ['1', '2']:
        return answer_str
    else:
        return None

model_name = "RioLee/ToolRM-Qwen3-4B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# conduct pairwise judgment with ToolRM using the 'thinking' template:
prompt = """<task>
You are an expert evaluator of AI assistant performance. Given a complete user-assistant conversation history and two generated assistant responses, you are to conduct a thorough, fact-based, and comprehensive comparison. Based on specific evidence from your evaluation, make a clear choice of which response is superior. There may be a list of tools available to the assistant. The assistant starts with one or more cycles of (thinking about which tool to use -> performing tool call -> waiting for tool response), and ends with (thinking about the answer -> answer of the question). The thinking processes, tool calls, tool responses, and answer are enclosed within their tags. There could be multiple thinking processes, tool calls, tool call parameters and tool response parameters.
</task>
<evaluation_criteria>
- Available tools must be fully and appropriately leveraged to meet the requirements.
- Tool call names must be valid, correct, and complete.
- Tool call arguments must be valid, correct, and complete.
- Fabrication, including the creation of information or knowledge not provided by the user, conflicting with user input, or not derived from the tools, must be penalized.
- Repetitive or unnecessary tool calls must be penalized.
- Excessive or unnecessary requests for user clarification beyond what is essential must be penalized.
</evaluation_criteria>
<conversation_history>
[system]: # Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "spotify.play", "description": "Play specific tracks from a given artist for a specific time duration.", "parameters": {"type": "dict", "properties": {"artist": {"type": "string", "description": "The artist whose songs you want to play."}, "duration": {"type": "integer", "description": "The duration for which the songs should be played, in minutes."}}, "required": ["artist", "duration"]}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
[user]: Play songs from the artists Taylor Swift and Maroon 5, with a play time of 20 minutes and 15 minutes respectively, on Spotify.
</conversation_history>
<current_response_1>
<tool_call>
{"name": "spotify.play", "arguments": {"artist": "Taylor Swift", "duration": 20}}
</tool_call>
<tool_call>
{"name": "spotify.play", "arguments": {"artist": "Maroon 5", "duration": 15}}
</tool_call>
</current_response_1>
<current_response_2>
<tool_call>
{"name": "spotify_play", "arguments": {"artist": "Taylor Swift", "duration": 20}}
</tool_call>
</current_response_2>
Output your choice (either '1' or '2') within <choice></choice> XML tags. No explanations should precede or follow the choice. Answer in the following format.
<choice>
{your_choice}
</choice>
"""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
# thinking content: Okay, let's tackle this evaluation. So, the user wants the assistant to play songs ... Therefore, Response 1 is superior.
print("output content:", content)
# output content: <choice>\n1\n</choice>

choice = extract_choice_from_text(content)
print("final choice:", choice)
# final choice: 1
When processing batched prompts, model inference can be accelerated with the vLLM engine:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "RioLee/ToolRM-Qwen3-4B-Thinking-2507"

# same as the recommended settings of Qwen3-4B-Thinking-2507
inference_sampling_params = {
    'temperature': 0.6,
    'top_p': 0.95,
    'top_k': 20,
    'max_tokens': 8192,
}
sampling_params = SamplingParams(**inference_sampling_params)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompts = []  # replace this with a list of prompts for critique tasks
texts = []
for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)

llm = LLM(
    model=model_name,
    gpu_memory_utilization=0.8,
    max_model_len=32768,
    disable_cascade_attn=True,
)
outputs = llm.generate(texts, sampling_params)
for index, output in enumerate(outputs):
    output_text = output.outputs[0].text
    print(f"Model output of sample-{index}: {output_text}")
ToolRM can also perform pointwise and Best-of-N critiques; only minimal revisions to the prompt template are required:
POINTWISE_CRITIQUE_THINK_TEMPLATE="""<task>
You are an expert evaluator of AI assistant performance. Given a complete user-assistant conversation history and a generated assistant response, you are to conduct a thorough, fact-based, and comprehensive evaluation. Based on specific evidence from your evaluation, provide a concise critique on how the current assistant response should be revised. If the response is entirely correct and requires no changes, output '[correct]' as your critique.
</task>
<evaluation_criteria>
- Available tools must be fully and appropriately leveraged to meet the requirements.
- Tool call names must be valid, correct, and complete.
- Tool call arguments must be valid, correct, and complete.
- Fabrication, including the creation of information or knowledge not provided by the user, conflicting with user input, or not derived from the tools, must be penalized.
- Repetitive or unnecessary tool calls must be penalized.
- Excessive or unnecessary requests for user clarification beyond what is essential must be penalized.
</evaluation_criteria>
<conversation_history>
{chat_history}
</conversation_history>
<current_response>
{assistant_response}
</current_response>
Output your final critique within <critique></critique> XML tags. No explanations should precede or follow the critique. Answer in the following format.
<critique>
{{your_critique}}
</critique>
"""
BoN_CRITIQUE_THINK_TEMPLATE="""<task>
You are an expert evaluator of AI assistant performance. Given a complete user-assistant conversation history and {N} generated assistant responses, you are to conduct a thorough, fact-based, and comprehensive comparison. Based on specific evidence from your evaluation, make a clear choice of which response is superior. If multiple responses are identical and equally the best, select the one with the smallest number.
</task>
<evaluation_criteria>
- Available tools must be fully and appropriately leveraged to meet the requirements.
- Tool call names must be valid, correct, and complete.
- Tool call arguments must be valid, correct, and complete.
- Fabrication, including the creation of information or knowledge not provided by the user, conflicting with user input, or not derived from the tools, must be penalized.
- Repetitive or unnecessary tool calls must be penalized.
- Excessive or unnecessary requests for user clarification beyond what is essential must be penalized.
</evaluation_criteria>
<conversation_history>
{chat_history}
</conversation_history>
{N_assistant_response}
Output your choice (a number between 1 and {N}) within <choice></choice> XML tags. No explanations should precede or follow the choice. Answer in the following format.
<choice>
{{your_choice}}
</choice>
"""
For deployment, you can use vllm>=0.8.4 to create an OpenAI-compatible API endpoint:
vllm serve RioLee/ToolRM-Qwen3-4B-Thinking-2507 \
--max-model-len 32768 \
--enable-reasoning \
--reasoning-parser deepseek_r1
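The served model can then be queried with any OpenAI-compatible client. A minimal sketch using the openai Python package, assuming the server runs at the default http://localhost:8000 and prompt is a critique prompt built as above:

from openai import OpenAI

# any api_key value works for a local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RioLee/ToolRM-Qwen3-4B-Thinking-2507",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)
# with a reasoning parser enabled, vLLM separates the reasoning into
# message.reasoning_content and leaves the final verdict in message.content
print(completion.choices[0].message.content)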
Note
- ToolRM was trained with a maximum input length of 16,384; overly long prompts may cause unpredictable behavior.
- Swapping the order of the two assistant responses during evaluation is recommended to mitigate position bias in generative reward models; a minimal sketch follows.
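The sketch below assumes hypothetical build_pairwise_prompt(history, resp_a, resp_b) and judge(prompt) helpers (neither ships with ToolRM): render the pairwise prompt in both orders and keep only verdicts that agree.

def debiased_pairwise_judgment(history, resp_a, resp_b, build_pairwise_prompt, judge):
    # judge() is assumed to return the parsed choice: '1', '2', or None
    first = judge(build_pairwise_prompt(history, resp_a, resp_b))   # '1' -> resp_a wins
    second = judge(build_pairwise_prompt(history, resp_b, resp_a))  # '2' -> resp_a wins
    if first == '1' and second == '2':
        return 'a'
    if first == '2' and second == '1':
        return 'b'
    return None  # inconsistent verdicts: treat as a tie or resample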
Licenses
ToolRM is a research project developed by Alibaba Cloud and licensed under the CC BY-NC-SA 4.0 License.
Citation
If you find our work helpful, please consider citing it:
@misc{li2025modelcritiqueallrewarding,
  title={One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning},
  author={Renhao Li and Jianhong Tu and Yang Su and Hamid Alinejad-Rokny and Derek F. Wong and Junyang Lin and Min Yang},
  year={2025},
  eprint={2510.26167},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.26167},
}