	---
	library_name: transformers
	tags:
	- reward
	- RM
	- Code
	- CodeScaler
	license: mit
	datasets:
	- LARK-Lab/CodeScalerPair-51K
	language:
	- en
	base_model:
	- Skywork/Skywork-Reward-V2-Qwen3-4B
	---

	<h2 align="center">
	CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
	</h2>

	<p align="center">
	<a href="https://arxiv.org/abs/2602.17684">
	<img
	src="https://img.shields.io/badge/Paper-Arxiv-red?logo=arxiv&logoColor=red"
	alt="CodeScaler Paper on arXiv"
	/>
	</a>
	<a href="https://github.com/LARK-AI-Lab/CodeScaler">
	<img
	src="https://img.shields.io/badge/GitHub-Code-181717?logo=github&logoColor=white"
	alt="GitHub Code"
	/>
	</a>
	<a href="https://lark-ai-lab.github.io/codescaler.github.io/">
	<img
	src="https://img.shields.io/badge/GitHub-Page-4078c0?logo=github&logoColor=white"
	alt="GitHub Page"
	/>
	</a>
	<a href="https://huggingface.co/collections/LARK-Lab/codescaler">
	<img
	src="https://img.shields.io/badge/Datasets-Hugging%20Face%20Data-orange?logo=huggingface&logoColor=yellow"
	alt="Datasets on Hugging Face"
	/>
	</a>
	<a href="https://huggingface.co/collections/LARK-Lab/codescaler">
	<img
	src="https://img.shields.io/badge/CodeScaler-Hugging%20Face%20Model-FFCC00?logo=huggingface&logoColor=yellow"
	alt="CodeScaler on Hugging Face"
	/>
	</a>


	</p>

	## Overview


	We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.

	This is the official CodeScaler-4B checkpoint, fine-tuned from Skywork/Skywork-Reward-V2-Qwen3-4B on [LARK-Lab/CodeScalerPair-51K](https://huggingface.co/datasets/LARK-Lab/CodeScalerPair-51K).
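The overview mentions syntax-aware code extraction and validity-preserving reward shaping without spelling them out. The sketch below is one plausible reading, not the authors' implementation: it takes the last fenced block in a response that parses as Python and falls back to a fixed penalty when none does. The names `extract_valid_code`, `shaped_reward`, and `invalid_penalty`, and the penalty value itself, are hypothetical and chosen only for illustration.

```python
import ast
import re
from typing import Optional

# Build the fence marker programmatically so this snippet contains no literal fence.
FENCE = "`" * 3
# Assumption: candidate programs arrive inside markdown-style fenced blocks.
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_valid_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, or None if none does."""
    for block in reversed(FENCE_RE.findall(response)):
        try:
            ast.parse(block)  # syntax check only; the code is never executed
        except SyntaxError:
            continue
        return block
    return None


def shaped_reward(response: str, rm_score: float, invalid_penalty: float = -1.0) -> float:
    """Keep the RM score for syntactically valid code, else return a fixed penalty.

    The penalty value is an illustrative assumption, not a documented setting.
    """
    if extract_valid_code(response) is None:
        return invalid_penalty
    return rm_score
```

Under these assumptions, a response with no parseable code never receives a high reward, which is one way a policy could be kept from drifting toward malformed outputs during RL. See the repository linked under RL Training for the actual procedure.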

	## Performance on RM-Bench
	| Model | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg |
	| ------------------------------------ | ---- | ----- | ----- | ------ | ----- | ------ | ---- | ---- |
	| Skywork/Skywork-Reward-Llama-3.1-8B | 54.5 | 69.5 | 60.6 | 95.7 | 89 | 74.7 | 46.6 | 70.1 |
	| TIGER-Lab/AceCodeRM-7B | 66.9 | 66.7 | 65.3 | 89.9 | 79.9 | 74.4 | 62.2 | 72.2 |
	| TIGER-Lab/AceCoder-RM-32B | 72.1 | 73.7 | 70.5 | 88 | 84.5 | 78.3 | 65.5 | 76.1 |
	| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9 | 92.8 | 82.3 | 54.5 | 76.6 |
	| Skywork/Skywork-Reward-V2-Qwen3-4B | 74.4 | 78.2 | 73.6 | 95.7 | 92.1 | 85 | 64.4 | 80.5 |
	| Skywork/Skywork-Reward-V2-Qwen3-8B | 73.6 | 80.6 | 75 | 96.5 | 91.8 | 85.5 | 67 | 80.5 |
	| CodeScaler-1.7B | 73.1 | 74.4 | 74.7 | 93.1 | 91.7 | 83.2 | 61.5 | 78.8 |
	| CodeScaler-4B (this model) | 76.3 | 80.4 | 79 | 95.8 | 92.9 | 86.5 | 69.2 | 82.9 |
	| CodeScaler-8B | 76.9 | 83 | 79.9 | 96.4 | 92.5 | 87.9 | 71.8 | 84.1 |

	## Usage

	### RM Scoring
	````python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification



	device = "cuda" if torch.cuda.is_available() else "cpu"

	model_path = 'LARK-Lab/CodeScaler-4B'

	tokenizer = AutoTokenizer.from_pretrained(model_path)
	reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
	reward_model.eval()

	question = """\
	Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
	A subarray is a contiguous part of the array.
	For example:
	```
	Input:
	nums = [1, 1, 1], k = 2

	Output:
	2
	```
	"""

	program_correct = """\
	from collections import defaultdict

	def subarraySum(nums, k):
	    prefix = 0
	    count = 0
	    freq = defaultdict(int)
	    freq[0] = 1  # Important: subarray starting from index 0

	    for num in nums:
	        prefix += num

	        if prefix - k in freq:
	            count += freq[prefix - k]

	        freq[prefix] += 1

	    return count
	"""

	program_wrong = """\
	def subarraySum(nums, k):
	    left = 0
	    curr_sum = 0
	    count = 0

	    for right in range(len(nums)):
	        curr_sum += nums[right]

	        while curr_sum > k and left <= right:
	            curr_sum -= nums[left]
	            left += 1

	        if curr_sum == k:
	            count += 1

	    return count
	"""


	# Build one two-turn conversation (question, candidate program) per candidate.
	convs = [
	    [
	        {
	            "content": question,
	            "role": "user",
	        },
	        {
	            "role": "assistant",
	            "content": program,
	        },
	    ]
	    for program in [program_correct, program_wrong]
	]

	# Render each conversation to plain text with the model's chat template.
	texts = [
	    tokenizer.apply_chat_template(conv, tokenize=False)
	    for conv in convs
	]

	toks = tokenizer(
	    texts,
	    truncation=True,
	    padding=True,
	    max_length=2048,
	    return_tensors="pt",
	)

	# Score both candidates in one batch; each sequence yields a single logit.
	with torch.no_grad():
	    outputs = reward_model(
	        input_ids=toks["input_ids"].to(device),
	        attention_mask=toks["attention_mask"].to(device),
	    )
	    scores = outputs.logits.squeeze(-1).cpu().tolist()


	print("RM Scores:", scores)
	# RM Scores: [12.552595138549805, 3.382493019104004]

	````
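For test-time inference, the scores above can be used to rank several sampled programs and keep the highest-scoring one. The following is a minimal sketch of that best-of-N selection in plain Python. `best_of_n` and `preference_probability` are hypothetical helper names, and reading the score gap as a Bradley-Terry preference probability assumes the reward model was trained with a pairwise objective of that kind, which this card does not state.

```python
import math
from typing import Sequence


def best_of_n(candidates: Sequence[str], scores: Sequence[float]) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and equal in length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]


def preference_probability(score_a: float, score_b: float) -> float:
    """Sigmoid of the score gap: P(a preferred over b) under a Bradley-Terry assumption."""
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # Numerically stable branch for large negative gaps.
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Example with the two scores printed above.
example_scores = [12.552595138549805, 3.382493019104004]
chosen = best_of_n(["program_correct", "program_wrong"], example_scores)
confidence = preference_probability(*example_scores)
```

Here `chosen` is the label of the first candidate, since it has the larger score. `confidence` is only as meaningful as the Bradley-Terry assumption behind it.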

	### RL Training
	Please refer to [https://github.com/LARK-AI-Lab/CodeScaler](https://github.com/LARK-AI-Lab/CodeScaler) for RL training details.

	## Citation
	If you find our work helpful, please consider citing:
	```
	@misc{zhu2026codescalerscalingcodellm,
	  title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
	  author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
	  year={2026},
	  eprint={2602.17684},
	  archivePrefix={arXiv},
	  primaryClass={cs.LG},
	  url={https://arxiv.org/abs/2602.17684},
	}
	```