CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

GitHub Code
GitHub Page
Datasets on Hugging Face
CodeScaler on Hugging Face

Overview

We propose CodeScaler, an execution-free reward model (it scores candidate programs without running them) designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems, and it incorporates syntax-aware code extraction and validity-preserving reward shaping to keep optimization stable and robust.
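As a rough illustration of how an execution-free reward model can be used at test time, the sketch below does best-of-N selection: it pulls the last fenced Python block out of each sampled response, gives candidates with no syntactically valid code a fixed floor reward, and keeps the highest-scoring one. The helper names (`extract_code`, `best_of_n`), the fence pattern, and the `score_fn` callable are hypothetical; in practice `score_fn` would wrap a CodeScaler forward pass like the one in the Usage section. This is a sketch under those assumptions, not the authors' extraction or reward-shaping logic.

```python
import ast
import re
from typing import Callable, List, Optional

# Hypothetical pattern: fenced blocks tagged ```python, ```py, or untagged.
FENCE_RE = re.compile(r"```(?:python|py)?[ \t]*\n(.*?)```", re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, else None.

    Falls back to treating the whole response as code when it has no fences.
    """
    blocks = FENCE_RE.findall(response) or [response]
    for block in reversed(blocks):
        try:
            ast.parse(block)
        except (SyntaxError, ValueError):
            continue  # not valid Python source; try an earlier block
        return block
    return None


def best_of_n(
    responses: List[str],
    score_fn: Callable[[str], float],
    invalid_reward: float = float("-inf"),
) -> int:
    """Index of the response whose extracted code gets the highest reward."""
    if not responses:
        raise ValueError("need at least one candidate response")
    rewards = []
    for response in responses:
        code = extract_code(response)
        # Assumed shaping rule: no valid code -> fixed floor reward, no RM call.
        rewards.append(invalid_reward if code is None else score_fn(code))
    return max(range(len(responses)), key=lambda i: rewards[i])
```

For example, with a toy `score_fn=lambda code: float(len(code))` standing in for the reward model, `best_of_n` skips a candidate whose only block is `def g(:` and returns the index of the longest valid one. Gating on `ast.parse` is a cheap syntax check, so no candidate program is ever executed.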

This repository hosts the official CodeScaler-1.7B checkpoint, fine-tuned from Skywork/Skywork-Reward-V2-Qwen3-1.7B on the LARK-Lab/CodeScalerPair-51K dataset.
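The card does not spell out the training objective. Assuming the standard Bradley-Terry pairwise loss that scalar reward models of this kind (including the Skywork-Reward family it is initialized from) are typically trained with, each (chosen, rejected) pair contributes -log σ(r_chosen - r_rejected). A minimal pure-Python sketch; the function names are illustrative, not part of any released code:

```python
import math
from typing import Iterable, Tuple


def preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that `chosen` is preferred: sigmoid(r_c - r_r)."""
    d = r_chosen - r_rejected
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # d < 0, so exp(d) cannot overflow
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_chosen - r_rejected) over (chosen, rejected) score pairs."""
    # -log sigmoid(m) == log(1 + exp(-m)) == softplus(-m)
    losses = [softplus(-(r_c - r_r)) for r_c, r_r in pairs]
    if not losses:
        raise ValueError("need at least one preference pair")
    return sum(losses) / len(losses)
```

Under this model only score differences matter: applying `preference_prob` to the two scores printed in the Usage section (about 12.51 and -0.47) gives a probability extremely close to 1 that the correct program is preferred.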

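Performance on RM-Bench

Scores are RM-Bench accuracies (%) by domain (Code, Chat, Math, Safety) and by difficulty level (Easy, Normal, Hard); Avg is the overall average.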

Model	Code	Chat	Math	Safety	Easy	Normal	Hard	Avg
Skywork/Skywork-Reward-Llama-3.1-8B	54.5	69.5	60.6	95.7	89	74.7	46.6	70.1
TIGER-Lab/AceCodeRM-7B	66.9	66.7	65.3	89.9	79.9	74.4	62.2	72.2
TIGER-Lab/AceCodeRM-32B	72.1	73.7	70.5	88	84.5	78.3	65.5	76.1
Skywork/Skywork-Reward-V2-Qwen3-1.7B	72.3	69.6	71.4	92.9	92.8	82.3	54.5	76.6
Skywork/Skywork-Reward-V2-Qwen3-4B	74.4	78.2	73.6	95.7	92.1	85	64.4	80.5
Skywork/Skywork-Reward-V2-Qwen3-8B	73.6	80.6	75	96.5	91.8	85.5	67	80.5
CodeScaler-1.7B (this model)	73.1	74.4	74.7	93.1	91.7	83.2	61.5	78.8
CodeScaler-4B	76.3	80.4	79	95.8	92.9	86.5	69.2	82.9
CodeScaler-8B	76.9	83	79.9	96.4	92.5	87.9	71.8	84.1

Usage

RM Scoring

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the reward model: a sequence-classification head that outputs one scalar score per input.
model_path = 'LARK-Lab/CodeScaler-1.7B'

tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

# A sample problem statement.
question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2

Output:
2
```
"""

# Correct solution: counts matching prefix sums (also works with negative numbers).
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1 # Important: subarray starting from index 0

    for num in nums:
        prefix += num

        if prefix - k in freq:
            count += freq[prefix - k]

        freq[prefix] += 1

    return count
"""

# Incorrect solution: a sliding window, which is not correct in general
# (for example, when nums contains negative numbers).
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0

    for right in range(len(nums)):
        curr_sum += nums[right]

        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1

        if curr_sum == k:
            count += 1

    return count
"""

# Build one (user question, assistant answer) conversation per candidate program.
convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program
        }
    ] for program in [program_correct, program_wrong]
]

# Render each conversation with the chat template, then tokenize the batch with padding.
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

# Score both conversations without gradients: one scalar reward per conversation.
with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
    scores = outputs.logits.squeeze(-1).cpu().tolist()

# Higher is better: the correct program should receive the higher score.
print("RM Scores:", scores)
# RM Scores: [12.513851165771484, -0.46548914909362793]
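To see concretely why the second program is labeled wrong, the plain-Python check below (no model needed) re-implements the logic of `program_correct` and `program_wrong` as ordinary functions and compares both against a brute-force count. The function names and test inputs are our own illustration; the point is that the sliding window only looks correct on inputs like the one in the problem statement, which is the kind of error the reward model is meant to catch without executing code.

```python
from collections import defaultdict


def subarray_sum_prefix(nums, k):
    """Same logic as program_correct: count earlier prefix sums equal to prefix - k."""
    prefix, count = 0, 0
    freq = defaultdict(int)
    freq[0] = 1  # the empty prefix, so subarrays starting at index 0 are counted
    for num in nums:
        prefix += num
        if prefix - k in freq:
            count += freq[prefix - k]
        freq[prefix] += 1
    return count


def subarray_sum_window(nums, k):
    """Same logic as program_wrong: shrink the window while its sum exceeds k."""
    left, curr_sum, count = 0, 0, 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count


def brute_force(nums, k):
    """O(n^2) reference: check every contiguous subarray."""
    return sum(
        1
        for i in range(len(nums))
        for j in range(i + 1, len(nums) + 1)
        if sum(nums[i:j]) == k
    )


for nums, k in [([1, 1, 1], 2), ([1, -1, 0], 0)]:
    print(
        f"{nums} k={k}: brute={brute_force(nums, k)} "
        f"prefix={subarray_sum_prefix(nums, k)} window={subarray_sum_window(nums, k)}"
    )
# [1, 1, 1] k=2: brute=2 prefix=2 window=2
# [1, -1, 0] k=0: brute=3 prefix=3 window=1
```

Both programs agree on the example from the problem statement, so a single visible test would not separate them; the negative-number input does.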

RL Training

Please refer to https://github.com/LARK-AI-Lab/CodeScaler for RL training details.

Citation

If you find our work helpful, please consider citing:

@misc{zhu2026codescalerscalingcodellm,
 title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models}, 
 author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
 year={2026},
 eprint={2602.17684},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2602.17684}, 
}


Safetensors
Model size: 2B params
Tensor type: BF16

Model tree for LARK-Lab/CodeScaler-1.7B

Base model: Qwen/Qwen3-1.7B-Base
Finetuned: Qwen/Qwen3-1.7B
Finetuned: Skywork/Skywork-Reward-V2-Qwen3-1.7B
Finetuned: this model

Dataset used to train LARK-Lab/CodeScaler-1.7B: LARK-Lab/CodeScalerPair-51K


Paper for LARK-Lab/CodeScaler-1.7B: arXiv 2602.17684 (published Feb 4)

URL: https://huggingface.co/LARK-Lab/CodeScaler-1.7B
