NeuralTxt Reward Model
A small, fast reward model fine-tuned on research-paper Q&A data. It scores a response on a 0-1 scale by comparing it against a reference answer. This is an answer-equivalence model: it is trained to judge whether a response matches the reference in correctness and completeness. Do not use it for style-matching tasks.
Purpose
RL training for language models needs fast reward signals. An external LLM-as-judge is accurate but slow and expensive. Lexical metrics (F1, ROUGE-L) are fast, but factually wrong responses fool them whenever lexical overlap is high (78% of the time for Word F1 on the confound benchmark). Traditional autoregressive reward models trained directly on preference datasets are large and consume a lot of GPU resources.
This model balances speed and accuracy: 22M params, runs on CPU, 5% hallucination rate on hard confound examples, 97% accuracy on human preference data. Note that unlike autoregressive reward models, neuraltxt-reward-22M needs a reference answer to generate rewards.
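The lexical-overlap failure mode described above can be reproduced in a few lines. The sketch below is a plain whitespace-token F1, not the exact Word F1 implementation used in the benchmarks, and the sentence pair is a made-up illustration: one swapped word reverses the meaning while six of seven tokens still match.

```python
from collections import Counter


def word_f1(response: str, reference: str) -> float:
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    resp = response.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair: a single word flips the meaning.
reference = "The learning rate was increased after warmup"
wrong = "The learning rate was decreased after warmup"

f1_wrong = word_f1(wrong, reference)
print(round(f1_wrong, 3))  # 0.857
```

A score of 0.857 for a factually opposite answer is the kind of reward a lexical metric would hand to an RL policy, which is the gap a reference-based learned scorer is meant to close.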
Quick Start
Install the neural-txt library with pip or uv:
pip install neural-txt
Then score a response against a reference:
from neuraltxt import NeuralTxtReward
reward = NeuralTxtReward()
score = reward.score(
    response="Attention is all you need.",
    reference="All you need is attention.",
)
You can also score a batch of responses against one reference, or rank them:
reference = "Attention is all you need."
responses = [
    "All you need is attention.",
    "You do not need attention.",
]
scores = reward.batch_score(responses, reference)
ranked = reward.rank(responses, reference)
Alternatively, you can skip the library and load the encoder and head directly with this short script:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
def meanmax_pool(hidden, mask):
    # Masked mean and masked max over the token axis, concatenated.
    mask_f = mask.unsqueeze(-1).float()
    mean = (hidden * mask_f).sum(1) / mask_f.sum(1).clamp(min=1e-9)
    mx = hidden.masked_fill(mask_f == 0, float("-inf")).max(1).values
    return torch.cat([mean, mx], dim=-1)
# Load encoder and tokenizer
encoder = AutoModel.from_pretrained("paperbd/neuraltxt-reward-tiny")
tokenizer = AutoTokenizer.from_pretrained("paperbd/neuraltxt-reward-tiny")
# Load trained head (~3KB, 768-dim input from mean+max concat)
head = nn.Sequential(nn.Dropout(0.2), nn.Linear(768, 1))
head.load_state_dict(torch.load("head_weights.pt", map_location="cpu",
                                weights_only=True))
head.eval()  # nn.Module starts in training mode; this disables Dropout for inference
def score(reference, response):
    text = f"{reference} [SEP] {response}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = encoder(**enc)
        pooled = meanmax_pool(out.last_hidden_state, enc["attention_mask"])
        # The head has no sigmoid, so clamp the raw output to [0, 1].
        return max(0.0, min(1.0, head(pooled).item()))
print(score("Attention is all you need.", "You do not need attention."))
# → 0.75-0.85
Benchmark Scores
Model compared against RewardBert (149M params) and Word F1 across 4 datasets:
| Dataset | Description | Ours | RewardBert | Word F1 |
|---|---|---|---|---|
| paperbd/paper_answers_reward | Paper instruction Q&A, n=1,780 | 0.68 | 0.44 | — |
| sft_v2 100-pt eval | Narrow quality eval, n=100 | 0.70 | 0.44 | 0.81 |
| Answer equivalence | Binary same-meaning, n=4,446 | 0.93 | 0.86 | 0.89 |
| Confound resistance | Factually wrong, high overlap (low is better) | 6% fooled | 53% | 78% |
Values are Spearman correlation vs human judge, except for two rows: answer equivalence reports ROC-AUC, and confound resistance reports the % of factually wrong responses scored >0.7.
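The confound metric is a simple threshold count. The sketch below shows one way to compute it; the function name `fooled_rate` and the example scores are hypothetical, and only the 0.7 threshold comes from the definition above.

```python
def fooled_rate(scores, threshold=0.7):
    """Fraction of factually wrong responses that scored above the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s > threshold for s in scores) / len(scores)


# Made-up scores for five factually wrong, high-overlap responses.
confound_scores = [0.82, 0.41, 0.35, 0.66, 0.12]

rate = fooled_rate(confound_scores)
print(f"{rate:.0%} fooled")  # 20% fooled
```

Lower is better here: every response in the set is wrong by construction, so any score above the threshold is a false positive.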
One-Word Swap Detection (150 real test-split references)
How well the model detects a single word change that flips meaning, while ignoring meaning-preserving swaps:
| Perturbation | Score Drop | Noticed |
|---|---|---|
| Antonym swap | +0.315 | 81% |
| Negation flip | +0.366 | 85% |
| Number swap | +0.315 | 76% |
| Random word (untrained) | +0.150 | 41% |
| Synonym swap (control) | -0.004 | 3% ✓ |
A swap is "noticed" when the score drops by more than 0.15 relative to the exact-match score for the same reference. The model is sensitive to which word changed: synonym swaps cause almost no drop (-0.004), while meaning-flipping swaps are strongly penalized.
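The "Score Drop" and "Noticed" columns can be derived from paired scores as in the sketch below. It assumes you already have, for each reference, the exact-match score and the score of the perturbed response; the function name and the numbers are illustrative, not taken from the evaluation.

```python
def swap_detection(exact_scores, perturbed_scores, min_drop=0.15):
    """Return (mean score drop, fraction of swaps noticed) for paired scores."""
    if len(exact_scores) != len(perturbed_scores) or not exact_scores:
        raise ValueError("need two non-empty lists of equal length")
    drops = [e - p for e, p in zip(exact_scores, perturbed_scores)]
    # A swap counts as noticed only if the drop exceeds the threshold.
    noticed = sum(d > min_drop for d in drops)
    return sum(drops) / len(drops), noticed / len(drops)


# Made-up scores for four references and their negation-flipped variants.
exact = [0.90, 0.85, 0.95, 0.80]
negated = [0.50, 0.75, 0.40, 0.30]

mean_drop, noticed_frac = swap_detection(exact, negated)
print(f"mean drop {mean_drop:.3f}, noticed {noticed_frac:.0%}")
```

For a control perturbation such as a synonym swap, the same function should return a mean drop near zero and a low noticed fraction.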
Architecture
- Base: sentence-transformers/all-MiniLM-L6-v2 (22M params)
- Head: Dropout(0.2) → Linear(768, 1); no sigmoid, so the raw output is clamped to [0, 1] at scoring time
- Pooling: mean+max concatenation (384 + 384 = 768-dim, preserves both semantic average and single-token signals)
- Input: "{reference} [SEP] {response}"
- Training: ~38K records, last 5 encoder layers unfrozen (8.87M trainable params), MSE loss
Training Data
paperbd/paper_answers_reward — ~6K scored response pairs from research paper Q&A tasks, augmented with:
- Confound training: Factually wrong responses with high word overlap (scored 3.0 on 1-5 scale)
- Preference distillation: Worst-ranked responses from LLM judge comparisons (scored 1.0)
- Contrastive minimal-edit augmentation: One-word synonym/filler swaps (keeps meaning, score 4.5) vs antonym/negation/number swaps (flips meaning, score 3.0) — teaches the model to attend to which word changed, not just edit distance
- Cross-domain: 10K Feedback-Collection + 9K answer equivalence + 5.7K STS-B sentence similarity examples
Limitations
- Best on research/Q&A text, weaker on creative/conversational
- Requires a reference answer; not a standalone quality scorer
- Underscores systematically (mean ~0.37 vs target ~0.50) — ranking is good, calibration may need bias correction
- Some one-word factual swaps still go undetected (e.g., untrained vocabulary words noticed 41% of the time), though trained swap types (antonyms, negations, numbers) are caught 76-85% of the time
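For the calibration limitation, the simplest possible bias correction is a constant shift. This is only a sketch of one option, not a method described by the authors: it assumes the gap between the observed mean (~0.37) and the target mean (~0.50) is roughly uniform across the score range, and the function name is hypothetical. A correction fitted on held-out scored data would likely be more reliable.

```python
def shift_calibrate(score, observed_mean=0.37, target_mean=0.50):
    """Add a constant offset to a raw score and clamp the result to [0, 1]."""
    offset = target_mean - observed_mean  # assumption: uniform underscoring
    return max(0.0, min(1.0, score + offset))


# Made-up raw scores, listed in ascending order.
raw = [0.20, 0.37, 0.55, 0.95]
calibrated = [shift_calibrate(s) for s in raw]
print([round(c, 2) for c in calibrated])  # [0.33, 0.5, 0.68, 1.0]
```

Because the shift is monotonic, the ranking of responses is unchanged, apart from ties created where scores hit the upper clamp.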
License
MIT