NeuralTxt Reward Model

A small, fast reward model fine-tuned on research paper Q&A data. It scores a response on a 0-1 scale by comparing it against a reference answer. This is an answer-equivalence model: it is trained to judge whether a response matches the reference answer in correctness and completeness. Do not use it for style-matching tasks.

Purpose

RL training for language models needs fast reward signals. An external LLM-as-judge is accurate but slow and expensive. Lexical metrics (F1, ROUGE-L) are fast, but a factually wrong response fools them whenever lexical overlap is high (Word F1 is fooled 78% of the time on the confound set reported under Benchmark Scores). Traditional autoregressive reward models trained directly on preference datasets are large and use a lot of GPU memory and compute.

This model balances speed and accuracy: 22M params, runs on CPU, 5% hallucination rate on hard confound examples, 97% accuracy on human preference data. Unlike autoregressive reward models, neuraltxt-reward-22M needs a reference answer to generate rewards.
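
To make the lexical-overlap failure concrete, here is a minimal sketch of a SQuAD-style word-level F1. The tokenization and the two example sentences are assumptions made for illustration; the exact Word F1 variant used in the benchmarks is not specified in this card.

```python
import re
from collections import Counter


def word_f1(response: str, reference: str) -> float:
    """Token-overlap F1 between a response and a reference (case-insensitive)."""
    resp_tokens = re.findall(r"\w+", response.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not resp_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(resp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair: the response contradicts the reference.
reference = "The encoder uses self-attention in every layer."
wrong = "The encoder uses no self-attention in any layer."

print(round(word_f1(wrong, reference), 2))  # 0.82
```

The response states the opposite of the reference yet shares 7 of its 8 tokens, so a purely lexical reward would treat it as a good answer.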

Quick Start

Install the neural-txt library via pip or uv:

pip install neural-txt

Then score a response against a reference:

from neuraltxt import NeuralTxtReward

reward = NeuralTxtReward()
score = reward.score(
    response="Attention is all you need.",
    reference="All you need is attention.",
)

You can also score a batch of responses against one reference, or rank them:

reference = "Attention is all you need."
responses = [
    "All you need is attention.",
    "You do not need attention.",
]

scores = reward.batch_score(responses, reference)
ranked = reward.rank(responses, reference)

Alternatively, you can skip the library and load the encoder and the scoring head directly with this script:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


def meanmax_pool(hidden, mask):
    # Masked mean and masked max over the token axis, concatenated.
    mask_f = mask.unsqueeze(-1).float()
    mean = (hidden * mask_f).sum(1) / mask_f.sum(1).clamp(min=1e-9)
    mx = hidden.masked_fill(mask_f == 0, float("-inf")).max(1).values
    return torch.cat([mean, mx], dim=-1)


# Load encoder and tokenizer
encoder = AutoModel.from_pretrained("paperbd/neuraltxt-reward-tiny")
tokenizer = AutoTokenizer.from_pretrained("paperbd/neuraltxt-reward-tiny")

# Load trained head (~3KB, 768-dim input from mean+max concat)
head = nn.Sequential(nn.Dropout(0.2), nn.Linear(768, 1))
head.load_state_dict(torch.load("head_weights.pt", map_location="cpu",
 weights_only=True))


def score(reference, response):
    text = f"{reference} [SEP] {response}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    encoder.eval()
    head.eval()  # disable dropout in the head at inference
    with torch.no_grad():
        out = encoder(**enc)
        pooled = meanmax_pool(out.last_hidden_state, enc["attention_mask"])
        # The head has no sigmoid, so clamp the raw output to [0, 1].
        return max(0.0, min(1.0, head(pooled).item()))


print(score("Attention is all you need.", "You do not need attention."))
# → 0.75-0.85

Benchmark Scores

The model is compared against RewardBert (149M params) and Word F1 across four datasets:

Dataset	Description	Ours	RewardBert	Word F1
paperbd/paper_answers_reward	Paper instruction Q&A, n=1,780	0.68	0.44	—
sft_v2 100-pt eval	Narrow quality eval, n=100	0.70	0.44	0.81
Answer equivalence	Binary same-meaning, n=4,446	0.93	0.86	0.89
Confound resistance	Factually wrong, high overlap (low is better)	6% fooled	53%	78%

Values are Spearman correlation against a human judge, with two exceptions: answer equivalence is reported as ROC-AUC, and confound resistance is the percentage of factually wrong responses scored above 0.7.
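
The confound metric described above reduces to a simple threshold count. The following is a minimal sketch of that calculation; the function name and the example scores are hypothetical, not taken from the evaluation code.

```python
def fooled_rate(scores_for_wrong_answers, threshold=0.7):
    """Fraction of factually wrong responses that received a score above threshold."""
    if not scores_for_wrong_answers:
        raise ValueError("need at least one score")
    fooled = sum(1 for s in scores_for_wrong_answers if s > threshold)
    return fooled / len(scores_for_wrong_answers)


# Made-up scores for five factually wrong, high-overlap responses.
example_scores = [0.12, 0.35, 0.74, 0.41, 0.66]
print(f"{fooled_rate(example_scores):.0%} fooled")  # 20% fooled
```

Lower is better: a scorer that is never fooled assigns every wrong response a score at or below the threshold.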

One-Word Swap Detection (150 real test-split references)

How well the model detects a single word change that flips meaning, while ignoring meaning-preserving swaps:

Perturbation	Score Drop	Noticed
Antonym swap	+0.315	81%
Negation flip	+0.366	85%
Number swap	+0.315	76%
Random word (untrained)	+0.150	41%
Synonym swap (control)	-0.004	3% ✓

A swap counts as "noticed" when the score drops by more than 0.15 relative to the exact-match score for the same reference. The model is sensitive to which word changed, not just to the fact that one did: synonym swaps cause almost no drop (-0.004 on average), while meaning-flipping swaps are strongly penalized.
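
The "Score Drop" and "Noticed" columns can be computed from paired scores as sketched below. This assumes one exact-match score and one perturbed score per reference; the function name and the numbers are illustrative, not the actual test data.

```python
def swap_report(exact_match_scores, perturbed_scores, threshold=0.15):
    """Return (mean score drop, fraction of swaps noticed) for paired scores."""
    if not exact_match_scores or len(exact_match_scores) != len(perturbed_scores):
        raise ValueError("score lists must be non-empty and the same length")
    # Drop is measured against the exact-match score for the same reference.
    drops = [e - p for e, p in zip(exact_match_scores, perturbed_scores)]
    mean_drop = sum(drops) / len(drops)
    noticed = sum(1 for d in drops if d > threshold) / len(drops)
    return mean_drop, noticed


# Made-up scores for four references and their one-word perturbations.
exact = [0.90, 0.85, 0.80, 0.95]
perturbed = [0.50, 0.75, 0.40, 0.90]
mean_drop, noticed = swap_report(exact, perturbed)
print(f"mean drop {mean_drop:+.3f}, noticed {noticed:.0%}")
```

For a control such as synonym swaps, the desired result is the opposite: a mean drop near zero and a noticed rate near 0%.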

Architecture

Base: sentence-transformers/all-MiniLM-L6-v2 (22M params)
Head: Dropout(0.2) → Linear(768, 1), no sigmoid; the raw output is clamped to [0, 1] at inference
Pooling: mean+max concatenation (384 + 384 = 768-dim, preserves both semantic average and single-token signals)
Input: "{reference} [SEP] {response}"
Training: ~38K records, last 5 encoder layers unfrozen (8.87M trainable params), MSE loss
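
As a sanity check on the 8.87M trainable-parameter figure, the count can be reproduced by hand. This sketch assumes the standard BERT layer layout with hidden size 384 and feed-forward size 1536 (the published all-MiniLM-L6-v2 configuration), and assumes that the embeddings, the first encoder layer, and the pooler stay frozen; none of these details are confirmed in this card.

```python
HIDDEN = 384   # assumed hidden size of all-MiniLM-L6-v2
FFN = 1536     # assumed feed-forward (intermediate) size


def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layernorm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def bert_layer_params(hidden, ffn):
    attention = 4 * linear_params(hidden, hidden)  # query, key, value, output
    feed_forward = linear_params(hidden, ffn) + linear_params(ffn, hidden)
    norms = 2 * layernorm_params(hidden)           # after attention and after FFN
    return attention + feed_forward + norms


per_layer = bert_layer_params(HIDDEN, FFN)
head = linear_params(2 * HIDDEN, 1)  # Linear(768, 1) on mean+max pooled features
trainable = 5 * per_layer + head

print(f"{per_layer:,} per layer, {trainable:,} trainable")
# 1,774,464 per layer, 8,873,089 trainable
```

Under these assumptions the total rounds to 8.87M, which matches the figure above.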

Training Data

paperbd/paper_answers_reward — ~6K scored response pairs from research paper Q&A tasks, augmented with:

Confound training: Factually wrong responses with high word overlap (scored 3.0 on 1-5 scale)
Preference distillation: Worst-ranked responses from LLM judge comparisons (scored 1.0)
Contrastive minimal-edit augmentation: One-word synonym/filler swaps (keeps meaning, score 4.5) vs antonym/negation/number swaps (flips meaning, score 3.0) — teaches the model to attend to which word changed, not just edit distance
Cross-domain: 10K Feedback-Collection + 9K answer equivalence + 5.7K STS-B sentence similarity examples

Limitations

Best on research/Q&A text, weaker on creative/conversational
Requires a reference answer; not a standalone quality scorer
Underscores systematically (mean ~0.37 vs target ~0.50) — ranking is good, calibration may need bias correction
Some one-word factual swaps still go undetected (e.g., untrained vocabulary words noticed 41% of the time), though trained swap types (antonyms, negations, numbers) are caught 76-85% of the time
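
One possible form of the bias correction mentioned above is a linear recalibration fitted on a small held-out set with trusted target scores. This is a sketch under that assumption, not part of the library; the function names and the data are made up, chosen so the raw mean is 0.37 and the target mean is 0.50.

```python
def fit_linear_calibration(raw_scores, targets):
    """Least-squares fit of target ≈ a * raw + b."""
    n = len(raw_scores)
    if n < 2 or n != len(targets):
        raise ValueError("need at least two paired scores")
    mean_x = sum(raw_scores) / n
    mean_y = sum(targets) / n
    var_x = sum((x - mean_x) ** 2 for x in raw_scores)
    if var_x == 0:
        raise ValueError("raw scores must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw_scores, targets))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def calibrate(score, a, b):
    """Apply the fitted correction and keep the result in [0, 1]."""
    return max(0.0, min(1.0, a * score + b))


# Hypothetical held-out data: model scores and the scores they should have had.
raw = [0.10, 0.30, 0.40, 0.68]
targets = [0.23, 0.43, 0.53, 0.81]
a, b = fit_linear_calibration(raw, targets)
print(f"a={a:.2f}, b={b:.2f}, calibrated 0.37 -> {calibrate(0.37, a, b):.2f}")
```

As long as the fitted slope is positive, the correction preserves the ranking of responses (apart from ties created by clamping), so it only matters when the absolute reward value is used.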

License

MIT

URL: https://huggingface.co/paperbd/neuraltxt-reward-tiny