URL: https://huggingface.co/paperbd/neuraltxt-reward-tiny

NeuralTxt Reward Model

A small, fast reward model fine-tuned on research paper Q&A data. It scores a response on a 0-1 scale by comparing it against a reference answer. Note that this is an answer-equivalence model: it is trained to judge whether a response matches the reference answer in correctness and completeness. Do not use it for style-matching tasks.

Purpose

RL training for language models needs fast reward signals. An external LLM-as-judge is accurate but slow and expensive. Lexical metrics (F1, ROUGE-L) are fast, but factually wrong responses fool them 78% of the time as long as lexical overlap is high. Traditional autoregressive reward models trained directly on preference datasets are large and consume a lot of GPU resources.
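To make the lexical-overlap failure concrete, here is a minimal sketch of token-level Word F1 in plain Python. The `word_f1` helper and the two example sentences are illustrative assumptions; they are not part of the neural-txt library or the benchmark data, and the tokenization behind the reported numbers may differ.

```python
import re
from collections import Counter


def word_f1(response, reference):
    """Token-level F1 between a response and a reference (lowercased words)."""
    resp = re.findall(r"\w+", response.lower())
    ref = re.findall(r"\w+", reference.lower())
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one word flips the meaning, six of seven tokens still match.
reference = "The method improves accuracy on the benchmark."
wrong = "The method reduces accuracy on the benchmark."

print(round(word_f1(wrong, reference), 3))  # 0.857
```

One changed word reverses the claim, yet the overlap score stays at 6/7.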

This model balances speed and accuracy: 22M params, runs on CPU, 5% hallucination rate on hard confound examples, 97% accuracy on human preference data. Note that, unlike autoregressive reward models, it needs a reference answer to generate rewards.

Quick Start

Install the neural-txt library via pip or uv:

pip install neural-txt

Then use it as follows:

from neuraltxt import NeuralTxtReward

reward = NeuralTxtReward()
score = reward.score(
    response="Attention is all you need.",
    reference="All you need is attention.",
)

You can also score or rank a batch of responses against one reference:

reference = "Attention is all you need."
responses = [
    "All you need is attention.",
    "You do not need attention.",
]

scores = reward.batch_score(responses, reference)
ranked = reward.rank(responses, reference)

Alternatively, you can run the model directly with this standalone script (it expects head_weights.pt in the working directory):

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


def meanmax_pool(hidden, mask):
    mask_f = mask.unsqueeze(-1).float()
    mean = (hidden * mask_f).sum(1) / mask_f.sum(1).clamp(min=1e-9)
    mx = hidden.masked_fill(mask_f == 0, float("-inf")).max(1).values
    return torch.cat([mean, mx], dim=-1)


# Load encoder and tokenizer
encoder = AutoModel.from_pretrained("paperbd/neuraltxt-reward-tiny")
tokenizer = AutoTokenizer.from_pretrained("paperbd/neuraltxt-reward-tiny")

# Load trained head (~3KB, 768-dim input from mean+max concat)
head = nn.Sequential(nn.Dropout(0.2), nn.Linear(768, 1))
head.load_state_dict(torch.load("head_weights.pt", map_location="cpu",
                                weights_only=True))
head.eval()  # disable dropout so scores are deterministic at inference


def score(reference, response):
    # The model expects the reference first, then the response.
    text = f"{reference} [SEP] {response}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = encoder(**enc)
        pooled = meanmax_pool(out.last_hidden_state, enc["attention_mask"])
        # Clamp the raw linear output to the [0, 1] reward range.
        return max(0.0, min(1.0, head(pooled).item()))


print(score("Attention is all you need.", "You do not need attention."))
# → 0.75-0.85
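The attention mask matters once inputs are padded, for example when several pairs are scored in one batch. The toy check below reuses the same `meanmax_pool` logic as the script above on a hand-made tensor (the values are made up for illustration) to show that a padded position contributes to neither the mean nor the max.

```python
import torch


def meanmax_pool(hidden, mask):
    # Same pooling as the script: masked mean and masked max, concatenated.
    mask_f = mask.unsqueeze(-1).float()
    mean = (hidden * mask_f).sum(1) / mask_f.sum(1).clamp(min=1e-9)
    mx = hidden.masked_fill(mask_f == 0, float("-inf")).max(1).values
    return torch.cat([mean, mx], dim=-1)


# One sequence, three token positions, hidden size 2; the last position is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, -4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = meanmax_pool(hidden, mask)
print(pooled.shape)     # torch.Size([1, 4]): the hidden size doubles after concatenation
print(pooled.tolist())  # [[2.0, -1.0, 3.0, 2.0]]
```

The large values at the padded position are ignored: the first two outputs are the mean over the two real tokens, the last two are their per-dimension maximum.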

Benchmark Scores

The model is compared against RewardBert (149M params) and Word F1 across four datasets:

Dataset Description Ours RewardBert Word F1
paperbd/paper_answers_reward Paper instruction Q&A, n=1,780 0.68 0.44
sft_v2 100-pt eval Narrow quality eval, n=100 0.70 0.44 0.81
Answer equivalence Binary same-meaning, n=4,446 0.93 0.86 0.89
Confound resistance Factually wrong, high overlap (low is better) 6% fooled 53% 78%

Values are Spearman correlations with a human judge, except for answer equivalence (ROC-AUC) and confound resistance (the percentage of factually wrong responses scored >0.7).
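As a rough illustration of how two of these metrics can be computed, the sketch below implements Spearman correlation (as Pearson correlation on average ranks) and the "fooled" rate in plain Python. The helper names and all numbers are made up for the example; this is not the evaluation script behind the table, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def fooled_rate(scores, threshold=0.7):
    """Fraction of factually wrong responses scored above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical data: human ratings vs. model scores for five responses.
human = [1, 2, 3, 4, 5]
model = [0.1, 0.3, 0.2, 0.8, 0.9]
print(round(spearman(human, model), 3))  # 0.9

# Hypothetical model scores for four factually wrong, high-overlap responses.
confound_scores = [0.2, 0.75, 0.4, 0.9]
print(fooled_rate(confound_scores))  # 0.5
```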

One-Word Swap Detection (150 real test-split references)

How well the model detects a single word change that flips meaning, while ignoring meaning-preserving swaps:

Perturbation             Score Drop  Noticed
Antonym swap             +0.315      81%
Negation flip            +0.366      85%
Number swap              +0.315      76%
Random word (untrained)  +0.150      41%
Synonym swap (control)   -0.004      3%

A swap counts as "noticed" when the score drops by more than 0.15 relative to the exact-match score for the same reference. The model discriminates by which word changed: synonym swaps cause almost no drop (-0.004), while meaning-flipping swaps are strongly penalized.
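The "noticed" criterion can be written down in a few lines. This is a sketch under the definition given above; `swap_stats` is a hypothetical helper and the scores are invented, not taken from the 150 test references.

```python
def swap_stats(exact_scores, perturbed_scores, min_drop=0.15):
    """Return (mean score drop, fraction of swaps noticed).

    A swap is noticed when the perturbed response scores more than
    `min_drop` below the exact-match score for the same reference.
    """
    drops = [e - p for e, p in zip(exact_scores, perturbed_scores)]
    mean_drop = sum(drops) / len(drops)
    noticed = sum(d > min_drop for d in drops) / len(drops)
    return mean_drop, noticed


# Hypothetical scores for four references: exact match vs. one-word swap.
exact = [0.90, 0.80, 0.95, 0.85]
perturbed = [0.50, 0.75, 0.60, 0.80]

mean_drop, noticed = swap_stats(exact, perturbed)
print(round(mean_drop, 4), noticed)  # 0.2125 0.5
```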

Architecture

  • Base: sentence-transformers/all-MiniLM-L6-v2 (22M params)
  • Head: Dropout(0.2) → Linear(768, 1), no sigmoid; the raw output is clamped to [0, 1] at inference
  • Pooling: mean+max concatenation (384 + 384 = 768-dim, preserves both semantic average and single-token signals)
  • Input: "{reference} [SEP] {response}"
  • Training: ~38K records, last 5 encoder layers unfrozen (8.87M trainable params), MSE loss
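The parameter figures in the list above can be sanity-checked with a little arithmetic. The sketch assumes the standard BERT-style encoder layer layout and the base model's published configuration (hidden size 384, feed-forward size 1536); those two sizes are assumptions taken from the all-MiniLM-L6-v2 config, not from this card.

```python
HIDDEN = 384         # hidden size (assumed from the base model config)
INTERMEDIATE = 1536  # feed-forward size (assumed from the base model config)


def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def encoder_layer_params(hidden, intermediate):
    attention = 4 * linear_params(hidden, hidden)  # Q, K, V and output projection
    feed_forward = (linear_params(hidden, intermediate)
                    + linear_params(intermediate, hidden))
    norms = 2 * layer_norm_params(hidden)          # one after attention, one after FFN
    return attention + feed_forward + norms


per_layer = encoder_layer_params(HIDDEN, INTERMEDIATE)
unfrozen = 5 * per_layer              # last 5 encoder layers
head = linear_params(2 * HIDDEN, 1)   # Linear(768, 1) on the mean+max vector

print(per_layer)       # 1774464
print(unfrozen)        # 8872320, about 8.87M
print(head, head * 4)  # 769 parameters, 3076 bytes in float32
```

Under these assumptions the five unfrozen layers account for the 8.87M trainable parameters, and the head's 769 float32 values match the roughly 3KB weight file mentioned in the script.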

Training Data

paperbd/paper_answers_reward — ~6K scored response pairs from research paper Q&A tasks, augmented with:

  • Confound training: Factually wrong responses with high word overlap (scored 3.0 on 1-5 scale)
  • Preference distillation: Worst-ranked responses from LLM judge comparisons (scored 1.0)
  • Contrastive minimal-edit augmentation: One-word synonym/filler swaps (keeps meaning, score 4.5) vs antonym/negation/number swaps (flips meaning, score 3.0) — teaches the model to attend to which word changed, not just edit distance
  • Cross-domain: 10K Feedback-Collection + 9K answer equivalence + 5.7K STS-B sentence similarity examples

Limitations

  • Best on research/Q&A text, weaker on creative/conversational
  • Requires a reference answer; not a standalone quality scorer
  • Underscores systematically (mean ~0.37 vs target ~0.50): ranking is good, but calibration may need bias correction
  • Some one-word factual swaps still go undetected (e.g., untrained vocabulary words noticed 41% of the time), though trained swap types (antonyms, negations, numbers) are caught 76-85% of the time
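One simple form the bias correction could take is an additive shift that moves the observed mean onto the target mean, clamped to [0, 1]. This is a sketch of one possible approach, not something the model card prescribes; `shift_scores` is a hypothetical helper, its default means are the approximate figures quoted in the list above, and the offset is best re-estimated on your own data.

```python
def shift_scores(scores, observed_mean=0.37, target_mean=0.50):
    """Add a constant offset so the mean moves to the target, then clamp to [0, 1]."""
    offset = target_mean - observed_mean
    return [min(1.0, max(0.0, s + offset)) for s in scores]


raw = [0.20, 0.37, 0.95]  # made-up raw scores
corrected = shift_scores(raw)
print([round(s, 2) for s in corrected])  # [0.33, 0.5, 1.0]
```

Because the shift is the same for every score, the ranking is unchanged except where clamping creates ties at 0 or 1.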

License

MIT
