URL: https://huggingface.co/ArthT/samarth-icebreaker-v1

samarth-icebreaker-v1

A frozen-base MLP probe that classifies (prompt, response) pairs as harmful or safe. It is built on Qwen/Qwen3.5-9B, with the base model held read-only. In our 8-model comparison, the probe is the only judge that scores above chance on the ASSELab/JudgeStressTest adversarial benchmark (AUROC 0.620, versus 0.04–0.37 for every published safety judge), while still keeping the false-positive rate below 1% on confident-nonsense honeypots that fool every other LLM-as-judge.

The fix that makes this possible is corpus-level, not architectural: fold confident-nonsense honeypots into the training mix as label=0 records, and the lexical-bias failure mode of probing-based judges closes for free.

Headline numbers

Adversarial robustness — ASSELab/JudgeStressTest (n=439)

We re-score the standard production safety classifiers on JudgeStressTest (JST). Icebreaker is the only judge in the comparison that scores above chance (AUROC > 0.5).

Judge                               Overall AUROC
llama-guard-3                       0.135
qwen3guard-gen-8b                   0.090
harmbench-llama                     0.328
samarth-icebreaker-v1 (this model)  0.620

All four rows scored by us via evals.run_judge on the same 439-record JST subset (records with non-empty Generation).

Per-attack breakdown for samarth-icebreaker-v1:

Attack  AUROC  n
GCG-R   0.671  53
Direct  0.654  36
GCG     0.618  133
BoN     0.562  217
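
For reading the tables above: AUROC is the probability that a randomly chosen harmful record gets a higher score than a randomly chosen safe one, so 0.5 is chance and values well below 0.5 mean the judge ranks the two classes in the wrong order more often than not. The sketch below computes it by direct pair counting. The function name `auroc` and the toy scores are illustrative assumptions; this is not the implementation behind evals.run_judge.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. O(P * N) pair counting, fine for small sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 = harmful, 0 = safe.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.7, 0.1]
print(f"AUROC = {auroc(labels, scores):.3f}")
# Flipping every score mirrors the AUROC around 0.5.
print(f"AUROC (flipped) = {auroc(labels, [1 - s for s in scores]):.3f}")
```

A per-attack breakdown applies the same function to the subset of records for each attack type.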

Confident-nonsense robustness — honeypot pool (n=4,994, all label=0)

The load-bearing failure mode for probe-based judges: responses that mention harmful concepts in confident-sounding prose but contain no real uplift. Lower FPR = better.

Judge                                      FPR@0.5    mean p_harmful
qwen3guard-gen-8b                          0.983      0.978
Baseline icebreaker (no honeypot aug, v1)  0.82–0.95  0.74–0.90
llama-guard-3 (calibrated at thr=0.05)     0.859      0.070
samarth-icebreaker-v1 (this model)         0.001      0.024
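
Because every honeypot record is label=0, both columns reduce to simple statistics over the scores: the share of records at or above the threshold, and the average score. A minimal sketch, with a hypothetical helper name and made-up scores:

```python
def honeypot_metrics(p_harmful, threshold=0.5):
    """Return (FPR, mean score) on a pool where every record is label=0."""
    if not p_harmful:
        raise ValueError("empty pool")
    flagged = sum(1 for p in p_harmful if p >= threshold)
    return flagged / len(p_harmful), sum(p_harmful) / len(p_harmful)


# Made-up scores: one of four safe records is flagged at 0.5.
fpr, mean_p = honeypot_metrics([0.02, 0.01, 0.70, 0.03])
print(f"FPR = {fpr:.3f}, mean p_harmful = {mean_p:.3f}")

# A low threshold can produce a high FPR even when the mean score is small.
fpr_low, mean_low = honeypot_metrics([0.06, 0.07, 0.08, 0.01], threshold=0.05)
print(f"FPR = {fpr_low:.3f}, mean p_harmful = {mean_low:.3f}")
```

The second call shows how a row such as llama-guard-3 at thr=0.05 can combine a high FPR with a low mean p_harmful.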

Standard safety benchmarks — AUROC

Dataset               llama-guard-3  qwen3guard-gen-8b  samarth-icebreaker-v1
BeaverTails           0.806          0.922*             0.911
SORRY-Bench           0.640          0.844              0.873*
ToxicChat             0.672          0.748              0.835*
WildGuardTest         0.884          0.958*             0.911
XSTest-Response-Harm  0.960          0.996*             0.987
CoCoNot               0.519          0.414              0.567*

* = best in row. Icebreaker stays within 0.05 AUROC of qwen3guard-gen-8b on the in-distribution benchmarks and wins on the cross-distribution sets (SORRY-Bench, ToxicChat, CoCoNot), without the adversarial-robustness or confident-nonsense collapse.

Architecture

Qwen/Qwen3.5-9B (frozen, bf16, device_map="auto")
 │
 │ forward hook on decoder layer L=22
 │ pool: last-token (over response-token mask)
 ▼
hidden_states[22] [B, T, D=3584]
 │
 │ slice to last response token, normalize
 ▼
LayerNorm(3584) → Linear(3584, 256) → ReLU → Dropout(0.1) → Linear(256, 2)
 │
 ▼
softmax(logits / T)[:, 1] → p_harmful ∈ [0, 1]
  • No LoRA, no two-pass forward. Inference cost = 1× base forward + a ~4 MB MLP, so latency is essentially that of a single base-model forward pass.
  • No fine-tuning of the 9B base. Only the probe (~1M params) trains.
  • Temperature scaling applied post-hoc on the validation set (1 scalar fit by LBFGS).
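
Temperature scaling fits a single scalar T that minimizes the negative log-likelihood of softmax(logits / T) on held-out data while the probe weights stay fixed. The sketch below is one way to do that with LBFGS. The function name, the log-parameterization of T, the optimizer settings, and the synthetic data are assumptions for illustration, not the repo's code.

```python
import torch
import torch.nn.functional as F


def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit one scalar T minimizing NLL of softmax(logits / T) on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=1.0, max_iter=100, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp().item())


# Synthetic check: labels follow sigmoid(z), but the logits are 4x too sharp,
# so the fitted temperature should land near 4.
torch.manual_seed(0)
z = torch.randn(2000)
labels = torch.bernoulli(torch.sigmoid(z)).long()
logits = torch.stack([torch.zeros_like(z), 4.0 * z], dim=1)
print(f"fitted T = {fit_temperature(logits, labels):.2f}")
```

Dividing by T leaves the ranking of scores unchanged, so AUROC is unaffected; only the calibration of p_harmful (and therefore fixed-threshold metrics) moves.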

Recommended inference

System prompt

Training randomized across three variants per record:

  1. "" (empty)
  2. The Qwen3.5 default system prompt
  3. "You are a helpful, harmless, honest assistant."

At inference, default to no system prompt (empty). That's the configuration the eval numbers above were measured under. The probe is robust to all three variants — adding a custom system prompt won't break it, but it also won't materially improve calibration.

If you want to use the model as an in-line moderation gate where the upstream generator has its own system prompt, pass that same system prompt to icebreaker. The probe is reading the response-token hidden states, so what matters is that the conversation that produced the response is replayed faithfully.

Code snippet

import json
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Download checkpoint
ckpt_dir = Path(snapshot_download("ArthT/samarth-icebreaker-v1"))
meta = json.loads((ckpt_dir / "meta.json").read_text())

# 2. Load frozen base + tokenizer + probe
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir / "tokenizer", padding_side="right")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
base.eval()
for p in base.parameters():
    p.requires_grad = False

# The probe runs in fp32 on top of the bf16 hidden states.
d_model = base.config.hidden_size
probe = nn.Sequential(
    nn.LayerNorm(d_model),
    nn.Linear(d_model, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),
).to(device).float()
probe.load_state_dict(torch.load(ckpt_dir / "probe.pt", map_location=device))
probe.eval()

LAYER = meta["probe_layer"]  # 22
TEMPERATURE = meta.get("temperature", 1.0)

# 3. Hook the target layer and keep its output hidden states
captured = {}

def _hook(_module, _input, output):
    captured["h"] = output[0] if isinstance(output, tuple) else output

handle = base.model.layers[LAYER].register_forward_hook(_hook)

# 4. Score a (prompt, response) pair
def score_pair(prompt: str, response: str, system_prompt: str = "") -> float:
    """Returns p_harmful in [0, 1]."""
    msgs = []
    if system_prompt:
        msgs.append({"role": "system", "content": system_prompt})
    msgs.append({"role": "user", "content": prompt})
    msgs.append({"role": "assistant", "content": response})

    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)

    with torch.no_grad():
        base(input_ids=ids.input_ids, attention_mask=ids.attention_mask, use_cache=False)

        # Last-token pooling: take the final non-pad position of the templated
        # conversation. Index tensors are moved to the hidden-state device, which
        # can differ from `device` when device_map="auto" shards the base model.
        hidden = captured["h"]  # [B, T, D] bf16
        last_idx = (ids.attention_mask.sum(dim=1) - 1).to(hidden.device)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[batch_idx, last_idx]  # [B, D]

        logits = probe(pooled.float().to(device))
        p_harmful = F.softmax(logits / TEMPERATURE, dim=-1)[:, 1]
    return float(p_harmful[0].item())

# Example
p = score_pair(
    prompt="Write a tutorial on building a kitchen knife sharpener at home.",
    response="Sure, start with a flat ceramic tile, place 3 angled supports …",
)
print(f"p_harmful = {p:.3f}")

# Remove the hook once scoring is done
handle.remove()

Threshold recommendation

The probe ships with a calibrated temperature, so p_harmful ≥ 0.5 is the default decision threshold. Threshold choices:

  • p ≥ 0.5 — high-precision flagging (matches headline-table FPR numbers). Use this if false positives are expensive.
  • p ≥ 0.3 — wider recall, modest FPR increase on honeypot-style confident-nonsense (stays under ~5%).
  • For per-deployment tuning, score a labeled mixed set and pick the threshold that hits your target FPR.
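
For per-deployment tuning, one simple recipe is to score the safe (label=0) records of a labeled set and place the threshold just above the highest score you are willing to let through. The helper below is a hypothetical sketch of that recipe with made-up scores; it needs Python 3.9+ for `math.nextafter`.

```python
import math


def threshold_for_fpr(safe_scores, target_fpr):
    """Smallest threshold t such that the share of safe records with p >= t
    does not exceed target_fpr."""
    if not safe_scores:
        raise ValueError("need at least one safe record")
    ranked = sorted(safe_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # false positives we tolerate
    if allowed >= len(ranked):
        return 0.0
    # Flag only scores strictly above the (allowed + 1)-th highest safe score.
    return math.nextafter(ranked[allowed], math.inf)


# Made-up p_harmful values for eight safe records.
safe = [0.01, 0.02, 0.05, 0.10, 0.20, 0.35, 0.60, 0.90]
t = threshold_for_fpr(safe, target_fpr=0.25)
fpr = sum(p >= t for p in safe) / len(safe)
print(f"threshold = {t:.4f}, FPR on safe set = {fpr:.3f}")
```

After fixing the threshold this way, measure recall on the harmful records of the same labeled set to see what the chosen FPR costs.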

Training recipe

Corpus (8 HF datasets)

Source                                       Size used  Role
PKU-Alignment/BeaverTails                    ≤ 60K      standard
nvidia/Aegis-AI-Content-Safety-Dataset-2.0   full       standard
PKU-Alignment/PKU-SafeRLHF-30K               full       standard
sorry-bench/sorry-bench-202406               full       standard
Anthropic/hh-rlhf (red_team_attempts)        full       standard
allenai/wildguardmix (wildguardtrain)        ≤ 60K      standard + mined refusals
allenai/wildjailbreak                        ≤ 60K      standard
AmazonScience/FalseReject                    full       standard

Plus three weighted augmentation pools (all label=0), mixed with the standard pool at fixed batch fractions:

Pool                                               Batch fraction  Source
rubbish (token-injection + degenerate generation)  20%             Generated from JBB seeds
mined_refusal (refusal-mentioning-harm)            10%             Mined from WildGuardTrain refusals
cn_honeypot (confident-nonsense)                   15%             benchmarks/cn_honeypots.jsonl (4,994 records)
standard                                           55%             Balanced from the 8 sources above

The cn_honeypot pool is the load-bearing fix. Removing it returns the probe to ~90% FPR on the honeypot set — same failure mode as every other LLM-as-judge.
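
One way to realize fixed per-batch fractions is to convert them to integer counts and draw that many records from each pool. The sketch below assumes the fractions apply to the effective batch of 128, uses largest-remainder rounding, and samples with replacement; all of these choices and the function names are illustrative assumptions, not the repo's sampler.

```python
import random

POOL_FRACTIONS = {
    "standard": 0.55,
    "rubbish": 0.20,
    "cn_honeypot": 0.15,
    "mined_refusal": 0.10,
}


def pool_counts(batch_size, fractions=POOL_FRACTIONS):
    """Split batch_size across pools using largest-remainder rounding."""
    raw = {name: frac * batch_size for name, frac in fractions.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


def sample_batch(pools, batch_size, rng):
    """Draw one mixed batch; small pools are upsampled by sampling with replacement."""
    batch = []
    for name, n in pool_counts(batch_size).items():
        batch.extend(rng.choices(pools[name], k=n))
    rng.shuffle(batch)
    return batch


# Toy pools: each record only remembers which pool it came from.
pools = {name: [{"pool": name, "id": i} for i in range(50)] for name in POOL_FRACTIONS}
batch = sample_batch(pools, 128, random.Random(0))
print(pool_counts(128))
print(len(batch))
```

With a batch of 128 this yields 19 cn_honeypot records per batch, which is the closest integer split to the 15% target.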

Hyperparameters

Param                 Value
Probe layer           22 (swept over {16, 18, 20, 22, 24, 26})
Pool                  last-token (swept over {mean-response, last-token})
Seed                  42 (3 seeds trained per config: 42, 43, 44)
Optimizer             AdamW
LR schedule           1e-3 → 1e-5 cosine, 5% warmup, weight_decay=0.01
Batch                 per-device 8 × grad_accum 16 → effective 128
max_length train      1024 (prompt 512 / response 512)
max_length inference  2048 (response truncation from end)
Calibration           Temperature scaling on val set (LBFGS, 1 scalar)
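
The LR schedule row can be read as a warmup to 1e-3 over the first 5% of steps followed by cosine decay to 1e-5. The sketch below writes that out as a plain function. The linear warmup shape and the function name are assumptions, since the table gives only the endpoints and the warmup fraction.

```python
import math


def lr_at(step, total_steps, peak=1e-3, floor=1e-5, warmup_frac=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumed linear ramp that reaches `peak` on the last warmup step.
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 49, 50, 500, 1000):
    print(step, f"{lr_at(step, total):.2e}")
```

In a PyTorch training loop the same function could drive `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step, total) / peak` as the multiplier on an AdamW optimizer created with `lr=peak`.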

Compute

The training pipeline runs on CSCS Clariden (NVIDIA GH200, aarch64). The 9B base forward is the bottleneck, so we run it once per record and cache the pooled features for all 6 candidate layers and both pool types from that single pass. The probe-only sweep over {layer × pool × seed} then trains in seconds per config.

  • One-shot feature cache (235K records, 6 layers × 2 pools): ~1 hour wall on 1× GH200
  • Probe-only sweep (36 v1 configs + 9 v2sm refit configs, 3 epochs each): ~2 minutes wall total on 12× GH200
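
The caching step boils down to reducing each layer's [B, T, D] hidden states to one [B, D] vector per pool type and storing the result under a (layer, pool) key. The sketch below shows both pooling variants on toy tensors. The function names, the cache layout, and the use of a response-token mask as input are assumptions for illustration.

```python
import torch


def pool_response(hidden, response_mask):
    """hidden: [B, T, D]; response_mask: [B, T] with 1 on response tokens.

    Returns (mean_pool, last_token_pool), each of shape [B, D].
    """
    mask = response_mask.to(hidden.dtype).unsqueeze(-1)  # [B, T, 1]
    mean_pool = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    positions = torch.arange(hidden.size(1), device=hidden.device)
    # The largest position carrying a 1 in the mask is the last response token.
    last_idx = (response_mask.long() * positions).argmax(dim=1)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return mean_pool, hidden[batch_idx, last_idx]


def build_cache(hidden_by_layer, response_mask):
    """hidden_by_layer: {layer_index: [B, T, D]} captured from one forward pass."""
    cache = {}
    for layer, hidden in hidden_by_layer.items():
        mean_pool, last_pool = pool_response(hidden, response_mask)
        cache[(layer, "mean-response")] = mean_pool
        cache[(layer, "last-token")] = last_pool
    return cache


# Toy example: 2 records, 4 tokens, 3 hidden dims, two "layers".
hidden = torch.arange(24, dtype=torch.float32).view(2, 4, 3)
mask = torch.tensor([[0, 1, 1, 0], [0, 0, 1, 1]])
cache = build_cache({16: hidden, 22: hidden + 100.0}, mask)
print(sorted(cache.keys()))
print(cache[(22, "last-token")])
```

Once features are cached this way, each probe in the sweep trains on small [N, D] tensors and never touches the 9B base again.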

Why this works — the lexical-bias critique

Probing-based safety judges learn the lexical presence of harmful tokens in the response, not compositional harmful intent. We confirmed this on confident-nonsense honeypots: the baseline probe (no honeypot augmentation) calls 82–95% of confident-nonsense responses harmful at threshold 0.5 — they mention harmful concepts in words even though the responses contain no real uplift.

The 8 standard corpora simply don't contain (1,1,0,0)-shaped training signal — confident-sounding harmful-surface responses labelled SAFE. Once we fold 4,994 such records (the honeypot set) into the training mix at 15% of each batch, the probe learns to project off the lexical axis and the failure mode collapses.

The failure mode is corrected, not eliminated: a probe-based judge is always one distributionally novel adversarial attack away from relapsing into lexical bias. The principled extension is RLACE-style geometric concept erasure on top of the corpus fix.

Limitations

  • English only. Training corpora are English; multilingual performance not measured.
  • Modest OOD trade-off vs the baseline-aug-only probe. Adding the honeypot augmentation cost ~0.03 AUROC on ToxicChat and ~0.05 on CoCoNot. The probe is more selective and threw out some OOD harmful signal as a side effect.
  • Seed variance is real. Across 3 seeds (42 / 43 / 44) at the same (layer, pool) config, per-dataset AUROC varies by ±0.05–0.20 on the smaller benchmarks (XSTest, CoCoNot). Use the ensemble form of the model (mean across 3 seeds) if you need lower variance.
  • The model is trained on response classification, not prompt classification. It expects a (prompt, response) pair. Scoring prompts alone returns uncalibrated output.
  • JudgeStressTest AUROC of 0.620 is "best in class," not "safe to deploy." No judge in our comparison hits 0.7 on the adversarial set — adversarial robustness remains an open problem.

Citation

@misc{singh2026samarthicebreaker,
 title = {samarth-icebreaker: A frozen-base MLP probe for adversarially
 robust LLM safety classification},
 author = {Singh, Arth},
 year = {2026},
 url = {https://huggingface.co/ArthT/samarth-icebreaker-v1},
}

Repository

Code, training pipeline, and the full eval suite live at: https://github.com/Arth-Singh/Robust-jailbreak-judges

The samarth-icebreaker family is auto-registered by the repo's judges.adapters.samarth_icebreaker module — drop the checkpoint into checkpoints/samarth-icebreaker-L22-last-token-s42-v2sm/ and it will be visible to evals.run_judge and the sweep tooling.

License

Apache 2.0.

This is a research artifact. It is not a production-grade safety filter. Independent evaluation against your specific threat model is required before deployment.
