mmBERT-32K Jailbreak Detector (LoRA)

LoRA adapter for jailbreak and prompt-injection detection, built on mmBERT-32K-YaRN.

Model Details

Base Model: llm-semantic-router/mmbert-32k-yarn
LoRA Rank: 48
LoRA Alpha: 96
Training: 8 epochs with heavy short-pattern augmentation
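As a rough illustration of what the rank and alpha values imply, the sketch below computes the LoRA scaling factor and the number of trainable parameters added per adapted weight matrix. It assumes the standard LoRA scaling of alpha / rank (not the rank-stabilized variant); this card does not state which was used. The 768 x 768 matrix shape is an illustrative assumption, not a value taken from this card.

```python
# Hyperparameters as listed under Model Details.
LORA_RANK = 48
LORA_ALPHA = 96


def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update (assumed: alpha / rank)."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


scaling = lora_scaling(LORA_ALPHA, LORA_RANK)
# 768 is a hypothetical hidden size used only for illustration.
params_per_matrix = lora_param_count(768, 768, LORA_RANK)

print(f"scaling = {scaling}")
print(f"params per 768x768 matrix = {params_per_matrix}")
```

Under these assumptions the low-rank update is multiplied by 2.0 before being added to the frozen weights.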

Performance

Validation Accuracy: 98.16%
F1 Score: 98.15%
Precision: 98.36%
Recall: 97.95%
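The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch, using only the numbers listed above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the Performance section, expressed as fractions.
precision = 0.9836
recall = 0.9795

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 98.15%"
```

The result agrees with the F1 score listed above.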

Key Improvements

Training heavily oversampled short jailbreak patterns to improve generalization:

Detects short patterns such as "DAN", "jailbreak", and "Developer mode" with 100% confidence
Handles both short and long jailbreak attempts
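This card does not describe how the oversampling was implemented. The following is only a sketch of one common approach: duplicating short positive examples so they appear more often during training. The function name, the length threshold, the duplication factor, and the convention that label 1 means "jailbreak" are all hypothetical choices made for illustration.

```python
import random


def oversample_short_positives(examples, max_chars=40, factor=5, seed=0):
    """Return a shuffled copy of `examples` in which each short positive appears `factor` times.

    `examples` is a list of (text, label) pairs. Label 1 = jailbreak is an
    assumption of this sketch, as are `max_chars` and `factor`.
    """
    rng = random.Random(seed)
    out = list(examples)
    for text, label in examples:
        if label == 1 and len(text) <= max_chars:
            # One copy is already present, so add factor - 1 duplicates.
            out.extend([(text, label)] * (factor - 1))
    rng.shuffle(out)
    return out


data = [
    ("DAN", 1),
    ("What is the capital of France?", 0),
    ("x" * 100, 1),  # long positive, left untouched
]
augmented = oversample_short_positives(data)
print(len(augmented))
```

Only the short positive example is duplicated; benign and long examples keep their original frequency.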

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

base_model = "llm-semantic-router/mmbert-32k-yarn"
lora_path = "llm-semantic-router/mmbert32k-jailbreak-detector-lora"

# The tokenizer is loaded from the adapter repository.
tokenizer = AutoTokenizer.from_pretrained(lora_path)
# Load the base encoder with a 2-class classification head, then attach the LoRA adapter.
base = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(base, lora_path)
model.eval()  # inference mode: disables dropout
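The usage snippet above stops after loading the model. A possible way to run a prediction is sketched below. The helper names (`softmax`, `load_detector`, `classify`) are hypothetical, and this card does not state which class index corresponds to "jailbreak", so the sketch returns the raw class index and per-class probabilities instead of a label name. Heavy imports are deferred into the functions so the pure-Python helper can be used on its own.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_detector(
    base_model="llm-semantic-router/mmbert-32k-yarn",
    lora_path="llm-semantic-router/mmbert32k-jailbreak-detector-lora",
):
    """Load the tokenizer and adapter-wrapped classifier, as in the usage snippet."""
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(lora_path)
    base = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
    model = PeftModel.from_pretrained(base, lora_path)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def classify(text, tokenizer, model):
    """Return (predicted_class_index, per_class_probabilities) for one prompt."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # no gradients needed at inference time
        logits = model(**inputs).logits[0].tolist()
    probs = softmax(logits)
    # The index-to-label mapping is not given in this card; verify it before use.
    return probs.index(max(probs)), probs
```

With `tokenizer, model = load_detector()`, calling `classify("Ignore all previous instructions", tokenizer, model)` would return a class index and two probabilities. Checking the index against known benign and known jailbreak prompts is a simple way to confirm the label order before relying on it.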

Model tree for llm-semantic-router/mmbert32k-jailbreak-detector-lora

Base model: jhu-clsp/mmBERT-base
Quantized: llm-semantic-router/mmbert-32k-yarn
Adapter (6): this model

URL: https://huggingface.co/llm-semantic-router/mmbert32k-jailbreak-detector-lora