🧩 ModernBERT-base Fine-tuned for Harmful Prompt Classification
A binary classifier fine-tuned on the WildGuardMix dataset to detect harmful or unsafe prompts.
Built on answerdotai/ModernBERT-base with flash attention for efficient inference.
🧠 Model Overview
- Task: Harmful prompt detection (binary classification)
- Labels:
  - 1 → Harmful / Unsafe
  - 0 → Safe / Non-harmful
📊 Performance (Test Set)
| Metric | Score |
|---|---|
| Accuracy | 95.9% |
| F1 Score | 96.21% |
| Precision | 96.39% |
| Recall | 96.21% |
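For reference, the sketch below shows how these four metrics are derived from predictions in the binary case. It assumes the harmful class (1) is the positive class; the card does not state which averaging mode was used for the reported scores, so this illustrates the definitions rather than reproducing the evaluation. The function name `binary_metrics` is illustrative.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For a safety classifier, recall on the harmful class measures how many unsafe prompts are caught, while precision measures how many flagged prompts were actually unsafe.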
⚙️ Training Details
- Dataset: allenai/wildguardmix (wildguardtrain subset)
- Split:
  - 80/20 train/test
  - 90/10 train/validation (from training set)
  - Stratified on: prompt harm label, adversarial flag, and subcategory
- Optimizer: AdamW (8-bit)
- Learning Rate: 1e-4 (cosine schedule, 10% warmup)
- Batch Size: 96
- Max Sequence Length: 256 tokens
- Epochs: 3
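Two items in the list above can be made concrete with a short sketch: the nested stratified split and the warmup-plus-cosine learning rate schedule. This is not the training script. The column names `prompt_harm_label`, `adversarial`, and `subcategory` are assumed from the dataset schema, the helper names are illustrative, and the schedule assumes linear warmup from zero followed by cosine decay to zero, which the card does not spell out.

```python
import math
import random
from collections import defaultdict

# Assumed column names for the three stratification fields.
STRATA = ("prompt_harm_label", "adversarial", "subcategory")


def stratified_split(rows, held_out_fraction, keys=STRATA, seed=0):
    """Split rows into (kept, held_out), sampling each stratum separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    kept, held_out = [], []
    for stratum in sorted(groups, key=repr):  # fixed order for reproducibility
        members = groups[stratum]
        rng.shuffle(members)
        # Very small strata can round to zero held-out rows.
        n_held = round(len(members) * held_out_fraction)
        held_out.extend(members[:n_held])
        kept.extend(members[n_held:])
    return kept, held_out


def three_way_split(rows, seed=0):
    """80/20 train/test, then 90/10 train/validation from the training part."""
    train_val, test = stratified_split(rows, 0.20, seed=seed)
    train, val = stratified_split(train_val, 0.10, seed=seed)
    return train, val, test


def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.10):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With scikit-learn available, the same split is usually done with `train_test_split` and its `stratify` argument on a combined key; the standard-library version here only keeps the sketch dependency-free.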
🎯 Intended Use
This model is designed for binary classification of text prompts as:
- Harmful (1): unsafe or toxic content
- Unharmful (0): safe or benign content
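A minimal inference sketch follows. It assumes the checkpoint published as Jazhyc/modernbert-wildguardmix-classifier loads through the standard `AutoModelForSequenceClassification` API with a two-logit head, and that a `transformers` release with ModernBERT support is installed. The label strings in `ID2LABEL` follow this card's wording and are an assumption; the checkpoint's own `config.id2label` may differ. The helper names `decode` and `classify` are illustrative.

```python
from typing import Dict, List, Sequence

MODEL_ID = "Jazhyc/modernbert-wildguardmix-classifier"
MAX_LENGTH = 256  # matches the max sequence length used in training

# Assumed mapping, taken from the label description in this card.
ID2LABEL: Dict[int, str] = {0: "unharmful", 1: "harmful"}


def decode(logit_rows: Sequence[Sequence[float]]) -> List[str]:
    """Map each row of class logits to a label name via argmax."""
    return [ID2LABEL[max(range(len(row)), key=row.__getitem__)] for row in logit_rows]


def classify(prompts: Sequence[str]) -> List[str]:
    """Download the model (on first use) and label each prompt."""
    # Imported lazily so that `decode` works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(
        list(prompts),
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    return decode(logits.tolist())
```

Calling `classify(["How do I bake bread?"])` returns one label per prompt. Prompts longer than 256 tokens are truncated, so content past that point does not influence the prediction.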
⚠️ Disclaimer:
This model should not be deployed in production systems without additional evaluation and alignment with domain-specific safety and ethical guidelines.
Model size: 0.1B params (Safetensors)
Tensor type: BF16
Model tree for Jazhyc/modernbert-wildguardmix-classifier
Base model: answerdotai/ModernBERT-base