URL: https://huggingface.co/Jazhyc/modernbert-wildguardmix-classifier


🧩 ModernBERT-base Fine-tuned for Harmful Prompt Classification

A binary classifier fine-tuned on the WildGuardMix dataset to detect harmful or unsafe prompts.
Built on answerdotai/ModernBERT-base with flash attention for efficient inference.

🧠 Model Overview

  • Task: Harmful prompt detection (binary classification)
  • Labels:
    • 1 → Harmful / Unsafe
    • 0 → Safe / Non-harmful
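
The label mapping above can be applied to raw model outputs roughly as follows. This is a minimal sketch, not a documented loading recipe: it assumes the checkpoint loads through the standard `transformers` Auto classes and exposes a two-logit classification head, and the helper names (`softmax`, `decode`, `classify`) are hypothetical. The heavy imports sit inside `classify` so the decoding helpers can be used without downloading anything.

```python
import math

# Label mapping stated in the model card: 1 = harmful, 0 = safe.
ID2LABEL = {0: "safe", 1: "harmful"}
MODEL_ID = "Jazhyc/modernbert-wildguardmix-classifier"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Map a two-element logit vector to (label, probability of 'harmful')."""
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[predicted], probs[1]


def classify(prompt, max_length=256):
    """Score one prompt; downloads the checkpoint on first use.

    Assumption: the checkpoint works with the Auto classes and returns
    two logits per input. max_length mirrors the 256-token training limit.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()
    inputs = tokenizer(
        prompt, truncation=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        # Cast to float32 so BF16 weights still yield plain Python floats.
        logits = model(**inputs).logits[0].float().tolist()
    return decode(logits)
```

Calling `classify("some prompt")` returns a `(label, harmful_probability)` pair. Returning the probability alongside the label lets a caller apply a stricter or looser threshold than the default argmax.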

📊 Performance (Test Set)

Metric      Score
Accuracy    95.9%
F1 Score    96.21%
Precision   96.39%
Recall      96.21%
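
For reference, the four metrics can be computed from binary predictions as sketched below. The card does not state whether its scores refer to the harmful class only or are averaged across both classes; this sketch assumes the positive (harmful, label 1) class. The function name `binary_metrics` and the toy labels are illustrative and are not taken from the model's evaluation.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data for illustration only (1 = harmful, 0 = safe).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

In this setting recall is the share of harmful prompts that get flagged, and precision is the share of flagged prompts that are actually harmful.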

โš™๏ธ Training Details

  • Dataset: allenai/wildguardmix (wildguardtrain subset)
  • Split:
    • 80/20 train/test
    • 90/10 train/validation (from training set)
  • Stratified on: prompt harm label, adversarial flag, and subcategory
  • Optimizer: AdamW (8-bit)
  • Learning Rate: 1e-4 (cosine schedule, 10% warmup)
  • Batch Size: 96
  • Max Sequence Length: 256 tokens
  • Epochs: 3
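
Two of the settings above lend themselves to a small worked example: the nested split and the learning-rate schedule. The sketch below is not the author's training code. It assumes that the 10% warmup is a fraction of total optimizer steps, that warmup is linear from zero, and that the cosine phase decays to zero; the card confirms none of these. The function names and the example sizes are hypothetical.

```python
import math

PEAK_LR = 1e-4          # learning rate from the card
WARMUP_FRACTION = 0.10  # "10% warmup", assumed to mean 10% of total steps


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def split_sizes(n_examples, test_fraction=0.20, val_fraction=0.10):
    """Hold out the test set first, then carve validation from the rest."""
    n_test = round(n_examples * test_fraction)
    n_val = round((n_examples - n_test) * val_fraction)
    n_train = n_examples - n_test - n_val
    return n_train, n_val, n_test


# Illustrative sizes only; the real dataset size is not given in the card.
example_split = split_sizes(1000)
```

With the nested 80/20 and 90/10 splits, the effective proportions are 72% train, 8% validation, and 20% test.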

🎯 Intended Use

This model is designed for binary classification of text prompts as:

  • Harmful (1): unsafe or toxic content
  • Unharmful (0): safe or benign content

โš ๏ธ Disclaimer:
This model should not be deployed in production systems without additional evaluation and alignment with domain-specific safety and ethical guidelines.

Model files: Safetensors format, 0.1B parameters, BF16 tensors.
