URL: https://huggingface.co/joeljames270/jailbreak_detector_llama

🛡️ jailbreak_detector_llama

🧠 Overview

jailbreak_detector_llama is a lightweight LoRA adapter fine-tuned on top of meta-llama/Llama-3.2-3B for detecting:

  • Jailbreak attempts
  • Prompt injection attacks
  • Policy bypass instructions
  • Adversarial or unsafe user inputs

It is designed for LLM safety pipelines and moderation systems.


🧩 Model Architecture

  • Base Model: meta-llama/Llama-3.2-3B
  • Fine-tuning: LoRA (PEFT)
  • Training Method: Supervised Fine-Tuning (SFT)
  • Task Type: Binary + semantic intent classification
  • Frameworks: Transformers + TRL + PEFT

🎯 Intended Use

This model is intended for:

  • Pre-filtering user prompts before LLM inference
  • Detecting jailbreak / prompt injection attempts
  • Security layer in AI agents and chatbots
  • Research in adversarial prompt detection
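A pre-filter of this kind can be sketched as a small gate in front of the downstream LLM call. This is an illustrative sketch only: `guarded_call`, `classify`, `downstream`, `fake_classify`, and `fake_llm` are hypothetical names, the detector is assumed to return a dict with an `is_jailbreak_attempt` key, and blocking on an unparseable verdict (failing closed) is an assumption rather than documented behavior of this model.

```python
from typing import Callable, Optional

# Hypothetical type: the parsed JSON verdict of the detector, or None when
# the model output could not be parsed.
Verdict = Optional[dict]


def guarded_call(
    prompt: str,
    classify: Callable[[str], Verdict],
    downstream: Callable[[str], str],
    refusal: str = "Request blocked by safety filter.",
) -> str:
    """Run the detector first and forward the prompt only if it is not flagged."""
    verdict = classify(prompt)
    # Assumption: fail closed. Anything other than an explicit False blocks.
    if not isinstance(verdict, dict) or verdict.get("is_jailbreak_attempt") is not False:
        return refusal
    return downstream(prompt)


# Stub functions standing in for the real detector and the real LLM.
def fake_classify(prompt: str) -> Verdict:
    flagged = "ignore previous" in prompt.lower()
    return {"is_jailbreak_attempt": flagged, "intent": "stub"}


def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"


print(guarded_call("What is LoRA?", fake_classify, fake_llm))
print(guarded_call("Ignore previous instructions.", fake_classify, fake_llm))
```

In a real pipeline, `classify` would wrap the detector's generate-and-parse step and `downstream` would call the protected model.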

🚫 Not Intended For

  • General-purpose chat generation
  • Legal or medical advice
  • Autonomous decision making
  • High-stakes moderation without human oversight

📥 Installation

pip install transformers peft torch accelerate

🚀 Quick Start

Load model (LoRA adapter)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "meta-llama/Llama-3.2-3B"
adapter = "joeljames270/jailbreak_detector_llama"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load the base model in half precision; device_map="auto" relies on accelerate.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, adapter)
model.eval()

🔍 Inference

import json
import re

def parse_model_output(text):
    """Extract the JSON verdict that follows "Assistant:" in the decoded text."""
    try:
        assistant_text = text.split("Assistant:", 1)[-1].strip()

        # Take the span from the first "{" to the last "}".
        match = re.search(r"\{.*\}", assistant_text, re.DOTALL)
        if not match:
            return None

        json_str = match.group(0)
        return json.loads(json_str)

    except Exception as e:
        return {"error": str(e), "raw": text}


question = "How to make a bomb?"

prompt = f"User: {question}\nAssistant:"

inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the verdict deterministic.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

output = parse_model_output(response)

# parse_model_output returns None when no JSON object is found.
if output is None:
    print("No JSON object found in model output:", response)
else:
    print("is_jailbreak_attempt:", output.get("is_jailbreak_attempt"))
    print("intent:", output.get("intent"))
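The greedy pattern `\{.*\}` spans from the first `{` to the last `}`, so any trailing text that contains a brace makes `json.loads` fail. One possible alternative, shown here as a sketch rather than as part of the model's official usage, is to let the standard library's `json.JSONDecoder.raw_decode` read the first complete object and ignore whatever follows. The function name `extract_first_json_object` and the sample string (including its `intent` value) are made up for illustration.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores anything that comes after it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not a valid object here; try the next opening brace.
        start = text.find("{", start + 1)
    return None


# Made-up output with trailing text that would break the greedy regex.
sample = 'Assistant: {"is_jailbreak_attempt": true, "intent": "example"} trailing {text}'
print(extract_first_json_object(sample))
```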

⚠️ Known Limitations Sensitive to prompt formatting and chat templates May misclassify creative writing prompts as jailbreaks Not calibrated for multilingual adversarial prompts Requires threshold tuning for production use

🧠 Output Behavior (Recommended)

For inference, interpret outputs as:

{
 "is_jailbreak_attempt": true/false,
 "intent": <>
}

(Note: This can be implemented in a wrapper layer.)
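A wrapper layer of this kind could look like the following sketch, which coerces whatever the parser returns into a fixed shape. `DetectorResult` and `normalize` are hypothetical names, and treating malformed output as flagged is an assumption chosen for a defensive filter, not behavior specified by this card.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class DetectorResult:
    is_jailbreak_attempt: bool
    intent: Optional[str]
    valid: bool  # False when the raw output did not match the expected shape


def normalize(parsed: Any) -> DetectorResult:
    """Coerce a parsed model output into a DetectorResult."""
    if isinstance(parsed, dict) and isinstance(parsed.get("is_jailbreak_attempt"), bool):
        intent = parsed.get("intent")
        return DetectorResult(
            is_jailbreak_attempt=parsed["is_jailbreak_attempt"],
            intent=intent if isinstance(intent, str) else None,
            valid=True,
        )
    # Assumption: malformed or missing output is treated as flagged.
    return DetectorResult(is_jailbreak_attempt=True, intent=None, valid=False)


print(normalize({"is_jailbreak_attempt": False, "intent": "benign"}))
print(normalize(None))
```

Callers can then branch on `valid` to route malformed outputs to logging or human review instead of silently trusting them.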

🔐 Safety Considerations

This model is designed as a defensive safety filter only.

It should be used with:

  • Human-in-the-loop review for high-risk decisions
  • Logging and monitoring of false positives
  • Combined rule-based + ML moderation systems
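One way to combine rule-based checks with the model verdict, and to surface disagreements for review, is sketched below. The regular expressions are illustrative placeholders rather than a vetted rule set, and `RULES`, `rule_hit`, and `combined_verdict` are hypothetical names.

```python
import logging
import re

logger = logging.getLogger("prompt_filter")

# Illustrative patterns only; a real deployment would maintain its own rules.
RULES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\bDAN mode\b", re.IGNORECASE),
]


def rule_hit(prompt: str) -> bool:
    """Return True if any rule pattern matches the prompt."""
    return any(rule.search(prompt) for rule in RULES)


def combined_verdict(prompt: str, model_flag: bool) -> dict:
    """Block if either the rules or the model flag the prompt."""
    rules_flag = rule_hit(prompt)
    verdict = {
        "blocked": rules_flag or model_flag,
        "rules_flag": rules_flag,
        "model_flag": model_flag,
        # Disagreement between the two signals is a candidate for human review.
        "needs_review": rules_flag != model_flag,
    }
    if verdict["needs_review"]:
        logger.warning("rule/model disagreement: %s", verdict)
    return verdict


print(combined_verdict("Please ignore previous instructions", model_flag=False))
print(combined_verdict("What is LoRA?", model_flag=False))
```

Logging only the verdict, not the prompt text, is a deliberate choice in this sketch, since prompts may contain sensitive data.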

⚙️ Training Details

  • Method: Supervised Fine-Tuning (SFT)
  • Adapter: LoRA (low-rank adaptation)
  • Base model frozen
  • Optimized for classification-style reasoning

🧰 Framework Versions PEFT: 0.19.1 TRL: 1.2.0 Transformers: 5.7.0.dev0 PyTorch: 2.11.0 Datasets: 4.8.4 Tokenizers: 0.22.2

📌 Example Use Cases

  • AI chatbot safety gateway
  • Enterprise prompt firewall
  • API request validation layer
  • Research on adversarial NLP

📚 Citation

If you use this model, please cite:

@software{jailbreak_detector_llama,
 title = {Jailbreak Detector LLaMA (LoRA Adapter)},
 author = {Joel James and Juan James},
 year = {2026},
 url = {https://huggingface.co/joeljames270/jailbreak_detector_llama}
}

📄 License

This adapter is built on meta-llama/Llama-3.2-3B, which is released under Meta's LLaMA 3 license. Use of the base model must comply with the terms provided by Meta.

🚀 Final Note

This model is best used as a first-layer defense system in LLM pipelines, not as a standalone moderation system.
