🛡️ jailbreak_detector_llama

🧠 Overview

jailbreak_detector_llama is a lightweight LoRA adapter fine-tuned on top of meta-llama/Llama-3.2-3B for detecting:

Jailbreak attempts
Prompt injection attacks
Policy bypass instructions
Adversarial or unsafe user inputs

It is designed for LLM safety pipelines and moderation systems.

🧩 Model Architecture

Base Model: meta-llama/Llama-3.2-3B
Fine-tuning: LoRA (PEFT)
Training Method: Supervised Fine-Tuning (SFT)
Task Type: Binary + semantic intent classification
Frameworks: Transformers + TRL + PEFT

🎯 Intended Use

This model is intended for:

Pre-filtering user prompts before LLM inference
Detecting jailbreak / prompt injection attempts
Security layer in AI agents and chatbots
Research in adversarial prompt detection
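
A minimal sketch of the pre-filtering pattern, assuming the detector and the downstream LLM are both exposed as plain Python callables. The names `guarded_call`, `GatewayResult`, `detect`, and `generate` are hypothetical and not part of this model's API. The sketch blocks whenever the detector does not return an explicit `false`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GatewayResult:
    allowed: bool          # True if the prompt was forwarded downstream
    reply: Optional[str]   # Downstream reply, or None when blocked
    reason: str            # Short explanation suitable for logs


def guarded_call(
    prompt: str,
    detect: Callable[[str], Optional[dict]],
    generate: Callable[[str], str],
) -> GatewayResult:
    """Run the detector first; call the downstream LLM only if it passes."""
    verdict = detect(prompt)

    # Fail closed: anything other than an explicit False counts as flagged.
    passed = isinstance(verdict, dict) and verdict.get("is_jailbreak_attempt") is False
    if not passed:
        intent = verdict.get("intent") if isinstance(verdict, dict) else None
        return GatewayResult(False, None, f"blocked: {intent}")

    return GatewayResult(True, generate(prompt), "passed detector")
```

Here `detect` would wrap the generation and parsing steps from the Inference section, and `generate` would wrap whichever LLM the pipeline protects.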

🚫 Not Intended For

General-purpose chat generation
Legal or medical advice
Autonomous decision making
High-stakes moderation without human oversight

📥 Installation

pip install transformers peft torch accelerate

🚀 Quick Start

Load model (LoRA adapter)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "meta-llama/Llama-3.2-3B"
adapter = "joeljames270/jailbreak_detector_llama"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load the frozen base model in half precision
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter)
model.eval()

🔍 Inference

import json
import re

# Assumes torch, tokenizer, and model from the loading step above

def parse_model_output(text):
    """Extract the JSON verdict that follows "Assistant:" in the decoded text."""
    try:
        # Keep only the text after the first "Assistant:" marker
        assistant_text = text.split("Assistant:", 1)[-1].strip()

        # Match from the first "{" to the last "}"
        match = re.search(r"\{.*\}", assistant_text, re.DOTALL)
        if not match:
            return None

        json_str = match.group(0)
        return json.loads(json_str)

    except Exception as e:
        return {"error": str(e), "raw": text}


question = "How to make a bomb?"

prompt = f"User: {question}\nAssistant:"

inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

output = parse_model_output(response)

# parse_model_output returns None when no JSON object is found
if output is None:
    print("No JSON object found in model output")
else:
    print("is_jailbreak_attempt:", output.get("is_jailbreak_attempt"))
    print("intent:", output.get("intent"))
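
The regex in parse_model_output matches from the first "{" to the last "}", so parsing fails if the generation continues past the verdict and emits another brace. As a hedged alternative, the sketch below scans for the first substring that decodes as a JSON object using the standard library's `json.JSONDecoder.raw_decode`. The helper name `first_json_object` is hypothetical, and whether the model actually produces trailing text is an assumption.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first valid JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not a valid object here; try the next opening brace.
        start = text.find("{", start + 1)
    return None
```

This is best applied to the generated continuation only (for example, the text after "Assistant:"), since a user prompt may itself contain a JSON object.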

⚠️ Known Limitations

Sensitive to prompt formatting and chat templates
May misclassify creative writing prompts as jailbreaks
Not calibrated for multilingual adversarial prompts
Requires threshold tuning for production use
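
Given these limitations, it may help to measure false positives on your own labeled prompts before deployment. The sketch below is one way to do that, assuming a `detect` callable that returns True when a prompt is flagged. The function names are hypothetical, and no evaluation data ships with this model card.

```python
from typing import Callable, Dict, Iterable, Tuple


def confusion_counts(
    examples: Iterable[Tuple[str, bool]],
    detect: Callable[[str], bool],
) -> Dict[str, int]:
    """Count true/false positives and negatives over (prompt, is_attack) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for prompt, is_attack in examples:
        flagged = bool(detect(prompt))
        if flagged:
            counts["tp" if is_attack else "fp"] += 1
        else:
            counts["fn" if is_attack else "tn"] += 1
    return counts


def rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive false positive rate and recall from confusion counts."""
    benign = counts["fp"] + counts["tn"]
    attacks = counts["tp"] + counts["fn"]
    return {
        "false_positive_rate": counts["fp"] / benign if benign else 0.0,
        "recall": counts["tp"] / attacks if attacks else 0.0,
    }
```

Including benign creative-writing prompts in the labeled set makes the false positive rate reflect the misclassification risk noted above.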

🧠 Output Behavior (Recommended)

For inference, interpret outputs as:

{
 "is_jailbreak_attempt": true/false,
 "intent": <>
}

(Note: This can be implemented in a wrapper layer.)
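
One possible shape for such a wrapper layer is sketched below. It coerces whatever the parser returns into a fixed record and treats malformed output as flagged. The `Verdict` class and `to_verdict` function are hypothetical names, and the fail-closed default is a design assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Verdict:
    is_jailbreak_attempt: bool
    intent: Optional[str]
    valid: bool  # False when the raw output did not match the expected shape


def to_verdict(parsed: Any) -> Verdict:
    """Coerce parsed model output into a Verdict, failing closed on bad input."""
    if isinstance(parsed, dict):
        flag = parsed.get("is_jailbreak_attempt")
        # Tolerate string booleans such as "true" or "False".
        if isinstance(flag, str):
            flag = {"true": True, "false": False}.get(flag.strip().lower())
        if isinstance(flag, bool):
            intent = parsed.get("intent")
            return Verdict(flag, str(intent) if intent is not None else None, True)
    # Missing, malformed, or error output: flag it and mark it invalid.
    return Verdict(True, None, False)
```

Downstream code can then branch on `valid` to route unparseable outputs to review instead of silently passing them.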

🔐 Safety Considerations

This model is designed as a defensive safety filter only.

It should be used with:

Human-in-the-loop review for high-risk decisions
Logging and monitoring of false positives
Combined rule-based + ML moderation systems
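
As an illustration of combining rule-based and ML moderation with logging, the sketch below blocks when either layer fires and marks disagreements for human review. The patterns in `RULES`, the logger name, and the `moderate` function are hypothetical examples, and `ml_flag` stands in for a callable built on this detector.

```python
import logging
import re
from typing import Callable, Dict

logger = logging.getLogger("prompt_firewall")

# Hypothetical example patterns; a real deployment would maintain its own list.
RULES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\bdeveloper mode\b", re.IGNORECASE),
]


def moderate(prompt: str, ml_flag: Callable[[str], bool]) -> Dict[str, bool]:
    """Combine rule hits with the ML verdict and log the decision."""
    rule_hit = any(rule.search(prompt) for rule in RULES)
    ml_hit = bool(ml_flag(prompt))
    decision = {
        "blocked": rule_hit or ml_hit,
        "rule_hit": rule_hit,
        "ml_hit": ml_hit,
        # Disagreement between layers is a candidate for human review.
        "needs_review": rule_hit != ml_hit,
    }
    logger.info("moderation decision: %s", decision)
    return decision
```

Logging every decision, rather than only blocks, leaves a record from which false positives can be audited later.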

⚙️ Training Details

Method: Supervised Fine-Tuning (SFT)
Adapter: LoRA (low-rank adaptation)
Base model frozen
Optimized for classification-style reasoning

🧰 Framework Versions

PEFT: 0.19.1
TRL: 1.2.0
Transformers: 5.7.0.dev0
PyTorch: 2.11.0
Datasets: 4.8.4
Tokenizers: 0.22.2

📌 Example Use Cases

AI chatbot safety gateway
Enterprise prompt firewall
API request validation layer
Research on adversarial NLP

📚 Citation

If you use this model, please cite:

@software{jailbreak_detector_llama,
  title = {Jailbreak Detector LLaMA (LoRA Adapter)},
  author = {Joel James and Juan James},
  year = {2026},
  url = {https://huggingface.co/joeljames270/jailbreak_detector_llama}
}

📄 License

This adapter is built on meta-llama/Llama-3.2-3B, which is distributed under Meta’s Llama 3.2 Community License. Use of the base model must comply with the terms provided by Meta.

🚀 Final Note

This model is best used as a first-layer defense system in LLM pipelines, not as a standalone moderation system.

URL: https://huggingface.co/joeljames270/jailbreak_detector_llama