ADAL: AI-Generated Text Detection using Adversarial Learning

Adversarially trained AI-generated text detector based on the RADAR framework (Hu et al., NeurIPS 2023), extended with a multi-evasion attack pool for robust detection.

Overview

ADAL extends the RADAR framework (Hu et al., NeurIPS 2023) to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so that it evades detection, while the detector learns to remain robust to those rewrites. The resulting detector generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.

Best result: macro AUROC of 0.9951 across all 11 RAID generators, robust to all five attack types.
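
As an illustration of how a macro AUROC of this kind can be computed, the sketch below averages per-generator AUROC values. It assumes that "macro" means an unweighted mean over generators, with each generator's AI texts scored against a shared pool of human texts. The function names and the toy scores are hypothetical and are not taken from the ADAL code.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def macro_auroc(ai_scores_by_generator, human_scores):
    """Unweighted mean of per-generator AUROC (assumed definition of 'macro')."""
    per_gen = {
        gen: auroc(scores, human_scores)
        for gen, scores in ai_scores_by_generator.items()
    }
    return sum(per_gen.values()) / len(per_gen), per_gen


# Hypothetical P(AI) scores, for illustration only.
human = [0.05, 0.10, 0.40]
by_gen = {
    "gpt2": [0.90, 0.80, 0.95],     # fully separated from the human scores
    "mistral": [0.30, 0.70, 0.85],  # one AI text scores below a human text
}
macro, per_gen = macro_auroc(by_gen, human)
print(per_gen, round(macro, 4))
```

The pairwise definition is quadratic in the number of samples, so for a full benchmark a rank-based implementation from a metrics library would be the more practical choice.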

Training

Base model: roberta-large
Dataset: RAID (Dugan et al., ACL 2024)
Evasion attacks seen during training: t5_paraphrase, synonym_replacement, homoglyphs, article_deletion, misspelling
Best macro AUROC: 0.9951
Generators: chatgpt, gpt2, gpt3, gpt4, cohere, cohere-chat, llama-chat, mistral, mistral-chat, mpt, mpt-chat
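
To give a feel for what the simpler evasion attacks do to a text, the sketch below implements toy versions of two of the attack types named above, article_deletion and homoglyphs. These are simplified illustrations written for this card, not the RAID or ADAL implementations. The homoglyph table is a small hypothetical sample.

```python
import re

# Small, hypothetical Latin -> Cyrillic look-alike table (illustrative only).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}


def article_deletion(text):
    """Drop the English articles 'a', 'an' and 'the' (case-insensitive)."""
    without = re.sub(r"\b(?:a|an|the)\b\s*", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", without).strip()


def homoglyphs(text):
    """Swap selected Latin letters for visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


sample = "The model wrote an essay about a cat."
print(article_deletion(sample))
print(homoglyphs(sample) != sample)  # looks the same on screen, differs in code points
```

Both transformations leave the text readable to a person while changing the token sequence the detector sees, which is why exposing the detector to them during training can help.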

Architecture

RAID train split (attack='none')
       │
       ▼
┌────────────┐      ┌─────────────────────────────────┐
│  xm (AI)   │─────▶│ Gσ: Paraphraser (T5-base)       │──▶ xp_ppo
└────────────┘      │ ramsrigouthamg/t5_paraphraser   │
                    └─────────────────────────────────┘
                                     │
                            PPO reward R(xp, ϕ)
                                     │
┌────────────┐      ┌─────────────────────────────────┐
│ xh (human) │─────▶│ Dϕ: Detector (RoBERTa-large)    │──▶ AUROC
│ xm (AI)    │─────▶│ roberta-large                   │
│ xp_ppo     │─────▶│ (trained via reweighted         │
│ xp_det_k   │─────▶│  logistic loss)                 │
└────────────┘      └─────────────────────────────────┘
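
The two training signals in the diagram can be sketched as follows. This is a minimal sketch assuming the formulation in the RADAR paper: the detector output D(x) is the predicted probability that x is human-written, the paraphraser's reward is that probability for its rewrite, and the detector minimises a logistic loss whose AI-side terms are scaled by a weight. The weight value 0.5, the function names, and the toy probabilities are assumptions for illustration. ADAL's actual settings are not stated in this card.

```python
import math


def ppo_reward(p_human_of_paraphrase):
    """Reward for the paraphraser: higher when the detector believes the rewrite is human."""
    return p_human_of_paraphrase


def detector_loss(p_human_xh, p_human_xm, p_human_xp, lam=0.5):
    """Reweighted logistic loss over human (xh), AI (xm) and rewritten AI (xp) texts.

    Each argument is a list of detector outputs D(x) = P(human | x).
    `lam` scales the two AI-side terms (assumed value).
    """
    eps = 1e-12  # guards against log(0)
    loss_h = -sum(math.log(p + eps) for p in p_human_xh) / len(p_human_xh)
    loss_m = -sum(math.log(1 - p + eps) for p in p_human_xm) / len(p_human_xm)
    loss_p = -sum(math.log(1 - p + eps) for p in p_human_xp) / len(p_human_xp)
    return loss_h + lam * (loss_m + loss_p)


# Toy detector outputs, for illustration only.
good = detector_loss([0.9, 0.8], [0.1, 0.2], [0.3, 0.2])    # rewrites still flagged as AI
fooled = detector_loss([0.9, 0.8], [0.1, 0.2], [0.9, 0.8])  # rewrites pass as human
print(good < fooled)
```

In this sketch xp stands for any rewritten AI text, whether it comes from the PPO-trained paraphraser (xp_ppo) or from the attack pool (xp_det_k). The paraphraser is rewarded for exactly the rewrites that raise the detector's loss, which is what makes the game adversarial.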

Usage

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the tokenizer and the fine-tuned detector from the Hugging Face Hub.
tokenizer = RobertaTokenizer.from_pretrained("Shushant/ADAL_AI_Detector")
model = RobertaForSequenceClassification.from_pretrained("Shushant/ADAL_AI_Detector")
model.eval()  # disable dropout for inference

text = "Your text here."
# Inputs longer than 512 tokens are truncated.
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    # Softmax over the two logits: index 0 = AI-generated, index 1 = human-written.
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
print(f"P(human)={probs[1]:.3f} P(AI)={probs[0]:.3f}")

Label mapping

Index 0 → AI-generated
Index 1 → Human-written
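
The mapping above can be applied to raw logits as in the following sketch, which needs only the standard library. The helper names and the 0.5 cut-off are assumptions for illustration. A different threshold may suit applications where false accusations of AI authorship are costly.

```python
import math

ID2LABEL = {0: "AI-generated", 1: "Human-written"}  # mapping stated above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, ai_threshold=0.5):
    """Return (label, P(AI)); flags text as AI when P(AI) >= ai_threshold (assumed cut-off)."""
    p_ai = softmax(logits)[0]
    label = ID2LABEL[0] if p_ai >= ai_threshold else ID2LABEL[1]
    return label, p_ai


print(classify([2.0, -1.0]))   # index 0 dominates
print(classify([-0.5, 1.5]))   # index 1 dominates
```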

Author

**Shushanta Pudasaini**
PhD Researcher, Technological University Dublin
Supervisors: Dr. Marisa Llorens Salvador · Dr. Luis Miralles-Pechuán · Dr. David Lillis

Model size: 0.4B params (Safetensors, tensor type F32)

Paper: arXiv 2307.03838 (published Jul 7, 2023)

URL: https://huggingface.co/Shushant/ADAL_AI_Detector
