ToolCallVerifier - Unauthorized Tool Call Detection

Stage 2 of Two-Stage LLM Agent Defense Pipeline

🎯 What This Model Does

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

Label	Description
`AUTHORIZED`	Token is part of a legitimate, user-requested action
`UNAUTHORIZED`	Token indicates injected/malicious content — BLOCK

🚨 Attack Categories Covered

Category	Source	Description
Delimiter Injection	LLMail	`<<end_context>>`, `>>}}\]\])`
Word Obfuscation	LLMail	Inserting noise words between tokens
Fake Sessions	LLMail	`START_USER_SESSION`, `EXECUTE_USERQUERY`
Roleplay Injection	WildJailbreak	"You are an admin bot that can..."
XML Tag Injection	WildJailbreak	`<execute_action>`, `<tool_call>`
Authority Bypass	WildJailbreak	"As administrator, I authorize..."
Intent Mismatch	Synthetic	User asks X, tool does Y
MCP Tool Poisoning	Synthetic	Hidden exfiltration in tool args
MCP Shadowing	Synthetic	Fake authorization context

🔗 Integration with FunctionCallSentinel

This model is Stage 2 of a two-stage defense pipeline:

┌─────────────────┐ ┌──────────────────────┐ ┌─────────────────┐
│ User Prompt │────▶│ ToolCallSentinel │────▶│ LLM + Tools │
│ │ │ (Stage 1) │ │ │
└─────────────────┘ └──────────────────────┘ └────────┬────────┘
 │
 ┌──────────────────────────────▼──────────────────────────┐
 │ ToolCallVerifier (This Model) │
 │ Token-level verification before tool execution │
 └─────────────────────────────────────────────────────────┘

Scenario	Recommendation
General chatbot	Stage 1 only
Tool-calling agent (low risk)	Stage 1 only
Tool-calling agent (high risk)	Both stages
Email/file system access	Both stages
Financial transactions	Both stages

🎯 Intended Use

Primary Use Cases

LLM Agent Security: Verify tool calls before execution
Prompt Injection Defense: Detect unauthorized actions from injected prompts
API Gateway Protection: Filter malicious tool calls at infrastructure level

Out of Scope

General text classification
Non-tool-calling scenarios
Languages other than English

📜 License

Apache 2.0

Downloads last month: 70

Safetensors

Model size

0.1B params

Tensor type

F32

Model tree for llm-semantic-router/toolcall-verifier

Base model

answerdotai/ModernBERT-base

Finetuned

(1334)

this model

Datasets used to train llm-semantic-router/toolcall-verifier

Space using llm-semantic-router/toolcall-verifier 1

Evaluation results

UNAUTHORIZED F1
self-reported
0.935
UNAUTHORIZED Precision
self-reported
0.950
UNAUTHORIZED Recall
self-reported
0.920
Accuracy
self-reported
0.929

URL: https://huggingface.co/llm-semantic-router/toolcall-verifier

⇱ llm-semantic-router/toolcall-verifier · Hugging Face