AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu

Published: January 26, 2026
Authors: 43


Abstract

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and of seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
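
As a rough illustration of what a label "beyond binary" along the three taxonomy axes could look like, here is a minimal Python sketch. It is not AgentDoG's actual output format: the class names, field names, and the example values in the usage note are all hypothetical, and the abstract does not list the taxonomy's concrete categories.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskDiagnosis:
    """Hypothetical record with one free-text field per taxonomy axis."""
    source: str        # where the risk originates
    failure_mode: str  # how the agent fails
    consequence: str   # what harm results


@dataclass(frozen=True)
class GuardrailVerdict:
    """Hypothetical verdict: a binary flag plus an optional diagnosis."""
    unsafe: bool
    diagnosis: Optional[RiskDiagnosis] = None

    def explain(self) -> str:
        # A safe verdict carries no diagnosis.
        if not self.unsafe:
            return "safe"
        # An unsafe verdict without a diagnosis is just a binary label.
        if self.diagnosis is None:
            return "unsafe (no diagnosis)"
        d = self.diagnosis
        return (
            f"unsafe: source={d.source}; "
            f"failure_mode={d.failure_mode}; "
            f"consequence={d.consequence}"
        )
```

For example, `GuardrailVerdict(True, RiskDiagnosis("tool output", "followed injected instruction", "data leak")).explain()` yields a one-line description covering all three axes. The values here are placeholders for illustration, not categories taken from the paper or from ATBench.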

Keywords

agentic guardrail, three-dimensional taxonomy, agentic safety benchmark, Diagnostic Guardrail framework, agent safety and security, agent trajectories, root cause diagnosis, fine-grained monitoring, model variants, state-of-the-art performance
