AI Safety & Alignment

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
Published: January 26, 2026
Authors: 12

Abstract

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
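The closed loop described above can be sketched in miniature: the attacker generates adversarial prompts, the defender responds, and the evaluator scores each response, with complementary rewards driving the co-improvement of attacker and defender. This is an illustrative sketch only; the role implementations, reward rules, and three-way response taxonomy shown here are assumptions standing in for the paper's actual LLM policies and training procedure.

```python
import random

# Hypothetical stand-ins for the three roles (the paper uses LLM policies).
def attacker(seed_topics, rng):
    """Generate an adversarial prompt (stand-in for the attacker policy)."""
    return f"adversarial prompt about {rng.choice(seed_topics)}"

def defender(prompt, rng):
    """Produce a response (stand-in for the defender policy)."""
    return rng.choice(["unsafe response", "simple refusal", "useful guidance"])

def evaluator(response):
    """Score a response, distinguishing the three outcomes the abstract names:
    unsafe responses, simple refusals, and useful guidance."""
    return {"unsafe response": 0.0, "simple refusal": 0.5, "useful guidance": 1.0}[response]

def triplay_iteration(seed_topics, n_rollouts, rng):
    """One closed-loop iteration: collect rollouts with per-role rewards
    that would feed an RL update for attacker and defender."""
    rollouts = []
    for _ in range(n_rollouts):
        prompt = attacker(seed_topics, rng)
        response = defender(prompt, rng)
        safety_score = evaluator(response)
        rollouts.append({
            "prompt": prompt,
            "response": response,
            # Assumed reward shaping: the attacker is rewarded when the
            # defender fails (low safety score); the defender is rewarded
            # for safe, useful responses rather than bare refusals.
            "attacker_reward": 1.0 - safety_score,
            "defender_reward": safety_score,
        })
    return rollouts

rng = random.Random(0)
batch = triplay_iteration(["topic A", "topic B"], n_rollouts=4, rng=rng)
```

In the actual framework each role would be updated from these rollouts (e.g. via policy-gradient RL), and the evaluator itself is refined across iterations; the sketch only shows the data flow of one loop pass.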

Keywords

reinforcement learning, large language models, safety alignment, adversarial prompt generation, safety defense, response assessment, closed-loop framework, co-improving collaboration, iterative learning
