
Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou
Published: February 3, 2026
Authors: 13
Word Count: 10,502
Code: Includes code

Revolutionizing research report generation with human-aligned rubrics.

Abstract

Training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals, so rubric-based evaluation has become common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned, query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our rubric generators deliver more discriminative and more human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
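The hybrid reward described in the abstract can be sketched as follows. This is an illustrative assumption about the idea, not the authors' actual implementation: a generated query-specific rubric earns reward when (a) scoring the two paired reports under that rubric reproduces the human preference, and (b) an LLM judge rates the rubric itself as well-formed. The function names (`score_under_rubric`, `judge_rubric_quality`) and the mixing weight `alpha` are hypothetical.

```python
def hybrid_reward(rubric, report_a, report_b, human_prefers_a,
                  score_under_rubric, judge_rubric_quality,
                  alpha=0.5):
    """Hypothetical hybrid reward for a rubric generator.

    score_under_rubric(rubric, report) -> float: scores a report under the rubric.
    judge_rubric_quality(rubric) -> float in [0, 1]: LLM-based rubric evaluation.
    alpha: weight trading off preference agreement against rubric quality.
    """
    # Preference term: 1.0 if the rubric ranks the reports as humans did.
    rubric_prefers_a = (score_under_rubric(rubric, report_a)
                        > score_under_rubric(rubric, report_b))
    preference_reward = 1.0 if rubric_prefers_a == human_prefers_a else 0.0

    # Quality term: an LLM judge's direct assessment of the rubric text.
    quality_reward = judge_rubric_quality(rubric)

    # Weighted mixture of the two supervision signals.
    return alpha * preference_reward + (1.0 - alpha) * quality_reward
```

In an RL loop, this scalar would serve as the reward for the policy that generates the rubric; the choice of `alpha` controls how strongly human-preference agreement dominates the LLM-judged rubric quality.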

Key Takeaways

  1. Automated generation of human-preference-aligned rubrics for report evaluation.

  2. Utilizes a large-scale human preference dataset for training.

  3. Enhances performance of DeepResearch report generation models.

Limitations

  • Pairwise comparisons may not capture complex preferences.

  • Limited to specific DeepResearch tasks and domains.

Keywords

reinforcement learning, hybrid reward, human preference supervision, LLM-based rubric evaluation, Multi-agent Markov-state, DeepResearch Bench, open-source baselines, closed-source models
