
Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Shuangshuang Ying, Zheyu Wang, Yunjian Peng, Jin Chen, Yuhao Wu, Hongbin Lin, Dingyu He, Siyi Liu, Gengchen Yu, YinZhu Piao, Yuchen Wu, Xin Gui, Zhongyuan Peng, Xin Li, Xeron Du, Libo Qin, YiXin Cao, Ge Zhang, Stephen Huang
Published: January 29, 2026
Authors: 19

Abstract

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving the core difficulties of deep search: multi-step synthesis, denoising, and drawing evidence-based conclusions. DeR2 decouples evidence access from reasoning via four regimes--Instruction-only, Concepts (gold concepts without documents), Related-only (relevant documents only), and Full-set (relevant documents plus topically related distractors)--yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
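The regime-gap idea can be sketched as simple score differences. The sketch below is illustrative only: the gap definitions and the example accuracies are assumptions for exposition, not the paper's exact formulas or results.

```python
def regime_gaps(acc):
    """Attribute errors from per-regime accuracies (hypothetical definitions).

    acc: dict with keys 'instruction', 'concepts', 'related', 'full',
    each an accuracy in [0, 1] under the corresponding DeR2 regime.
    """
    return {
        # Drop when distractors are added: attributed to retrieval/denoising.
        "retrieval_loss": acc["related"] - acc["full"],
        # Shortfall even with gold concepts: attributed to reasoning.
        "reasoning_loss": 1.0 - acc["concepts"],
        # Benefit of relevant documents over parametric knowledge alone.
        "evidence_gain": acc["related"] - acc["instruction"],
    }

# Made-up numbers purely to show the arithmetic.
gaps = regime_gaps({"instruction": 0.12, "concepts": 0.71,
                    "related": 0.58, "full": 0.43})
```

A negative `evidence_gain` under the Full-set regime, relative to Instruction-only, would correspond to the mode-switch fragility the abstract describes.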

Keywords

large language models, document-grounded reasoning, retrieval-augmented generation, deep search, multi-step synthesis, denoising, evidence-based conclusion making, parametric memorization, open-web volatility, controlled deep-research sandbox, retrieval loss, reasoning loss, error attribution, two-phase validation, parametric failure, oracle-concept solvability, frozen document library, expert-annotated concepts, validated rationales
