Large Language Models

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
Published
March 4, 2026
Authors
15
Word Count
24,329
Code
Includes code

Structure of Thought prompting and T2S-Bench benchmark improve LLM text processing through explicit intermediate representations.

Abstract

Think about how humans handle complex reading tasks: marking key points, inferring the relationships among them, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building on this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation of 45 mainstream models reveals substantial room for improvement: average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves just 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code are released at https://t2s-bench.github.io/T2S-Bench-Page/.
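The abstract describes SoT as a two-stage idea: first prompt the model to extract an explicit intermediate structure from the text, then answer the task conditioned on that structure. A minimal sketch of such a pipeline is below; the prompt wording and the triple-based structure format are illustrative assumptions, not the paper's actual templates.

```python
# Hedged sketch of Structure-of-Thought (SoT) style prompting.
# Stage 1: ask the model to extract an explicit structure from the passage.
# Stage 2: answer the question conditioned on the passage AND that structure.
# The wording and the "entity -> relation -> entity" format are assumptions
# for illustration; the paper's exact templates may differ.

def structure_prompt(passage: str) -> str:
    """Stage-1 prompt: elicit an intermediate text structure."""
    return (
        "Read the passage and list its key entities and the relations "
        "between them as 'entity -> relation -> entity' triples.\n\n"
        f"Passage:\n{passage}"
    )

def answer_prompt(passage: str, structure: str, question: str) -> str:
    """Stage-2 prompt: answer using both the passage and the structure."""
    return (
        "Using the passage and the extracted structure below, "
        "answer the question.\n\n"
        f"Passage:\n{passage}\n\n"
        f"Structure:\n{structure}\n\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    passage = "Alice founded Acme in 2010. Acme acquired Beta Corp in 2020."
    # In practice, stage 1's output would come from an LLM call.
    structure = ("Alice -> founded -> Acme\n"
                 "Acme -> acquired -> Beta Corp")
    print(answer_prompt(passage, structure, "Who founded Acme?"))
```

The point of the second stage is that the model reasons over an explicit, compact representation of the text rather than the raw passage alone, which is what the benchmark's extraction tasks evaluate directly.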

Key Takeaways

  1. Structure of Thought prompting consistently improves LLM performance across eight diverse text-processing tasks and three model families (e.g., an average +5.7% on Qwen2.5-7B-Instruct).

  2. T2S-Bench provides the first comprehensive benchmark for evaluating text-to-structure capabilities, with 1.8K samples spanning 6 scientific domains and 32 structural types.

  3. Fine-tuning on T2S-Bench raises the average downstream gain to +8.6%, demonstrating the value of explicit text structuring.

Limitations

  • Even state-of-the-art models achieve only 58.1% node accuracy on end-to-end extraction tasks, indicating substantial remaining challenges.

  • Current approaches remain task-specific and heavily reliant on particular input structures, limiting generalization across diverse text tasks.

Keywords

Structure of Thought, prompting technique, text-to-structure capabilities, multi-hop reasoning, end-to-end extraction, language model performance, text processing tasks, T2S-Bench, scientific domains, structural types
