Large Language Models

CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval

Tsung-Hsiang Chou, Chen-Jui Yu, Shui-Hsiang Hsu, Yao-Chung Fan
Published: January 22, 2026
Authors: 4
Word count: 3,053

CGPT enhances table retrieval with LLM-generated supervision and cluster-guided partial tables.

Abstract

General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. In a unified multi-domain corpus setting, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at https://github.com/yumeow0122/CGPT.
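The cluster-guided step described in the abstract can be sketched as follows: embed the table's rows, run K-means over the row embeddings, and sample one row per cluster so the resulting partial table spans the table's semantic range. This is a minimal illustration, not the paper's implementation; the function name, the choice of `k`, and the per-cluster sampling policy are assumptions.

```python
# Hedged sketch of cluster-guided partial-table construction.
# Assumes row embeddings have already been computed (e.g. by any
# general-purpose embedding model); names here are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def partial_table(rows, row_embeddings, k=3, seed=0):
    """Select k rows, one drawn from each K-means cluster of the
    row embeddings, to broaden the partial table's semantic coverage."""
    k = min(k, len(rows))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(row_embeddings)
    rng = np.random.default_rng(seed)
    picked = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)  # rows in cluster c
        picked.append(int(rng.choice(members)))    # sample one per cluster
    return [rows[i] for i in sorted(picked)]       # preserve table order
```

A heuristic alternative (e.g. taking the first k rows) can miss whole regions of a table's content; sampling across clusters guarantees each selected row comes from a distinct semantic group.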

Key Takeaways

  • CGPT improves table retrieval R@1 by 16.54% on average.

  • Uses LLM-generated synthetic queries and cluster-guided partial tables.

  • Shows strong performance across multiple benchmarks and domains.
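The hard-negative contrastive fine-tuning mentioned above can be illustrated with an InfoNCE-style loss: each LLM-generated synthetic query is pulled toward its source partial table and pushed away from similar but incorrect tables. This NumPy sketch shows only the loss computation; a real setup would run inside a deep-learning framework, and the temperature value and function names are assumptions, not the paper's settings.

```python
# Hedged sketch of an InfoNCE-style loss with hard negatives,
# as used in contrastive fine-tuning of embedding models.
import numpy as np

def info_nce_loss(query, positive, hard_negatives, temperature=0.05):
    """Loss for one synthetic-query embedding against its source
    partial-table embedding (positive) and hard-negative table embeddings."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Positive similarity sits at index 0, followed by the hard negatives.
    logits = np.array([cos(query, positive)]
                      + [cos(query, n) for n in hard_negatives]) / temperature
    logits -= logits.max()                         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over candidates
    return -np.log(probs[0])                       # -log p(positive)
```

Minimizing this loss sharpens the embedding space exactly where retrieval fails: hard negatives are tables the current model already ranks highly for the query, so the gradient concentrates on the most confusable cases.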

Limitations

  • Depends on the quality of LLM-generated queries.

  • Requires careful tuning of clustering and sampling strategies.

Keywords

embedding models, table retrieval, LLM-based retrieval augmentation, synthetic queries, hard-negative contrastive fine-tuning, K-means clustering, semantic compression, query-table mismatch, cross-domain generalization
