Latest Large Language Models Research Papers

Research on large language models including GPT, Claude, Llama, and other transformer-based architectures for natural language understanding and generation.

189 Papers

Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

Zorik Gekhman, Roee Aharoni, Eran Ofek +3 more

While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Never...

large language models · parametric knowledge recall · reasoning · computational buffer effect · factual priming · +3 more
Mar 10, 2026 · 41

How Far Can Unsupervised RLVR Scale LLM Training?

Bingxiang He, Yuxin Zuo, Zeyuan Liu +18 more

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitati...

unsupervised reinforcement learning · verifiable rewards · large language model training · intrinsic signals · reward derivation · +5 more
Mar 9, 2026 · 37

Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training

Chuxue Cao, Honglin Lin, Zhanping Zhong +5 more

Large Language Models (LLMs) have demonstrated strong general capabilities, yet their deployment in finance remains challenging due to dense domain-specific terminology, stringent numerical reasoning requirements, and low tolerance for factual errors. We conduct a controlled empirical study showing ...

Chain-of-Thought · SFT · RL · CoT distillation · difficulty-aware sampling · +5 more
Mar 7, 2026 · 11

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Junjie Li, Xinrui Guo, Yuhao Wu +3 more

What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, ...

large language models · long-form story generation · narrative consistency · story generation benchmarks · contradiction detection · +1 more
Mar 6, 2026 · 74

Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Yong Liu, Xingjian Su, Shiyu Wang +7 more

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial ...

Mixture-of-Experts · TimeMoE blocks · TimeSTP blocks · Serial-Token Prediction · long-term predictions · +5 more
Mar 5, 2026 · 13

Progressive Residual Warmup for Language Model Pretraining

Tianhao Chen, Xin Xu, Lu Yin +4 more

Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language ...

Transformer architectures · Large Language Models · pretraining stability · convergence speed · progressive residual warmup · +5 more
Mar 5, 2026 · 15

V_1: Unifying Generation and Self-Verification for Parallel Reasoners

Harman Singh, Xiuyu Li, Kusha Sareen +14 more

Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if corre...

test-time scaling · complex reasoning tasks · inference-time compute · sampling · aggregation · +15 more
Mar 4, 2026 · 13
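The sample-and-aggregate recipe this abstract describes can be illustrated with a toy majority vote over independently drawn answers. This is a generic sketch of test-time scaling, not the paper's method; `sample_answer` is a hypothetical stand-in for one LLM call.

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> str:
    # Hypothetical stand-in for one independent LLM sample: returns the
    # correct answer 60% of the time, otherwise one of two distractors.
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

def majority_vote(n_samples: int, seed: int = 0) -> str:
    # Test-time scaling: draw many independent solutions, then aggregate
    # by plurality vote over the final answers.
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

Because each sample is only modestly better than chance, the aggregate becomes reliable as the number of samples grows, which is why effectiveness hinges on verification or voting quality rather than on any single sample.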

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Qinsi Wang, Hancheng Ye, Jinhee Kim +12 more

Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore this, in this wo...

Structure of Thought · prompting technique · text-to-structure capabilities · multi-hop reasoning · end-to-end extraction · +5 more
Mar 4, 2026 · 109

Believe Your Model: Distribution-Guided Confidence Calibration

Xizhong Yang, Haotian Zhang, Huiming Wang +1 more

Large Reasoning Models have demonstrated remarkable performance with the advancement of test-time scaling techniques, which enhance prediction accuracy by generating multiple candidate responses and selecting the most reliable answer. While prior work has analyzed that internal model signals like c...

Gaussian Mixture Models · confidence scores · answer selection · distributional priors · reject filter · +4 more
Mar 4, 2026 · 38
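The "select the most reliable answer" step this abstract refers to can be sketched as a confidence-weighted vote: pool the model's confidence mass per distinct answer and keep the heaviest. A minimal sketch, not the paper's calibration scheme; the `(answer, confidence)` pairs are assumed inputs.

```python
def select_answer(candidates: list[tuple[str, float]]) -> str:
    # Confidence-weighted selection: sum the confidence assigned to each
    # distinct answer across candidates, return the answer with the most mass.
    totals: dict[str, float] = {}
    for answer, confidence in candidates:
        totals[answer] = totals.get(answer, 0.0) + confidence
    return max(totals, key=totals.get)
```

Note that this only helps if the confidence scores are calibrated: a single overconfident wrong candidate can outweigh several correct ones, which is the failure mode distribution-guided calibration targets.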

PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference

Rituraj Sharma, Weiyuan Chen, Noah Provenzano +1 more

DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, which creates a popul...

DEEPTHINK · candidate solutions · population enhancement · reasoning · Process Reward Model · +5 more
Mar 3, 2026 · 18

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning

Jiejun Tan, Zhicheng Dou, Liancheng Zhang +3 more

As Large Language Models (LLMs) are increasingly used for long-duration tasks, maintaining effective long-term memory has become a critical challenge. Current methods often face a trade-off between cost and accuracy. Simple storage methods often fail to retrieve relevant information, while complex i...

Large Language Models · long-term memory · memory retrieval · proxy model · Reinforcement Learning · +6 more
Mar 3, 2026 · 26
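The "simple storage" end of the trade-off this abstract mentions can be made concrete with a toy keyword-overlap retriever over stored memory entries. A deliberately naive sketch, assuming plain-string memories; real systems would use embeddings or, as here, a learned retrieval policy.

```python
from collections import Counter

def retrieve(memory: list[str], query: str, k: int = 2) -> list[str]:
    # Minimal keyword-overlap retriever: score each stored entry by how
    # many tokens it shares with the query, return the top-k entries.
    q = Counter(query.lower().split())
    def score(entry: str) -> int:
        return sum((Counter(entry.lower().split()) & q).values())
    return sorted(memory, key=score, reverse=True)[:k]
```

This baseline is cheap but brittle: it misses paraphrases ("prefers" vs "prefer") entirely, which is exactly the accuracy gap that motivates heavier retrieval pipelines.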

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Ziwen Xu, Kewei Xu, Haoming Xu +8 more

Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across thre...

Large Language Models · controllability · hierarchical benchmark · language features · sentiment · +4 more
Mar 3, 2026 · 21

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Xinyu Zhu, Yihao Feng, Yanchao Sun +5 more

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable sett...

Chain-of-Thought · supervised fine-tuning · reinforcement learning · large language models · synthetic reasoning dataset · +7 more
Mar 1, 2026 · 32

Spectral Condition for μP under Width-Depth Scaling

Chenyu Zheng, Rongzhen Wang, Xinyu Zhang +1 more

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization (μP) has provided a principled solution to both problems for wid...

generative foundation models · width-depth scaling · maximal update parameterization · spectral framework · residual networks · +6 more
Feb 28, 2026 · 14

Qwen3-Coder-Next Technical Report

Ruisheng Cao, Mouxiang Chen, Jiawei Chen +17 more

We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong t...

language model · parameter-efficient fine-tuning · agentic training · verifiable coding tasks · executable environments · +4 more
Feb 28, 2026 · 33

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

Alexander Samarin, Sergei Krutikov, Anton Shevtsov +3 more

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is largely determined by the acceptance rate, yet standard training minimizes ...

speculative decoding · autoregressive large language model · draft model · candidate tokens · target model · +5 more
Feb 27, 2026 · 16
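The acceptance mechanism this abstract builds on can be illustrated with the standard speculative-sampling rule: a drafted token is kept with probability min(1, p_target/p_draft), so accepted tokens follow the target distribution exactly. A generic sketch over toy dict distributions, not the paper's LK losses.

```python
import random

def speculative_accept(p_target: dict, p_draft: dict, token: str,
                       rng: random.Random) -> bool:
    # Standard acceptance test: keep the drafted token with probability
    # min(1, p_target(token) / p_draft(token)).
    ratio = p_target.get(token, 0.0) / p_draft[token]
    return rng.random() < min(1.0, ratio)

def expected_acceptance_rate(p_target: dict, p_draft: dict) -> float:
    # E[accept] = sum_t p_draft(t) * min(1, p_target(t)/p_draft(t))
    #           = sum_t min(p_draft(t), p_target(t)).
    return sum(min(p_draft[t], p_target.get(t, 0.0)) for t in p_draft)
```

When draft and target distributions coincide the expected acceptance rate is exactly 1, and it drops as they diverge, which is why optimizing the acceptance rate directly, rather than a generic distillation loss, is an appealing training objective for the draft model.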

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang +3 more

Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted...

large language models · hallucinated citations · citation checking · claim extraction · evidence retrieval · +6 more
Feb 26, 2026 · 15

Humans and LLMs Diverge on Probabilistic Inferences

Gaurav Kamath, Sreenath Madathil, Sebastian Schuster +2 more

Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performa...

probabilistic inferences · reasoning LLMs · logical reasoning · mathematical tasks · human-like distributions · +1 more
Feb 26, 2026 · 11

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Hanna Yukhymenko, Anton Alexandrov, Martin Vechev

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a full...

Large Language Model · multilingual evaluation · benchmark translation · semantic drift · context loss · +6 more
Feb 25, 2026 · 36
Page 1 of 10