Latest Natural Language Processing Research Papers

Research on text understanding, sentiment analysis, named entity recognition, and linguistic AI applications.

20 Papers

OpenAutoNLU: Open Source AutoML Library for NLU

Grigory Arshinov, Aleksandr Boriskin, Sergey Senichev +4 more

OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration fr...

automated machine learning, natural language understanding, text classification, named entity recognition, data-aware training regime selection, +3 more

Mar 2, 2026 · 40

InfoNCE Induces Gaussian Distribution

Roy Betser, Eyal Gofer, Meir Yossef Levi +1 more

Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objec...

contrastive learning, InfoNCE, representation learning, Gaussian structure, multivariate Gaussian distribution, +6 more

Feb 27, 2026 · 11

ManCAR: Manifold-Constrained Latent Reasoning with Adaptive Test-Time Computation for Sequential Recommendation

Kun Yang, Yuxuan Zhu, Yazhe Chen +7 more

Sequential recommendation increasingly employs latent multi-step reasoning to enhance test-time computation. Despite empirical gains, existing approaches largely drive intermediate reasoning states via target-dominant objectives without imposing explicit feasibility constraints. This results in late...

latent multi-step reasoning, latent drift, collaborative manifold, global interaction graph, local intent prior, +5 more

Feb 23, 2026 · 23

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko +4 more

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combin...

text embedding models, contrastive loss, model distillation, semantic similarity, information retrieval, +5 more

Feb 17, 2026 · 15

Revisiting the Platonic Representation Hypothesis: An Aristotelian View

Fabian Gröger, Shuo Wen, Maria Brbić

The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can system...

representational similarity, neural networks, spectral measures, neighborhood similarity, permutation-based null-calibration framework, +2 more

Feb 16, 2026 · 9

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Anton Korznikov, Andrey Galichin, Alexey Dontsov +3 more

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a...

Sparse Autoencoders, neural networks, activations, explained variance, interpretability, +2 more

Feb 15, 2026 · 55

STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

Zachary Bamberger, Till R. Saenger, Gilad Morad +3 more

Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control ...

Best-of-N, Tree-of-Thoughts, high-temperature sampling, inference-time compute, textual interventions, +5 more

Feb 15, 2026 · 18

Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training

Junxiao Liu, Zhijun Wang, Yixiao Li +6 more

Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding...

multilingual reasoning, translation reasoning integrated training, cross-lingual question alignment, COMET, FLORES-200, +1 more

Feb 5, 2026 · 16

Semantic Search over 9 Million Mathematical Theorems

Luke Alexander, Eric Leonen, Sophie Szeto +4 more

Searching for mathematical results remains difficult: most existing tools retrieve entire papers, while mathematicians and theorem-proving agents often seek a specific theorem, lemma, or proposition that answers a query. While semantic search has seen rapid progress, its behavior on large, highly te...

semantic search, theorem retrieval, natural-language description, retrieval representation, language model choice, +4 more

Feb 5, 2026 · 17

No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov

We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retriev...

nllb-200, LoRA, synthetic data, chrF++, DeepSeek-V3.2, +2 more

Feb 4, 2026 · 4

PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

Panagiotis Koromilas, Andreas D. Demou, James Oldfield +2 more

Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot captur...

sparse autoencoders, dictionary atoms, feature interactions, polynomial decoding, tensor factorization, +3 more

Feb 1, 2026 · 8

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Mohammad Rifqi Farhansyah, Hanif Muhammad Zhafran, Farid Adilazuarda +6 more

Code-switching is a widespread practice among the world's multilingual majority, yet few benchmarks accurately reflect its complexity in everyday communication. We present PingPong, a benchmark for natural multi-party code-switching dialogues covering five language-combination variations, some of wh...

code-switching, multilingual discourse, natural language processing, language models, dialogue understanding

Jan 24, 2026 · 5

STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion

Shui-Hsiang Hsu, Tsung-Hsiang Chou, Chen-Jui Yu +1 more

Table retrieval is the task of retrieving the most relevant tables from large-scale corpora given natural language queries. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT at...

semantic clustering, weighted fusion, table retrieval, semantic representation, synthetic queries, +3 more

Jan 22, 2026 · 9

A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli +1 more

We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semanti...

FastText embeddings, Agglomerative Clustering, semantic relationship classification, semantic clusters, downstream tasks, +2 more

Jan 19, 2026 · 3

Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph

Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli +1 more

Neural embeddings have a notorious blind spot: they can't reliably tell synonyms apart from antonyms. Consequently, increasing similarity thresholds often fails to prevent opposites from being grouped together. We've built a large-scale semantic clustering system specifically designed to tackle this...

semantic clustering, neural embeddings, synonymy, antonymy, co-hyponymy, +7 more

Jan 19, 2026 · 3

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Yifei Shen, Yilun Zhao, Justice Ou +2 more

Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful fil...

text-to-SQL, EHR tables, clinical reasoning, multi-table joins, temporal windows, +9 more

Jan 14, 2026 · 4

Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing

Leszek Sliwko, Jolanta Mizeria-Pietraszko

Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes schedule...

Jan 14, 2026

TranslateGemma Technical Report

Mara Finkelstein, Isaac Caswell, Tobias Domhan +17 more

We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixt...

Jan 13, 2026 · 1

Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification

Yahya Masri, Emily Ma, Zifu Wang +2 more

System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited stan...

Jan 12, 2026

A Rising Tide Lifts All Boats: MTQE Rewards for Idioms Improve General Translation Quality

Ishika Agarwal, Zhenlin He, Dhruva Patil +1 more

Non-compositional expressions (e.g., idioms, proverbs, and metaphors) pose significant challenges for neural machine translation systems because their meanings cannot be derived from individual words alone. These expressions encode rich, cultural meaning, and have both figurative and literal meaning...

Jan 9, 2026
