No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov

Published: February 4, 2026
Authors: 1
Word Count: 3,340

LoRA fine-tuning on synthetic data and prompting with retrieved examples for machine translation into five low-resource Turkic languages.

Abstract

We explore machine translation for five language pairs with Turkic target languages: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, and English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the resulting model weights.
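The idea of prompting with retrieved similar examples can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say which retriever or prompt template was used, so character-trigram Jaccard similarity stands in for the retriever, the function names (`retrieve_similar`, `build_prompt`) are hypothetical, and the target-side strings are placeholders rather than real Chuvash translations.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two n-gram sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_similar(query: str, pool: List[Pair], k: int = 2) -> List[Pair]:
    """Pick the k pool pairs whose source side is most similar to the query.

    Assumption: trigram Jaccard is a stand-in retriever, not the paper's.
    """
    q = char_ngrams(query)
    ranked = sorted(pool, key=lambda p: jaccard(q, char_ngrams(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Pair], src_lang: str, tgt_lang: str) -> str:
    """Format retrieved pairs as few-shot demonstrations, then the query."""
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {query}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Hypothetical parallel pool; target strings are placeholders, not real data.
POOL: List[Pair] = [
    ("The weather is nice today.", "<target sentence 1>"),
    ("I am reading a book.", "<target sentence 2>"),
    ("The weather was cold yesterday.", "<target sentence 3>"),
]

query = "The weather is warm today."
shots = retrieve_similar(query, POOL, k=2)
print(build_prompt(query, shots, "English", "Chuvash"))
```

The resulting string would then be sent to the language model; with an empty `examples` list the same template reduces to a zero-shot prompt.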

Key Takeaways

1. Synthetic data enhances translation for low-resource Turkic languages.
2. Low-Rank Adaptation (LoRA) efficiently fine-tunes models.
3. Prompting techniques improve translation quality with limited data.
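To make the efficiency of LoRA concrete: LoRA freezes a weight matrix W of shape d_out x d_in and trains only a low-rank update B A, where B has shape d_out x r and A has shape r x d_in. The sketch below compares trainable parameter counts for one such matrix. The dimensions and rank are hypothetical values chosen for illustration; they are not taken from the paper's configuration.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full fine-tuning params, LoRA trainable params) for one matrix.

    Full fine-tuning updates every entry of W (d_out * d_in values).
    LoRA trains B (d_out x rank) and A (rank x d_in) and keeps W frozen.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=16)
print(f"full: {full}, LoRA: {lora}, fraction trained: {lora / full}")
```

For these assumed values the adapter trains 32,768 parameters instead of 1,048,576 for that matrix, and the saving grows as the rank shrinks relative to the matrix dimensions.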

Limitations

Synthetic data generation relies on Yandex.Translate.
The synthetic data may introduce biases if it is not carefully filtered.

Keywords

nllb-200, LoRA, synthetic data, chrF++, DeepSeek-V3.2, retrieval-based approaches, zero-shot
