Natural Language Processing

No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov
Published: February 4, 2026
Authors: 1
Word Count: 3,340

Synthetic data, LoRA fine-tuning, and retrieval-based prompting improve machine translation into five low-resource Turkic languages.

Abstract

We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
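
The abstract mentions prompting DeepSeek-V3.2 with retrieved similar examples but does not say how the examples are retrieved. The following is a minimal sketch of the general idea, assuming a character n-gram overlap as the similarity measure; the function names, the prompt layout, and the placeholder targets are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 0)))


def similarity(a, b, n=3):
    """Dice overlap of character n-gram multisets, in [0, 1].

    Assumption: the paper may use a different retriever (e.g. embeddings).
    """
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    overlap = sum((ca & cb).values())  # multiset intersection
    return 2.0 * overlap / total


def retrieve(query, pairs, k=2):
    """Return the k (source, target) pairs whose source is most similar to query."""
    ranked = sorted(pairs, key=lambda p: similarity(query, p[0]), reverse=True)
    return ranked[:k]


def build_prompt(query, pairs, src_lang, tgt_lang, k=2):
    """Assemble a few-shot translation prompt from the retrieved pairs."""
    blocks = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in retrieve(query, pairs, k):
        blocks.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    blocks.append(f"{src_lang}: {query}\n{tgt_lang}:")
    return "\n\n".join(blocks)


# Placeholder targets stand in for real reference translations.
EXAMPLES = [
    ("The cat sleeps on the sofa.", "<Chuvash translation 1>"),
    ("Stock prices fell sharply.", "<Chuvash translation 2>"),
    ("The dog sleeps on the floor.", "<Chuvash translation 3>"),
]

print(build_prompt("The cat sleeps on the floor.", EXAMPLES, "English", "Chuvash"))
```

The resulting string would then be sent to the language model. Setting `k=0` drops the retrieved examples and gives a zero-shot prompt with the same layout.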

Key Takeaways

  1. Synthetic data enhances translation for low-resource Turkic languages.

  2. Low-Rank Adaptation (LoRA) efficiently fine-tunes models.

  3. Prompting techniques improve translation quality with limited data.
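
To make the efficiency claim for LoRA concrete, here is a small sketch of the standard LoRA update, y = W x + (alpha / r) B (A x), where W stays frozen and only the low-rank matrices A and B are trained. The layer dimensions, rank, and alpha below are illustrative assumptions, not the settings used in the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r (trainable).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    return d_out * d_in, rank * (d_in + d_out)


# B starts at zero in LoRA, so the adapted layer initially matches the base layer.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
x = [2, 3]
print(lora_forward(W, A, [[0], [0]], x, alpha=2))
print(lora_forward(W, A, [[1], [2]], x, alpha=2))

# Illustrative sizes only: a 1024 x 1024 matrix with rank 8.
full, lora = lora_param_counts(1024, 1024, 8)
print(full, lora, full // lora)
```

For these illustrative sizes the adapter trains 64 times fewer parameters than full fine-tuning of the same matrix, which is where the efficiency comes from.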

Limitations

  • Relies on Yandex.Translate for synthetic data generation.

  • Synthetic data may introduce biases if not carefully filtered.

Keywords

nllb-200, LoRA, synthetic data, chrF++, DeepSeek-V3.2, retrieval-based approaches, zero-shot
