Fluent Alignment with Disfluent Judges: Post-Training for Lower-Resource Languages

Abstract

We propose a post-training method for lower-resource languages that preserves the fluency of language models even when they are aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this gap, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that on-policy training is crucial: it outperforms both alternatives without relying on any hard-to-obtain data.

Cite

Text

Samuel et al. "Fluent Alignment with Disfluent Judges: Post-Training for Lower-Resource Languages." International Conference on Learning Representations, 2026.

BibTeX

@inproceedings{samuel2026iclr-fluent,
  title     = {{Fluent Alignment with Disfluent Judges: Post-Training for Lower-Resource Languages}},
  author    = {Samuel, David and Øvrelid, Lilja and Velldal, Erik and Kutuzov, Andrey},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/samuel2026iclr-fluent/}
}