OccGen: Selection of Real-World Multilingual Parallel Data Balanced in Gender Within Occupations

Abstract

This paper describes the OCCGEN toolkit, which extracts multilingual parallel data balanced in gender within occupations. OCCGEN can extract datasets that more fairly reflect gender diversity in society (beyond binary), which can then be used to explicitly mitigate occupational gender stereotypes. We propose two use cases that extract evaluation datasets for machine translation in four high-resource languages from different linguistic families and in a low-resource African language. Our analysis of these use cases shows that translation outputs in high-resource languages tend to be worse on the feminine subsets than on the masculine ones. This can be explained by the model paying less attention to the source sentence and more attention to the target prefix, which leads it to overgeneralize to the most frequent masculine forms.

Cite

Text

Costa-jussà et al. "OccGen: Selection of Real-World Multilingual Parallel Data Balanced in Gender Within Occupations." Neural Information Processing Systems, 2022.

Markdown

[Costa-jussà et al. "OccGen: Selection of Real-World Multilingual Parallel Data Balanced in Gender Within Occupations." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/costajussa2022neurips-occgen/)

BibTeX

@inproceedings{costajussa2022neurips-occgen,
  title     = {{OccGen: Selection of Real-World Multilingual Parallel Data Balanced in Gender Within Occupations}},
  author    = {Costa-jussà, Marta and Basta, Christine and Domingo, Oriol and Rubungo, André},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/costajussa2022neurips-occgen/}
}