This paper examines the adaptation of multilingual encoders to Swiss German, a language with substantial dialectal variation and limited public training data. Using continued pre-training on Swiss German data, a modular adaptation reached 97.5% of the performance of full monolithic adaptation. A character-level model proved most effective for retrieving Swiss German sentences given Standard German queries; however, a joint modular and tokenization-free approach underperformed either individual strategy. The researchers have made their code and models available for further study and application.
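The retrieval task mentioned above can be illustrated with a minimal sketch: given sentence embeddings produced by any encoder, Swiss German candidates are ranked by cosine similarity to a Standard German query embedding. The sentences and embedding vectors below are hypothetical toy values standing in for actual encoder outputs, not the paper's data or method.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, candidates):
    """Rank candidate sentences by similarity to the query embedding."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked]

# Toy embeddings standing in for encoder outputs (hypothetical values).
query = [0.9, 0.1, 0.2]  # Standard German query, e.g. "Wie geht es dir?"
candidates = {
    "Wie gaht's dir?": [0.85, 0.15, 0.25],  # close Swiss German paraphrase
    "Es schneit hüt.": [0.10, 0.90, 0.30],  # unrelated sentence
}
print(retrieve(query, candidates)[0])  # prints "Wie gaht's dir?"
```

In the paper's evaluation, a character-level encoder supplied the embeddings; any sentence encoder could be plugged into this ranking step.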
Publication date: 25 Jan 2024
Project Page: https://github.com/ZurichNLP/swiss-german-text-encoders
Paper: https://arxiv.org/pdf/2401.14400