The paper addresses the challenge of extending Large Language Models (LLMs) to non-English languages, especially those written in non-Latin scripts. The authors propose using the romanized form of text as an interface to LLMs, hypothesizing that romanized text's frequent informal use and its token overlap with English improve cross-lingual alignment. Focusing on Hindi, the study demonstrates that romanized text not only improves inference efficiency (romanized input tokenizes into shorter sequences) but also achieves performance competitive with native-script text after limited pre-training. The paper also introduces a novel multi-script prompting approach that combines romanized and native-script text in a single prompt, showing promising gains in task performance (illustrated in the sketch below). The findings suggest that romanization can help bridge the language gap in LLM applications.
Publication date: 26 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.14280
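The multi-script prompting idea lends itself to a short illustration. The sketch below pairs a Devanagari sentence with a romanized rendering in one prompt; the `indic_transliteration` package, the ITRANS scheme, and the prompt template are stand-in assumptions, since the paper's actual romanization pipeline and prompt format are not given in this summary.

```python
# Hypothetical illustration of multi-script prompting: pair native Devanagari
# text with a romanized rendering in a single prompt. The indic_transliteration
# package and the ITRANS scheme are stand-ins; the paper's own romanization
# pipeline and prompt template are not specified in this summary.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate


def multi_script_prompt(native_text: str, instruction: str) -> str:
    """Build a prompt that presents both the native and romanized script."""
    # Deterministic rule-based Devanagari -> Latin transliteration.
    romanized = transliterate(native_text, sanscript.DEVANAGARI, sanscript.ITRANS)
    return (
        f"{instruction}\n\n"
        f"Text (Devanagari): {native_text}\n"
        f"Text (romanized): {romanized}\n"
        f"Answer:"
    )


if __name__ == "__main__":
    print(multi_script_prompt(
        "भारत एक विशाल देश है।",
        "Translate the following Hindi sentence into English.",
    ))
```

Because the romanized line shares subword tokens with English, it tends to tokenize into fewer pieces than the Devanagari line, which is the efficiency effect the summary describes.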