The study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, particularly those written in non-Latin scripts. It proposes an innovative approach that uses the romanized form of text as an interface for LLMs, hypothesizing that romanization's frequent informal use and its token overlap with English can improve cross-lingual alignment. Focusing on Hindi, the authors show through Hindi-to-English translation and sentiment analysis tasks that romanized text not only markedly improves inference efficiency, owing to its lower fertility (tokens per word) compared to native-script text, but also achieves competitive performance with limited pre-training. The study also introduces a novel multi-script prompting approach that combines romanized and native text, showing promise for further gains in task performance.
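The efficiency claim rests on "fertility", the average number of tokens a tokenizer produces per word. The sketch below is a hedged illustration, not the paper's measurement: it approximates a byte-level tokenizer by counting UTF-8 bytes (Devanagari characters occupy 3 bytes each, Latin letters 1), and the example sentences are invented, not drawn from the paper's data.

```python
def fertility(text: str) -> float:
    """Tokens per whitespace-delimited word.

    Uses UTF-8 byte count as a stand-in for a byte-fallback tokenizer's
    token count; a real subword tokenizer would give different absolute
    numbers but the same ordering for scripts with poor vocabulary coverage.
    """
    words = text.split()
    n_tokens = len(text.encode("utf-8"))
    return n_tokens / len(words)

# Illustrative sentence pair (an assumption, not from the paper):
native = "मुझे किताबें पढ़ना पसंद है"            # Devanagari script
romanized = "mujhe kitabein padhna pasand hai"  # romanized form

print(f"native fertility:    {fertility(native):.1f}")
print(f"romanized fertility: {fertility(romanized):.1f}")
```

Because the romanized form yields far fewer (proxy) tokens per word, each inference pass processes shorter sequences, which is the source of the efficiency gain the study reports.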


Publication date: 26 Jan 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2401.14280