This study explores extending Large Language Models (LLMs) to non-English languages, specifically those written in non-Latin scripts. The proposed approach uses the romanized form of text as an interface to the LLM, hypothesizing that its frequent informal use and its shared tokens with English improve cross-lingual alignment. Experiments on Hindi-to-English translation and sentiment analysis show that romanized text not only significantly improves inference efficiency but also achieves competitive performance with limited pre-training. A novel multi-script prompting approach, which combines romanized and native-script text in the same prompt, shows potential for further improving task performance.
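
To make the romanized-interface idea concrete, here is a minimal Python sketch that romanizes Devanagari text and assembles a multi-script prompt combining both forms. It assumes the `indic_transliteration` package and the ITRANS romanization scheme; the paper's actual transliteration tool and prompt template are not specified here, so these choices are illustrative.

```python
# A minimal sketch of romanization as an LLM interface, assuming the
# indic_transliteration package and ITRANS scheme (the paper's actual
# romanization scheme and prompt wording may differ; both are
# illustrative here).
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate


def romanize_hindi(text: str) -> str:
    """Convert Devanagari Hindi text to a Latin-script (romanized) form."""
    return transliterate(text, sanscript.DEVANAGARI, sanscript.ITRANS)


def build_multiscript_prompt(native_text: str) -> str:
    """Combine native-script and romanized text in one prompt,
    mirroring the multi-script prompting idea described above."""
    romanized = romanize_hindi(native_text)
    return (
        "Translate the following Hindi sentence into English.\n"
        f"Native script: {native_text}\n"
        f"Romanized: {romanized}\n"
        "English:"
    )


if __name__ == "__main__":
    sentence = "यह एक उदाहरण वाक्य है।"  # "This is an example sentence."
    print(build_multiscript_prompt(sentence))
```

Because the romanized form shares subword tokens with English, it typically tokenizes into fewer pieces than the native script, which is the intuition behind the reported efficiency gains.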

Publication date: 26 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.14280