This article summarizes a study that extends the capabilities of large language models (LLMs) to non-English languages. The study proposes using the romanized form of text as an interface to LLMs, with Hindi as the case study. The researchers find that romanized text not only improves inference efficiency significantly but also achieves competitive task performance with limited continued pre-training. The study also introduces a multi-script prompting approach that combines romanized and native-script text, which shows promise in further improving task performance. The findings suggest that romanization can help bridge the language gap for LLM applications.
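The romanization interface described above can be illustrated with a toy sketch. This is not the paper's pipeline (production systems use full transliteration schemes and tooling); the tiny character map and the prompt format below are assumptions for demonstration only, covering just one word.

```python
# Naive Devanagari-to-Latin transliteration sketch (illustrative only).
# A consonant carries an inherent 'a'; a vowel sign (matra) replaces it,
# and the virama (halant) deletes it. The map covers only the demo word.

CONSONANTS = {"न": "na", "म": "ma", "स": "sa", "त": "ta"}
VOWEL_SIGNS = {"े": "e"}   # matra: replaces the inherent 'a'
VIRAMA = "्"               # halant: deletes the inherent 'a'

def romanize(text: str) -> str:
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWEL_SIGNS and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1] + VOWEL_SIGNS[ch]
        elif ch == VIRAMA and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1]
        else:
            out.append(ch)  # pass through anything unmapped
    return "".join(out)

native = "नमस्ते"
roman = romanize(native)  # "namaste"

# A multi-script prompt in the spirit of the paper: show both forms together
# so the model can exploit whichever script it handles best.
prompt = f"Native: {native}\nRomanized: {roman}\nTranslate to English:"
```

Romanized Hindi tends to tokenize into far fewer subwords under English-centric tokenizers, which is the intuition behind the reported inference-efficiency gains.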

Publication date: 26 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.14280