The paper addresses the challenge of extending Large Language Models (LLMs) to non-English languages, particularly those written in non-Latin scripts. The authors propose using romanized text as an interface for LLMs, hypothesizing that its frequent informal use and its token overlap with English improve cross-lingual alignment. Focusing on Hindi, the study demonstrates that romanized text not only improves inference efficiency but also achieves competitive performance with limited pre-training. The paper also introduces a novel multi-script prompting approach that combines romanized and native-script text, showing promising gains in task performance. The findings suggest that romanization can help bridge the language gap for LLM applications.
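
To make the inference-efficiency claim concrete, the sketch below compares how many tokens an English-centric tokenizer needs for a Hindi sentence in Devanagari versus its romanized form. The specific tools here (the `indic_transliteration` library with the ASCII Harvard-Kyoto scheme, and the GPT-2 tokenizer) are illustrative assumptions, not necessarily what the authors used.

```python
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
from transformers import AutoTokenizer

native = "भारत एक विशाल देश है"  # "India is a vast country"
# Harvard-Kyoto is an ASCII romanization scheme; a stand-in for the
# paper's transliteration tool, not necessarily the one it uses.
roman = transliterate(native, sanscript.DEVANAGARI, sanscript.HK)

tok = AutoTokenizer.from_pretrained("gpt2")
print(f"native:    {len(tok.tokenize(native))} tokens")  # Devanagari falls back to many byte-level pieces
print(f"romanized: {len(tok.tokenize(roman))} tokens")   # Latin-script text reuses English subwords
```

Fewer tokens per sentence means shorter sequences at inference time, which is the efficiency advantage the paper attributes to romanization.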
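The multi-script prompting idea can likewise be illustrated with a simple template that presents the input in both scripts; the wording below is a hypothetical layout, not the paper's actual prompt.

```python
def multi_script_prompt(task: str, native: str, romanized: str) -> str:
    """Hypothetical multi-script prompt: expose the input in both scripts so
    the model can draw on whichever representation it aligns with better."""
    return (
        f"{task}\n"
        f"Input (native script): {native}\n"
        f"Input (romanized): {romanized}\n"
        f"Answer:"
    )
```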

Publication date: 26 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.14280