The paper presents the Aya Dataset, a human-curated multilingual instruction-following dataset spanning 65 languages. The researchers worked with fluent speakers from around the world to collect natural instances of instructions and completions. The work aims to bridge the language gap in AI, specifically in instruction fine-tuning (IFT). Four resources have been developed and open-sourced: the Aya Dataset, the Aya Collection, the Aya Evaluation Suite, and the Aya Annotation Platform. The authors describe the Aya Collection as the most extensive multilingual collection to date. The initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries.

Publication date: 12 Feb 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2402.06619