This article introduces CroissantLLM, a 1.3 billion parameter language model trained on 3 trillion English and French tokens. It is a high-performing, fully open-source bilingual model that runs on consumer-grade local hardware. The model was trained with a 1:1 English-to-French pre-training data ratio, a custom tokenizer, and bilingual fine-tuning datasets. To assess performance, the authors created FrenchBench, a novel benchmark comprising a variety of classification and generation tasks. They also released codebases, training checkpoints, fine-tuned Chat models, and strong translation models. The model satisfies 81% of the transparency criteria of the Foundation Model Transparency Index, outperforming most open initiatives. This work aims to enrich the NLP landscape and deepen the understanding of multilingualism in language models.

Publication date: 1 Feb 2024
Project Page: https://arxiv.org/abs/2402.00786
Paper: https://arxiv.org/pdf/2402.00786