This technical report presents the multilingual E5 text embedding models released by Microsoft. The models are first contrastively pre-trained on roughly one billion multilingual text pairs, then fine-tuned on a combination of labeled datasets. The report also introduces an instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar size. Evaluated on the MTEB benchmark and the MIRACL multilingual retrieval benchmark, the models show competitive performance; a usage sketch follows the links below.
Publication date: 8 Feb 2024
Project Page: https://github.com/microsoft/unilm/tree/master/e5
Paper: https://arxiv.org/pdf/2402.05672
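Since the summary only describes the models at a high level, here is a minimal sketch of how the released checkpoints are typically queried, assuming the intfloat/multilingual-e5-base checkpoint on the Hugging Face Hub and the "query: " / "passage: " input prefixes documented in the model cards linked from the project page:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then mean-pool the token embeddings.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# E5 models expect a "query: " or "passage: " prefix on every input text.
input_texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average protein requirement "
    "for women ages 19 to 70 is 46 grams per day.",
]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

batch = tokenizer(input_texts, max_length=512, padding=True,
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the query and the passage.
score = embeddings[0] @ embeddings[1]
print(score.item())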
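```

For the instruction-tuned variant (intfloat/multilingual-e5-large-instruct), the model card instead prescribes prefixing each query with "Instruct: {task_description}\nQuery: " while leaving passages unprefixed; the pooling and normalization steps stay the same.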