This article presents L3Cube-IndicNews, a multilingual text classification corpus intended as a high-quality dataset for Indian regional languages. It covers news headlines and articles in 10 prominent Indic languages. To handle different document lengths, the corpus is split into three datasets: Short Headlines, Long Document, and Long Paragraph. The work expands the text classification datasets available for Indian regional languages and enables the development of topic classification models for them. The datasets and models are shared publicly for further research.
Publication date: 5 Jan 2024
Project Page: https://github.com/l3cube-pune/indic-nlp
Paper: https://arxiv.org/pdf/2401.02254