The paper presents L3Cube-IndicNews, a multilingual text classification corpus of news headlines and articles in ten Indian regional languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. The datasets cover different document lengths and keep a consistent label set across them, which enables in-depth length-based analysis. They are evaluated with monolingual BERT models, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. The work adds to the text classification datasets available for Indian regional languages and supports the development of topic classification models for them.

Publication date: 5 Jan 2024
Project Page: https://github.com/l3cube-pune/indic-nlp
Paper: https://arxiv.org/pdf/2401.02254