This article explores the problem of dataset selection for training large-scale models. The authors argue that traditional methods, which filter data based on human notions of quality, often fail to improve model performance and can even hurt it. Instead, they propose framing dataset selection as an optimization problem: select the subset of data that maximizes model performance, taking into account both the learning algorithm and the target tasks. Their results show significant improvements in language model performance on both pre-specified and previously unseen tasks.
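
To make the optimization framing concrete, here is a minimal, hypothetical sketch (not the paper's actual method). It assumes each candidate training example has already been assigned an estimated contribution score toward target-task performance (e.g., from some proxy model); under that simplifying assumption, picking a fixed-size subset reduces to taking the top-k scored examples. All names (`select_subset`, `scores`, `k`) are illustrative.

```python
# Hypothetical sketch: dataset selection as an optimization problem.
# Assumption: a precomputed score per example estimates its contribution
# to target-task performance; the best fixed-size subset under a linear
# proxy is then simply the top-k examples by score.
import numpy as np

def select_subset(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k examples with the highest estimated
    contribution to target-task performance."""
    return np.argsort(scores)[-k:][::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=10_000)   # stand-in for estimated contributions
    chosen = select_subset(scores, k=1_000)
    print(f"Selected {chosen.size} examples; top score = {scores[chosen[0]]:.3f}")
```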


Publication date: 23 Jan 2024
Project Page: ?
Paper: https://arxiv.org/pdf/2401.12926