The paper discusses challenges in evaluating large language models, including data contamination, prompt sensitivity, and the cost of creating benchmarks. To address these, the authors propose an evaluation method based on lossless data compression, which tests how well a model's predictive ability generalizes to data produced after its training cutoff. The study collected test data spanning 83 months (2017-2023), split it at each model's training-data cutoff, and used compression performance on the post-cutoff period to measure generalization to unseen data, with the performance gap between the pre- and post-cutoff periods serving as a measure of robustness. The experiments evaluated 14 large language models on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. Results indicated that models such as Mistral and Llama-2 balance performance and robustness well, while many models struggle to generalize on news and code data.
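The core quantity behind this kind of evaluation is the compression rate implied by a model's next-token predictions: under arithmetic coding, encoding a document costs its negative log-likelihood in bits, usually normalized to bits-per-byte. The sketch below is a minimal illustration of that idea, assuming per-token log-probabilities (in nats) are already available from the model; the function names and the relative-gap formula are illustrative and not taken from the paper's code.

```python
import math

def bits_per_byte(token_logprobs, num_bytes):
    """Compression rate implied by a model's next-token predictions.

    Under arithmetic coding, encoding a token costs -log2 p(token) bits,
    so the total negative log-likelihood (converted from nats to bits)
    divided by the raw size in bytes gives bits-per-byte; lower is better.
    """
    total_bits = -sum(token_logprobs) / math.log(2)  # nats -> bits
    return total_bits / num_bytes

def generalization_gap(bpb_train_period, bpb_test_period):
    """Relative degradation from the pre-cutoff to the post-cutoff period,
    used here as a simple robustness proxy (a smaller gap is more robust)."""
    return (bpb_test_period - bpb_train_period) / bpb_train_period

# Illustrative numbers only, not results from the paper:
bpb_train = bits_per_byte([-2.1, -0.4, -1.3, -0.9], num_bytes=16)
bpb_test = bits_per_byte([-2.8, -0.6, -1.7, -1.2], num_bytes=16)
print(f"train BPB={bpb_train:.3f}  test BPB={bpb_test:.3f}  "
      f"gap={generalization_gap(bpb_train, bpb_test):+.1%}")
```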
Publication date: 2 Feb 2024
Project Page: not provided
Paper: https://arxiv.org/pdf/2402.00861