The article addresses the challenges of evaluating large language models, including data contamination, sensitivity to prompts, and the high cost of benchmark creation. The authors propose a new approach that uses lossless data compression to measure how well these models generalize to data published after their training cutoff date. Fourteen large language models were tested on sources such as Wikipedia, news articles, code, arXiv papers, and multi-modal data. The findings suggest that while many models' compression performance degrades significantly on data from after their cutoff date, models such as Mistral and Llama-2 strike a good balance between performance and robustness.
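
For readers unfamiliar with compression-based evaluation, the sketch below illustrates the general idea under a common formulation; it is an illustration, not the authors' code. A language model's log-likelihood on a document bounds the length of a lossless arithmetic code, so a bits-per-byte figure can be tracked over time to see how well the model generalizes beyond its training cutoff. The model name `gpt2` and the `bits_per_byte` helper are placeholders; the paper's exact metric definition and the 14 evaluated models may differ.

```python
# Minimal sketch (not the authors' released code) of a compression-based
# metric: a causal LM's negative log-likelihood over a document upper-bounds
# the length of an arithmetic code, so bits-per-byte serves as a lossless
# compression rate (lower = better generalization on that data).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates 14 larger LLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def bits_per_byte(text: str) -> float:
    """Estimate the model's lossless compression rate on `text`,
    in bits per raw UTF-8 byte."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean cross-entropy (in nats) over the shifted targets,
    # i.e. averaged across (n_tokens - 1) predicted positions.
    n_tokens = enc["input_ids"].numel()
    total_bits = out.loss.item() * (n_tokens - 1) / math.log(2)
    return total_bits / len(text.encode("utf-8"))


if __name__ == "__main__":
    print(bits_per_byte("Lossless compression is a proxy for generalization."))
```

Running a helper like this on documents dated before and after a model's cutoff gives the kind of before/after comparison the article describes, without hand-crafted benchmark questions.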

Publication date: 2 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.00861