The paper discusses challenges in evaluating large language models, including data contamination, prompt sensitivity, and the cost of creating benchmarks. To address these, the authors propose an evaluation method based on lossless data compression that tests how well a model's predictive ability generalizes to data produced after its training cutoff. The study collected test data spanning 83 months (2017-2023), split it at each model's training-data cutoff, and used compression performance on the post-cutoff period to measure generalization to unseen data; the gap between performance on the pre- and post-cutoff periods served as a measure of robustness. The experiments evaluated 14 large language models on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. Results indicate that models such as Mistral and Llama-2 balance performance and robustness well, while many models struggle to generalize on news and code data.

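As a rough illustration of the compression-based metric, the sketch below estimates a causal language model's compression rate in bits per byte from its negative log-likelihood (the rate an arithmetic coder driven by the model would achieve), and reports the pre-/post-cutoff gap as a robustness proxy. This is not the authors' pipeline: the model name, the `bits_per_byte` helper, the placeholder document lists, and the simple gap formula are all illustrative assumptions.

```python
# Minimal sketch of compression-based LM evaluation (assumptions noted above).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model, tokenizer, text: str) -> float:
    """Compression rate of `text` under the model, in bits per UTF-8 byte."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc.input_ids
    with torch.no_grad():
        out = model(input_ids, labels=input_ids)
    # out.loss is the mean cross-entropy in nats per predicted token;
    # convert to total bits over the (seq_len - 1) predicted tokens.
    n_predicted = input_ids.shape[1] - 1
    total_bits = out.loss.item() * n_predicted / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Assumption: any HF causal LM works; Mistral is used here only as an example.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Hypothetical placeholders for documents dated before vs. after the
# model's training-data cutoff.
texts_before = ["Example document written before the model's cutoff."]
texts_after = ["Example document written after the model's cutoff."]

bpb_before = sum(bits_per_byte(model, tokenizer, t) for t in texts_before) / len(texts_before)
bpb_after = sum(bits_per_byte(model, tokenizer, t) for t in texts_after) / len(texts_after)
print(f"generalization (bits/byte on unseen period): {bpb_after:.3f}")
print(f"robustness gap (post minus pre cutoff): {bpb_after - bpb_before:.3f}")
```

Lower bits per byte on the post-cutoff period indicates better generalization, and a smaller gap between the two periods indicates better robustness, mirroring the quantities the paper reports.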

Publication date: 2 Feb 2024
Project Page: not provided
Paper: https://arxiv.org/pdf/2402.00861