The paper presents a systematic study of example difficulty scores, which are used in dataset pruning and defect identification. The authors analyze how consistent these scores are across training runs, scoring methods, and model architectures. They find that the scores are noisy over individual training runs, that different scoring methods are strongly correlated with a single underlying notion of difficulty, and that examples range from highly sensitive to insensitive to the inductive biases of particular model architectures. They also propose a simple method for fingerprinting model architectures using a small set of such sensitive examples. The findings guide practitioners in maximizing the consistency of their scores and provide comprehensive baselines for evaluating future scores.
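As a rough illustration of the consistency analysis, the sketch below (not the authors' code) estimates run-to-run agreement of a difficulty score by computing pairwise Spearman rank correlations between per-example scores from several independent training runs. The synthetic scores, variable names, and noise level are all illustrative assumptions.

```python
# Minimal sketch: quantify run-to-run consistency of example difficulty scores
# via pairwise Spearman rank correlation. The data here is synthetic and the
# setup is an assumption, not the paper's actual pipeline.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_examples, n_runs = 1000, 5
# Stand-in for real per-example difficulty scores (e.g., learning speed,
# forgetting counts) measured once per training run: a shared underlying
# difficulty signal plus independent run-to-run noise.
true_difficulty = rng.normal(size=n_examples)
scores = true_difficulty[:, None] + 0.5 * rng.normal(size=(n_examples, n_runs))

# Pairwise Spearman correlation between runs: high values mean the score ranks
# examples consistently; low values indicate noisy single-run estimates.
pairwise = [
    spearmanr(scores[:, i], scores[:, j]).correlation
    for i in range(n_runs)
    for j in range(i + 1, n_runs)
]
print(f"mean pairwise Spearman correlation: {np.mean(pairwise):.3f}")

# Averaging scores over several runs is a standard way to reduce this noise.
mean_score = scores.mean(axis=1)
```

The same correlation matrix computed across different scoring methods, rather than different seeds, would probe whether the methods agree on a single notion of difficulty, which is the kind of comparison the paper reports.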
Publication date: 4 Jan 2024
Project Page: https://arxiv.org/abs/2401.01867v1
Paper: https://arxiv.org/pdf/2401.01867