The paper presents SUMMaug, a summarization-based data augmentation method for document classification. Although pretrained language models are widely used for natural language understanding tasks, comprehending lengthy text remains difficult because of the data sparseness problem. To address this, the authors propose a simple yet effective approach: summarize the input of each original training example and, where needed, merge the original labels so that they match the summarized input. The resulting examples enable curriculum learning, which improves the model’s ability to understand lengthy texts. Experiments on two datasets confirm the effectiveness of SUMMaug, showing better accuracy and robustness than existing methods.
Publication date: 1 Dec 2023
Project Page: https://github.com/etsurin/summaug
Paper: https://arxiv.org/pdf/2312.00513
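
The augmentation idea described in the summary can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `lead_summary` is a stand-in that keeps the first sentences of a document in place of a real summarization model, the `coarse` label mapping is hypothetical, label merging is treated as optional, and the easy-to-hard ordering in `curriculum_stages` is an assumed reading of how the summarized examples feed curriculum learning.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, int]  # (document text, class label)


def lead_summary(text: str, max_sentences: int = 2) -> str:
    """Stand-in summarizer (assumption): keep the first few sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def summ_augment(
    examples: List[Example],
    summarize: Callable[[str], str] = lead_summary,
    label_map: Optional[Dict[int, int]] = None,
) -> List[Example]:
    """Build pseudo examples: summarized input, optionally merged label."""
    augmented = []
    for text, label in examples:
        # Merge the label only when a mapping is supplied.
        merged = label_map[label] if label_map is not None else label
        augmented.append((summarize(text), merged))
    return augmented


def curriculum_stages(
    examples: List[Example], augmented: List[Example]
) -> List[Tuple[str, List[Example]]]:
    """Assumed easy-to-hard schedule: summaries first, then originals."""
    return [("summarized", augmented), ("original", examples)]


train = [
    ("Great film. The cast shines. The plot drags in the middle.", 4),
    ("Dull and slow. I left early. Nothing worked.", 0),
]
# Hypothetical merge of five fine-grained ratings into three coarse classes.
coarse = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
pseudo = summ_augment(train, label_map=coarse)
stages = curriculum_stages(train, pseudo)
```

In a real setup, `summarize` would wrap an actual summarization model, and a training loop would iterate over `stages` in order, fitting the classifier on each stage in turn.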