This paper presents two lightweight models for crowd counting, the task of estimating the number of people in a crowd from an image or a video. The models, ASFNet-S and ASFNet-B, use MobileNet and MobileViT backbones, respectively, and incorporate an adjacent feature fusion technique to extract features at diverse scales from a pre-trained model. They offer improved performance while remaining compact and efficient. Compared with state-of-the-art methods, the models give comparable results at a lower computational cost. The paper also includes a comparative study, an extensive ablation study, and pruning experiments to demonstrate the effectiveness of the models.
Publication date: 12 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.05968
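
The adjacent feature fusion idea mentioned in the summary can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the summary does not specify the fusion operator, so the code below assumes nearest-neighbour 2x upsampling followed by element-wise addition, applied between neighbouring scales of a single-channel feature pyramid. The names `upsample2x`, `fuse_adjacent`, and `fuse_pyramid` are hypothetical and chosen only for illustration.

```python
from typing import List

# One channel, H x W, as nested lists. Real backbones produce multi-channel
# tensors; a single channel keeps the illustration short.
FeatureMap = List[List[float]]


def upsample2x(fmap: FeatureMap) -> FeatureMap:
    """Nearest-neighbour 2x upsampling: each value becomes a 2x2 block."""
    out: FeatureMap = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_adjacent(fine: FeatureMap, coarse: FeatureMap) -> FeatureMap:
    """Fuse two neighbouring scales (assumed operator: upsample, then add)."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("coarse map must be exactly half the size of the fine map")
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fine, up)]


def fuse_pyramid(levels: List[FeatureMap]) -> FeatureMap:
    """Fuse a fine-to-coarse list of maps, one adjacent pair at a time."""
    fused = levels[-1]
    for finer in reversed(levels[:-1]):
        fused = fuse_adjacent(finer, fused)
    return fused
```

Starting from the coarsest level and working towards the finest means every output location combines information from all scales, while each step only touches two neighbouring levels. A learned variant would typically replace the plain addition with convolutions, which this sketch leaves out.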