This paper introduces HPCClusterScape, a visualization system designed to improve the efficiency and transparency of shared high-performance computing (HPC) clusters used for large-scale AI models. The system presents both system-level and application-level information in a single overview, and supports customizable violation rules for identifying issues and improving resource utilization. It also includes diagnostic tools for investigating workload imbalances and synchronization bottlenecks in large-scale distributed deep learning experiments. The paper discusses the challenges and prerequisites for efficient HPC operation and describes how the visualization system addresses them.

Publication date: 5 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.02120