TY - GEN
T1 - End-to-end I/O monitoring on a leading supercomputer
AU - Yang, Bin
AU - Ji, Xu
AU - Ma, Xiaosong
AU - Wang, Xiyang
AU - Zhang, Tianyu
AU - Zhu, Xiupeng
AU - El-Sayed, Nosayba
AU - Lan, Haidong
AU - Yang, Yibo
AU - Zhai, Jidong
AU - Liu, Weiguo
AU - Xue, Wei
N1 - Publisher Copyright: © 2019 by The USENIX Association. All Rights Reserved.
PY - 2019
Y1 - 2019
N2 - This paper presents an effort to overcome the complexities of production system I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40960-node Sunway TaihuLight supercomputer, currently ranked No. 3 in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online+offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification. With its deployment on TaihuLight for around 18 months, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others currently being addressed. In addition, we demonstrate Beacon's generality by its recent extension to monitor interconnection networks, another contention point on supercomputers. Both the Beacon code and part of the collected monitoring data have been released.
UR - http://www.scopus.com/inward/record.url?scp=85076144499&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85076144499
T3 - Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
SP - 379
EP - 394
BT - Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
PB - USENIX Association
T2 - 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019
Y2 - 26 February 2019 through 28 February 2019
ER -