TY - GEN
T1 - Spread-n-share
T2 - 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
AU - Tang, Xiongchao
AU - Wang, Haojie
AU - Ma, Xiaosong
AU - El-Sayed, Nosayba
AU - Zhai, Jidong
AU - Chen, Wenguang
AU - Aboulnaga, Ashraf
N1 - Publisher Copyright: © 2019 ACM.
PY - 2019/11/17
Y1 - 2019/11/17
AB - Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often causes self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottlenecks, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
UR - http://www.scopus.com/inward/record.url?scp=85076128948&partnerID=8YFLogxK
U2 - 10.1145/3295500.3356152
DO - 10.1145/3295500.3356152
M3 - Conference contribution
AN - SCOPUS:85076128948
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2019
PB - IEEE Computer Society
Y2 - 17 November 2019 through 22 November 2019
ER -