TY - GEN
T1 - Random Walks on Huge Graphs at Cache Efficiency
AU - Yang, Ke
AU - Ma, Xiaosong
AU - Thirumuruganathan, Saravanan
AU - Chen, Kang
AU - Wu, Yongwei
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/10/26
Y1 - 2021/10/26
N2 - Data-intensive applications dominated by random accesses to large working sets fail to utilize the computing power of modern processors. Graph random walk, an indispensable workhorse for many important graph processing and learning applications, is one prominent case of such applications. Existing graph random walk systems are currently unable to match the GPU-side node embedding training speed. This work reveals that existing approaches fail to effectively utilize the modern CPU memory hierarchy, due to the widely held assumption that the inherent randomness in random walks and the skewed nature of graphs render most memory accesses random. We demonstrate that there is actually plenty of spatial and temporal locality to harvest, by careful partitioning, rearranging, and batching of operations. The resulting system, FlashMob, improves both cache and memory bandwidth utilization by making memory accesses more sequential and regular. We also find that a classical combinatorial optimization problem (and its exact pseudo-polynomial solution) can be applied to complex decision making, for accurate yet efficient data/task partitioning. Our comprehensive experiments over diverse graphs show that our system achieves an order of magnitude performance improvement over the fastest existing system. It processes a 58GB real graph at higher per-step speed than the existing system on a 600KB toy graph fitting in the L2 cache.
AB - Data-intensive applications dominated by random accesses to large working sets fail to utilize the computing power of modern processors. Graph random walk, an indispensable workhorse for many important graph processing and learning applications, is one prominent case of such applications. Existing graph random walk systems are currently unable to match the GPU-side node embedding training speed. This work reveals that existing approaches fail to effectively utilize the modern CPU memory hierarchy, due to the widely held assumption that the inherent randomness in random walks and the skewed nature of graphs render most memory accesses random. We demonstrate that there is actually plenty of spatial and temporal locality to harvest, by careful partitioning, rearranging, and batching of operations. The resulting system, FlashMob, improves both cache and memory bandwidth utilization by making memory accesses more sequential and regular. We also find that a classical combinatorial optimization problem (and its exact pseudo-polynomial solution) can be applied to complex decision making, for accurate yet efficient data/task partitioning. Our comprehensive experiments over diverse graphs show that our system achieves an order of magnitude performance improvement over the fastest existing system. It processes a 58GB real graph at higher per-step speed than the existing system on a 600KB toy graph fitting in the L2 cache.
KW - cache
KW - graph computing
KW - memory
KW - random walk
UR - http://www.scopus.com/inward/record.url?scp=85119091657&partnerID=8YFLogxK
U2 - 10.1145/3477132.3483575
DO - 10.1145/3477132.3483575
M3 - Conference contribution
AN - SCOPUS:85119091657
T3 - SOSP 2021 - Proceedings of the 28th ACM Symposium on Operating Systems Principles
SP - 311
EP - 326
BT - SOSP 2021 - Proceedings of the 28th ACM Symposium on Operating Systems Principles
PB - Association for Computing Machinery, Inc
T2 - 28th ACM Symposium on Operating Systems Principles, SOSP 2021
Y2 - 26 October 2021 through 29 October 2021
ER -