TY - GEN
T1 - Effective and scalable clustering on massive attributed graphs
AU - Yang, Renchi
AU - Shi, Jieming
AU - Yang, Yin
AU - Huang, Keke
AU - Zhang, Shiqi
AU - Xiao, Xiaokui
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/4/19
Y1 - 2021/4/19
N2 - Given a graph G where each node is associated with a set of attributes, and a parameter k specifying the number of output clusters, k-attributed graph clustering (k-AGC) groups nodes in G into k disjoint clusters, such that nodes within the same cluster share similar topological and attribute characteristics, while those in different clusters are dissimilar. This problem is challenging on massive graphs, e.g., with millions of nodes and billions of attribute values. For such graphs, existing solutions either incur prohibitively high costs, or produce clustering results with compromised quality. In this paper, we propose , an efficient approach to k-AGC that yields high-quality clusters with costs linear to the size of the input graph G. The main contributions of are twofold: (i) a novel formulation of the k-AGC problem based on an attributed multi-hop conductance quality measure custom-made for this problem setting, which effectively captures cluster coherence in terms of both topological proximities and attribute similarities, and (ii) a linear-time optimization solver that obtains high quality clusters iteratively, based on efficient matrix operations such as orthogonal iterations, an alternative optimization approach, as well as an initialization technique that significantly speeds up the convergence of in practice. Extensive experiments, comparing 11 competitors on 6 real datasets, demonstrate that consistently outperforms all competitors in terms of result quality measured against ground truth labels, while being up to orders of magnitude faster. In particular, on the Microsoft Academic Knowledge Graph dataset with 265.2 million edges and 1.1 billion attribute values, outputs high-quality results for 5-AGC within 1.68 hours using a single CPU core, while none of the 11 competitors finish within 3 days.
AB - Given a graph G where each node is associated with a set of attributes, and a parameter k specifying the number of output clusters, k-attributed graph clustering (k-AGC) groups nodes in G into k disjoint clusters, such that nodes within the same cluster share similar topological and attribute characteristics, while those in different clusters are dissimilar. This problem is challenging on massive graphs, e.g., with millions of nodes and billions of attribute values. For such graphs, existing solutions either incur prohibitively high costs, or produce clustering results with compromised quality. In this paper, we propose , an efficient approach to k-AGC that yields high-quality clusters with costs linear to the size of the input graph G. The main contributions of are twofold: (i) a novel formulation of the k-AGC problem based on an attributed multi-hop conductance quality measure custom-made for this problem setting, which effectively captures cluster coherence in terms of both topological proximities and attribute similarities, and (ii) a linear-time optimization solver that obtains high quality clusters iteratively, based on efficient matrix operations such as orthogonal iterations, an alternative optimization approach, as well as an initialization technique that significantly speeds up the convergence of in practice. Extensive experiments, comparing 11 competitors on 6 real datasets, demonstrate that consistently outperforms all competitors in terms of result quality measured against ground truth labels, while being up to orders of magnitude faster. In particular, on the Microsoft Academic Knowledge Graph dataset with 265.2 million edges and 1.1 billion attribute values, outputs high-quality results for 5-AGC within 1.68 hours using a single CPU core, while none of the 11 competitors finish within 3 days.
KW - Attributed graph
KW - Graph clustering
KW - Random walk
UR - http://www.scopus.com/inward/record.url?scp=85107980962&partnerID=8YFLogxK
U2 - 10.1145/3442381.3449875
DO - 10.1145/3442381.3449875
M3 - Conference contribution
AN - SCOPUS:85107980962
T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
SP - 3675
EP - 3687
BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PB - Association for Computing Machinery, Inc
T2 - 2021 World Wide Web Conference, WWW 2021
Y2 - 19 April 2021 through 23 April 2021
ER -