TY - GEN
T1 - Large-scale frequent subgraph mining in MapReduce
AU - Lin, Wenqing
AU - Xiao, Xiaokui
AU - Ghinita, Gabriel
PY - 2014
Y1 - 2014
AB - Mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains, such as bioinformatics, social networks, and computer vision. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphism is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step filter-and-refinement approach that is suitable for massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are locally frequent in its partition. The union of all such candidates is the input to the refinement step, where each candidate is checked against all partitions and only the globally frequent subgraphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs are likely to become globally frequent, and thus reduce the computational overhead of the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experiments on several large real-world graph datasets show that the proposed approach clearly outperforms the existing state of the art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs.
UR - http://www.scopus.com/inward/record.url?scp=84901750960&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2014.6816705
DO - 10.1109/ICDE.2014.6816705
M3 - Conference contribution
AN - SCOPUS:84901750960
SN - 9781479925544
T3 - Proceedings - International Conference on Data Engineering
SP - 844
EP - 855
BT - 2014 IEEE 30th International Conference on Data Engineering, ICDE 2014
PB - IEEE Computer Society
T2 - 30th IEEE International Conference on Data Engineering, ICDE 2014
Y2 - 31 March 2014 through 4 April 2014
ER -