TY - GEN
T1 - Multi-tactic distance-based outlier detection
AU - Cao, Lei
AU - Yan, Yizhou
AU - Kuhlman, Caitlin
AU - Wang, Qingyang
AU - Rundensteiner, Elke A.
AU - Eltabakh, Mohamed
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/5/16
Y1 - 2017/5/16
AB - As datasets increase radically in size, highly scalable algorithms leveraging modern distributed infrastructures need to be developed for detecting outliers in massive datasets. In this work, we present DOD, the first distributed distance-based outlier detection approach built on MapReduce-based infrastructure. DOD features a single-pass execution framework that minimizes communication overhead. Furthermore, DOD overturns two fundamental assumptions widely adopted in the distributed analytics literature, namely cardinality-based load balancing and one algorithm for all data. The multi-tactic strategy of DOD achieves a truly balanced workload by taking data characteristics into account during data partitioning, and it assigns the most appropriate algorithm to each partition based on our theoretical cost models established for distinct classes of detection algorithms. Thus, DOD effectively minimizes the end-to-end execution time. Our experimental study confirms the efficiency of DOD and its scalability to terabytes of data, beating the baseline solutions by a factor of 20.
UR - http://www.scopus.com/inward/record.url?scp=85021225790&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2017.143
DO - 10.1109/ICDE.2017.143
M3 - Conference contribution
AN - SCOPUS:85021225790
T3 - Proceedings - International Conference on Data Engineering
SP - 959
EP - 970
BT - Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
PB - IEEE Computer Society
T2 - 33rd IEEE International Conference on Data Engineering, ICDE 2017
Y2 - 19 April 2017 through 22 April 2017
ER -