TY - GEN
T1 - BigDansing
T2 - ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
AU - Khayyat, Zuhair
AU - Ilyas, Ihab F.
AU - Jindal, Alekh
AU - Madden, Samuel
AU - Ouzzani, Mourad
AU - Papotti, Paolo
AU - Quiané-Ruiz, Jorge Arnulfo
AU - Tang, Nan
AU - Yin, Si
N1 - Publisher Copyright: Copyright © 2015 ACM.
PY - 2015/5/27
Y1 - 2015/5/27
N2 - Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
UR - http://www.scopus.com/inward/record.url?scp=84949872769&partnerID=8YFLogxK
U2 - 10.1145/2723372.2747646
DO - 10.1145/2723372.2747646
M3 - Conference contribution
AN - SCOPUS:84949872769
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1215
EP - 1230
BT - SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 31 May 2015 through 4 June 2015
ER -