TY - JOUR
T1 - Horizon
T2 - 47th International Conference on Very Large Data Bases, VLDB 2021
AU - Rezig, El Kindi
AU - Ouzzani, Mourad
AU - Aref, Walid G.
AU - Elmagarmid, Ahmed K.
AU - Mahmood, Ahmed R.
AU - Stonebraker, Michael
N1 - Publisher Copyright: © 2021, VLDB Endowment. All rights reserved.
PY - 2021
Y1 - 2021
N2 - A large class of data repair algorithms relies on integrity constraints to detect and repair errors. A well-studied class of constraints is Functional Dependencies (FDs, for short). Although there has been increased interest in developing general data cleaning systems for a myriad of data errors, scalability has been left behind. This is because current systems assume data cleaning is performed offline and in one iteration. However, developing data science pipelines is highly iterative and requires efficient cleaning techniques that scale to millions of records in seconds or minutes, not days. In our efforts to re-think the data cleaning stack and bring it to the era of data science, we introduce Horizon, an end-to-end FD repair system that addresses two key challenges: (1) Accuracy: Most existing FD repair techniques aim to produce repairs that minimize changes to the data, which may lead to incorrect combinations of attribute values (or patterns). Horizon leverages the interaction between the data patterns induced by the various FDs and selects repairs that preserve the most frequent patterns found in the original data, leading to better repair accuracy. (2) Scalability: Existing data cleaning systems struggle when dealing with large-scale real-world datasets. Horizon features a linear-time repair algorithm that scales to millions of records and is orders of magnitude faster than state-of-the-art cleaning algorithms. A benchmark of Horizon against state-of-the-art cleaning systems on multiple datasets and metrics shows that Horizon consistently outperforms existing techniques in repair quality and scalability.
AB - A large class of data repair algorithms relies on integrity constraints to detect and repair errors. A well-studied class of constraints is Functional Dependencies (FDs, for short). Although there has been increased interest in developing general data cleaning systems for a myriad of data errors, scalability has been left behind. This is because current systems assume data cleaning is performed offline and in one iteration. However, developing data science pipelines is highly iterative and requires efficient cleaning techniques that scale to millions of records in seconds or minutes, not days. In our efforts to re-think the data cleaning stack and bring it to the era of data science, we introduce Horizon, an end-to-end FD repair system that addresses two key challenges: (1) Accuracy: Most existing FD repair techniques aim to produce repairs that minimize changes to the data, which may lead to incorrect combinations of attribute values (or patterns). Horizon leverages the interaction between the data patterns induced by the various FDs and selects repairs that preserve the most frequent patterns found in the original data, leading to better repair accuracy. (2) Scalability: Existing data cleaning systems struggle when dealing with large-scale real-world datasets. Horizon features a linear-time repair algorithm that scales to millions of records and is orders of magnitude faster than state-of-the-art cleaning algorithms. A benchmark of Horizon against state-of-the-art cleaning systems on multiple datasets and metrics shows that Horizon consistently outperforms existing techniques in repair quality and scalability.
UR - http://www.scopus.com/inward/record.url?scp=85119699515&partnerID=8YFLogxK
U2 - 10.14778/3476249.3476301
DO - 10.14778/3476249.3476301
M3 - Conference article
AN - SCOPUS:85119699515
SN - 2150-8097
VL - 14
SP - 2546
EP - 2554
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
Y2 - 16 August 2021 through 20 August 2021
ER -