TY - GEN
T1 - Towards an end-to-end human-centric data cleaning framework
AU - Rezig, El Kindi
AU - Ouzzani, Mourad
AU - Elmagarmid, Ahmed K.
AU - Aref, Walid G.
AU - Stonebraker, Michael
N1 - Publisher Copyright: © 2019 Association for Computing Machinery.
PY - 2019/7/5
Y1 - 2019/7/5
AB - Data cleaning refers to the process of detecting and fixing errors in data. Human involvement is instrumental at several stages of this process, such as providing rules or validating computed repairs. There is a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop; however, this involvement is usually coupled to the underlying cleaning algorithm. In a real data cleaning pipeline, several data cleaning operations are performed using different tools. High-level reasoning about these tools, when they are combined to repair the data, has the potential to unlock useful use cases for involving humans in the cleaning process. Additionally, we believe there is an opportunity to benefit from recent advances in active learning methods to minimize the effort humans have to spend verifying data items produced by tools or by other humans. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we present opportunities that such a framework could offer and highlight key challenges that need to be addressed to realize this vision. We present a design vision and discuss scenarios that motivate the need for this framework to judiciously assist humans in the cleaning process.
UR - http://www.scopus.com/inward/record.url?scp=85072811207&partnerID=8YFLogxK
U2 - 10.1145/3328519.3329133
DO - 10.1145/3328519.3329133
M3 - Conference contribution
AN - SCOPUS:85072811207
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
BT - Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019
PB - Association for Computing Machinery
T2 - 2019 Workshop on Human-In-the-Loop Data Analytics, HILDA 2019, co-located with SIGMOD 2019
Y2 - 5 July 2019
ER -