TY - GEN
T1 - CoClean
T2 - 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
AU - Musleh, Mashaal
AU - Ouzzani, Mourad
AU - Tang, Nan
AU - Doan, An Hai
N1 - Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/6/14
Y1 - 2020/6/14
N2 - High quality data is crucial for many applications but real-life data is often dirty. Unfortunately, automated solutions are often not trustable and are thus seldom employed in practice. In real-world scenarios, it is often necessary to resort to manual cleaning for obtaining pristine data. Existing human-in-the-loop solutions, such as Trifacta and OpenRefine, typically involve a single user. This is often error-prone, limited to a single-person expertise, and cannot scale with the ever growing volume, variety and veracity of data. We propose a crowd-in-the-loop cleaning system, called CoClean, built on top of Python Pandas dataframe, a widely used library for data scientists. The core of CoClean is a new Python library called Collaborative dataframe (CDF) that allows one to share data represented as a dataframe with other users. CDF is responsible for synchronizing and aggregating annotations obtained from different users. The attendees will have the opportunity to experience the following features: (1) Data Assignment: Given a dataframe, the owner can assign it (or a subset of it) to different users. (2) Supporting both lay and power users: lay users can use a GUI for direct manual cleaning of the data, while power users can work on the assigned data through a Jupyter Notebook where they can write scripts to do batch cleaning. (3) Combining machines and humans: Possible errors and repairs generated by machine algorithms can be highlighted as annotations, which can make the life of users easier for manual cleaning. (4) Collaboration Modes: CoClean supports two modes: blind-on (no user can see the annotations from others) and blind-off.
AB - High quality data is crucial for many applications but real-life data is often dirty. Unfortunately, automated solutions are often not trustable and are thus seldom employed in practice. In real-world scenarios, it is often necessary to resort to manual cleaning for obtaining pristine data. Existing human-in-the-loop solutions, such as Trifacta and OpenRefine, typically involve a single user. This is often error-prone, limited to a single-person expertise, and cannot scale with the ever growing volume, variety and veracity of data. We propose a crowd-in-the-loop cleaning system, called CoClean, built on top of Python Pandas dataframe, a widely used library for data scientists. The core of CoClean is a new Python library called Collaborative dataframe (CDF) that allows one to share data represented as a dataframe with other users. CDF is responsible for synchronizing and aggregating annotations obtained from different users. The attendees will have the opportunity to experience the following features: (1) Data Assignment: Given a dataframe, the owner can assign it (or a subset of it) to different users. (2) Supporting both lay and power users: lay users can use a GUI for direct manual cleaning of the data, while power users can work on the assigned data through a Jupyter Notebook where they can write scripts to do batch cleaning. (3) Combining machines and humans: Possible errors and repairs generated by machine algorithms can be highlighted as annotations, which can make the life of users easier for manual cleaning. (4) Collaboration Modes: CoClean supports two modes: blind-on (no user can see the annotations from others) and blind-off.
KW - data cleaning
KW - data collaboration
KW - data consolidation
KW - data preparation
UR - http://www.scopus.com/inward/record.url?scp=85086237373&partnerID=8YFLogxK
U2 - 10.1145/3318464.3384698
DO - 10.1145/3318464.3384698
M3 - Conference contribution
AN - SCOPUS:85086237373
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 2757
EP - 2760
BT - SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 14 June 2020 through 19 June 2020
ER -