TY - GEN
T1 - Building Data Civilizer pipelines with an advanced workflow engine
AU - Mansour, Essam
AU - Deng, Dong
AU - Fernandez, Raul Castro
AU - Qahtan, Abdulhakim A.
AU - Tao, Wenbo
AU - Abedjan, Ziawasch
AU - Elmagarmid, Ahmed
AU - Ilyas, Ihab F.
AU - Madden, Samuel
AU - Ouzzani, Mourad
AU - Stonebraker, Michael
AU - Tang, Nan
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/10/24
Y1 - 2018/10/24
N2 - In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.
AB - In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.
KW - Data Cleaning
KW - Data Discovery
KW - Data Integration
UR - http://www.scopus.com/inward/record.url?scp=85057102546&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2018.00184
DO - 10.1109/ICDE.2018.00184
M3 - Conference contribution
AN - SCOPUS:85057102546
T3 - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
SP - 1593
EP - 1596
BT - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 34th IEEE International Conference on Data Engineering, ICDE 2018
Y2 - 16 April 2018 through 19 April 2018
ER -