TY - GEN
T1 - Data Quality for Machine Learning Tasks
AU - Gupta, Nitin
AU - Mujumdar, Shashank
AU - Patel, Hima
AU - Masuda, Satoshi
AU - Panwar, Naveen
AU - Bandyopadhyay, Sambaran
AU - Mehta, Sameep
AU - Guttula, Shanmukha
AU - Afzal, Shazia
AU - Sharma Mittal, Ruhi
AU - Munigala, Vitobha
N1 - Publisher Copyright: © 2021 Owner/Author.
PY - 2021/8/14
Y1 - 2021/8/14
N2 - The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stages. This necessitates profiling and assessing data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there have been limited efforts towards improving data quality. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps reduce the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for ML applications. Finding data quality issues helps different personas, such as data stewards, data scientists, subject matter experts and machine learning scientists, to get relevant data insights and take remedial actions to rectify any issue. This tutorial surveys all the important data-quality-related approaches for structured, unstructured and spatio-temporal domains discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the interesting work IBM Research is doing in this space.
AB - The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stages. This necessitates profiling and assessing data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there have been limited efforts towards improving data quality. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps reduce the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for ML applications. Finding data quality issues helps different personas, such as data stewards, data scientists, subject matter experts and machine learning scientists, to get relevant data insights and take remedial actions to rectify any issue. This tutorial surveys all the important data-quality-related approaches for structured, unstructured and spatio-temporal domains discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the interesting work IBM Research is doing in this space.
KW - data quality
KW - machine learning
KW - quality metrics
UR - http://www.scopus.com/inward/record.url?scp=85113837999&partnerID=8YFLogxK
U2 - 10.1145/3447548.3470817
DO - 10.1145/3447548.3470817
M3 - Conference contribution
AN - SCOPUS:85113837999
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 4040
EP - 4041
BT - KDD 2021 - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021
Y2 - 14 August 2021 through 18 August 2021
ER -