TY - GEN
T1 - Automatic Assessment of Quality of your Data for AI
AU - Patel, Hima
AU - Gupta, Nitin
AU - Panwar, Naveen
AU - Sharma Mittal, Ruhi
AU - Mehta, Sameep
AU - Guttula, Shanmukha
AU - Mujumdar, Shashank
AU - Afzal, Shazia
AU - Bedathur, Srikanta
AU - Munigala, Vitobha
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/1/8
Y1 - 2022/1/8
N2 - The saying "Garbage In, Garbage Out" resonates perfectly within the machine learning and artificial intelligence community. While there has been considerable ongoing effort to improve the quality of models, there has been relatively little focus on systematically analysing the quality of data with respect to its efficacy for machine learning. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps reduce the effort a data scientist spends on iteratively debugging the ML pipeline to improve model performance. In this tutorial, we emphasize the importance of data quality and its associated challenges, and highlight the importance of analysing data quality in terms of its value for machine learning applications. We will survey important data-centric approaches to improving data quality and the ML pipeline. We will also focus on the intuition behind them, highlight their strengths and similarities, and illustrate their applicability to real-world problems. As part of the hands-on session, we first provide an overview of available data quality analysis tools such as Pandas Profiling, Amazon Deequ, and IBM's Data Quality for AI. We will then showcase in detail how end users can assess the quality of their structured (tabular) data using one of the available tools.
AB - The saying "Garbage In, Garbage Out" resonates perfectly within the machine learning and artificial intelligence community. While there has been considerable ongoing effort to improve the quality of models, there has been relatively little focus on systematically analysing the quality of data with respect to its efficacy for machine learning. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps reduce the effort a data scientist spends on iteratively debugging the ML pipeline to improve model performance. In this tutorial, we emphasize the importance of data quality and its associated challenges, and highlight the importance of analysing data quality in terms of its value for machine learning applications. We will survey important data-centric approaches to improving data quality and the ML pipeline. We will also focus on the intuition behind them, highlight their strengths and similarities, and illustrate their applicability to real-world problems. As part of the hands-on session, we first provide an overview of available data quality analysis tools such as Pandas Profiling, Amazon Deequ, and IBM's Data Quality for AI. We will then showcase in detail how end users can assess the quality of their structured (tabular) data using one of the available tools.
UR - http://www.scopus.com/inward/record.url?scp=85122670773&partnerID=8YFLogxK
U2 - 10.1145/3493700.3493774
DO - 10.1145/3493700.3493774
M3 - Conference contribution
AN - SCOPUS:85122670773
T3 - ACM International Conference Proceeding Series
SP - 354
EP - 357
BT - CODS-COMAD 2022 - Proceedings of the 5th Joint International Conference on Data Science and Management of Data (9th ACM IKDD CODS and 27th COMAD)
PB - Association for Computing Machinery
T2 - 5th ACM India Joint 9th ACM IKDD Conference on Data Science and 27th International Conference on Management of Data, CODS-COMAD 2022
Y2 - 7 January 2022 through 10 January 2022
ER -