FAHES: A robust disguised missing values detector

Abdulhakim A. Qahtan*, Ahmed Elmagarmid, Raul Castro Fernandez, Mourad Ouzzani, Nan Tang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

28 Citations (Scopus)

Abstract

Missing values are common in real-world data and may seriously affect data analytics such as simple statistics and hypothesis testing. Generally speaking, there are two types of missing values: explicitly missing values (i.e., NULL values), and implicitly missing values (a.k.a. disguised missing values (DMVs)) such as “11111111" for a phone number and “Some college" for education. While detecting explicitly missing values is trivial, detecting DMVs is not; the essential challenge is the lack of standardization about how DMVs are generated. In this paper, we present FAHES, a robust system for detecting DMVs from two angles: DMVs as detectable outliers and as detectable inliers. For DMVs as outliers, we propose a syntactic outlier detection module for categorical data, and a density-based outlier detection module for numerical values. For DMVs as inliers, we propose a method that detects DMVs which follow either missing-completely-at-random or missing-at-random models. The robustness of FAHES is achieved through an ensemble technique that is inspired by outlier ensembles. Our extensive experiments using real-world data sets show that FAHES delivers better results than existing solutions.

Original languageEnglish
Title of host publicationKDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages2100-2109
Number of pages10
ISBN (Print)9781450355520
DOIs
Publication statusPublished - 19 Jul 2018
Event24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018 - London, United Kingdom
Duration: 19 Aug 201823 Aug 2018

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018
Country/TerritoryUnited Kingdom
CityLondon
Period19/08/1823/08/18

Keywords

  • Disguised Missing Value
  • Numerical Outliers
  • Syntactic Outliers
  • Syntactic Patterns

Fingerprint

Dive into the research topics of 'FAHES: A robust disguised missing values detector'. Together they form a unique fingerprint.

Cite this