Temporal rules discovery for web data cleaning

Ziawasch Abedjan, Cuneyt G. Akcora, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker

Research output: Contribution to journal, Conference article, peer-review

44 Citations (Scopus)

Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules, or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.
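
To illustrate why the duration matters, the following is a minimal sketch of checking a rule of the form "an entity has one value within a time window". It is not the discovery algorithm from the paper; it only shows how a rule with a duration is applied once the duration is known. The function name `find_violations`, the `(entity, value, timestamp)` fact layout, the 6-hour window, and the sample facts are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations


def find_violations(facts, window):
    """Return pairs of facts about the same entity that carry different
    values and whose timestamps are less than `window` apart."""
    by_entity = defaultdict(list)
    for entity, value, ts in facts:
        by_entity[entity].append((ts, value))

    violations = []
    for entity, items in by_entity.items():
        items.sort()  # chronological order, so t2 >= t1 in every pair below
        for (t1, v1), (t2, v2) in combinations(items, 2):
            if v1 != v2 and t2 - t1 < window:
                violations.append((entity, (v1, t1), (v2, t2)))
    return violations


# Hypothetical extracted facts: (person, reported location, report time).
facts = [
    ("person_a", "Paris", datetime(2015, 3, 1, 9)),
    ("person_a", "Tokyo", datetime(2015, 3, 1, 11)),
    ("person_a", "Tokyo", datetime(2015, 3, 5, 10)),
]

# Temporal rule with an assumed 6-hour duration: one conflicting pair.
temporal = find_violations(facts, timedelta(hours=6))

# Same rule with no effective time bound, i.e. a plain dependency
# person -> location: the report from four days later is flagged as well.
untimed = find_violations(facts, timedelta.max)

print(len(temporal), len(untimed))
```

With the window in place, only the two reports made two hours apart conflict. Without it, the legitimate report from four days later is also flagged, which is the kind of false alarm the abstract attributes to rules that ignore time.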

Original language: English
Pages (from-to): 336-347
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 9
Issue number: 4
Publication status: Published - 2016
Event: 42nd International Conference on Very Large Data Bases, VLDB 2016 - New Delhi, India
Duration: 5 Sept 2016 to 9 Sept 2016
