Record Matching to Improve Data Quality

Vassilios Verykios, Ahmed Khalifa Elmagarmid, Elias N. Houstis

Research output: Book/Report › Commissioned report › peer-review

Abstract

Data quality is defined in [TB98] as fitness for use, which implies that quality is relative to the use of the data. Problems with data quality tend to fall into two categories: inconsistency among systems and inconsistency with reality. Format/syntax, semantic, and value inconsistencies are representative of inconsistency among systems, whereas incorrect and missing values are representative of inconsistency with reality.

In this paper, we address the record matching problem, which is related to value inconsistencies and to incorrect or missing values. Inconsistencies related to duplicated or partially overlapping information among systems occur when changes in one system are not reflected in the other systems, for reasons such as bad design or lack of trust among systems. The record matching problem is the difficulty of identifying entities in different interoperating systems (which evolve independently over time) that refer to the same real-life entity. This is a typical problem in multi-system organizations where data residing in diverse systems needs to be merged, either for assessing financial risks or for cutting down costs associated with various projects. The methodology presented in this paper unifies a variety of techniques addressing the record matching problem, which we treat as a classification task. The techniques used are inductive learning, clustering, fuzzy set theory, and uncertainty reasoning, all of which will improve existing methodologies with regard to both the accuracy of the record matching approach and the computational complexity of identifying approximate duplicate records in large data sets.
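
To make the classification framing concrete, the following is a minimal sketch of deciding whether a pair of records matches. It is not the methodology of the report: the abstract does not specify the classifier, so this sketch substitutes a simple threshold on the average string similarity of the compared fields. The function names, the field names, and the 0.85 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values.

    Case and surrounding whitespace are ignored. A missing (empty) value
    is treated as no evidence of a match and scores 0.0 (an assumption).
    """
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def comparison_vector(rec_a: dict, rec_b: dict, fields: list) -> list:
    """One similarity score per compared field."""
    return [field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]


def classify_pair(rec_a: dict, rec_b: dict, fields: list, threshold: float = 0.85) -> str:
    """Label a record pair 'match' or 'non-match'.

    Hypothetical decision rule: average the field similarities and compare
    the result against a fixed threshold.
    """
    vector = comparison_vector(rec_a, rec_b, fields)
    score = sum(vector) / len(vector)
    return "match" if score >= threshold else "non-match"


# Two records from different (hypothetical) systems describing one person.
system_a = {"name": "John Smith", "city": "West Lafayette"}
system_b = {"name": "john smith ", "city": "West Lafayette"}
unrelated = {"name": "Xyz", "city": "Qrt"}

label_same = classify_pair(system_a, system_b, ["name", "city"])
label_diff = classify_pair(system_a, unrelated, ["name", "city"])
```

A learned classifier, such as the inductive learning the abstract mentions, would replace the fixed threshold with a decision rule fitted to labeled record pairs; the comparison vector would remain its input.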
Original language: English
Publication status: Published - 1999
Externally published: Yes
