Abstract
Data quality problems often manifest themselves as inconsistencies between systems or inconsistencies with reality. The latter can be explained as a mismatch between real-world objects and the way they are represented in databases. The former, exemplified by semantic and structural data heterogeneity, replicated data, and data integrity problems, is the main cause of duplicate records. Often the same real-world entity is represented by two or more records. Frequently, these duplicate records do not share a common key and/or contain errors that make matching them a difficult problem. This specific problem is the subject of this expository study. The study relies on a thorough analysis of the literature, on an intimate knowledge of an application, and on many years of working on data quality in sensitive telecommunication systems. This paper also presents a taxonomy of record matching algorithms and proposed solutions. The paper is limited to record matching and addresses the general problem of data quality only in passing; that broader problem has been, and should continue to be, the topic of other survey papers.
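As a minimal illustration of the problem the abstract describes (duplicates that share no common key and contain errors), the sketch below shows one simple approximate-matching approach based on field-wise string similarity. It is not the method proposed in the paper; the fields, weights, and threshold are assumptions chosen for the example, and the similarity measure (Python's standard-library `difflib`) stands in for whatever comparison function a real system would use.

```python
# Illustrative sketch only: approximate record matching without a shared key,
# using field-wise string similarity from the Python standard library.
# The fields and the 0.85 threshold are arbitrary assumptions, not values from the paper.
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def records_match(rec1: dict, rec2: dict, fields: list[str], threshold: float = 0.85) -> bool:
    """Declare two records duplicates if their average field similarity exceeds the threshold."""
    scores = [field_similarity(rec1.get(f, ""), rec2.get(f, "")) for f in fields]
    return sum(scores) / len(scores) >= threshold

# Example: the same real-world customer, no common key, typographical differences.
r1 = {"name": "Jon Smith",  "street": "12 Main Street", "city": "Dublin"}
r2 = {"name": "John Smith", "street": "12 Main St.",    "city": "Dublin"}
print(records_match(r1, r2, ["name", "street", "city"]))  # likely True
```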
| Original language | English |
|---|---|
| Publication status | Published - 2001 |
| Externally published | Yes |