TY - GEN
T1 - Query-time record linkage and fusion over Web databases
AU - Rezig, El Kindi
AU - Dragut, Eduard C.
AU - Ouzzani, Mourad
AU - Elmagarmid, Ahmed K.
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/5/26
Y1 - 2015/5/26
N2 - Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real-world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the 'ideal' representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.
AB - Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real-world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the 'ideal' representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.
UR - http://www.scopus.com/inward/record.url?scp=84940834271&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2015.7113271
DO - 10.1109/ICDE.2015.7113271
M3 - Conference contribution
AN - SCOPUS:84940834271
T3 - Proceedings - International Conference on Data Engineering
SP - 42
EP - 53
BT - 2015 IEEE 31st International Conference on Data Engineering, ICDE 2015
PB - IEEE Computer Society
T2 - 2015 31st IEEE International Conference on Data Engineering, ICDE 2015
Y2 - 13 April 2015 through 17 April 2015
ER -