TY - JOUR
T1 - Deep learning for blocking in entity matching
T2 - 47th International Conference on Very Large Data Bases, VLDB 2021
AU - Thirumuruganathan, Saravanan
AU - Li, Han
AU - Tang, Nan
AU - Ouzzani, Mourad
AU - Govind, Yash
AU - Paulsen, Derek
AU - Fung, Glenn
AU - Doan, Anhai
N1 - Publisher Copyright: © 2021, VLDB Endowment. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking then matching. Many works have applied deep learning (DL) to matching, but far fewer works have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL and some of them require labeled training data. In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformer, self supervision). We empirically determine which solutions perform best on what kind of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution), on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new venue for research.
UR - http://www.scopus.com/inward/record.url?scp=85119670722&partnerID=8YFLogxK
U2 - 10.14778/3476249.3476294
DO - 10.14778/3476249.3476294
M3 - Conference article
AN - SCOPUS:85119670722
SN - 2150-8097
VL - 14
SP - 2459
EP - 2472
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
Y2 - 16 August 2021 through 20 August 2021
ER -