Deep learning for blocking in entity matching: A design space exploration

Saravanan Thirumuruganathan*, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, Anhai Doan

*Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-review

49 Citations (Scopus)

Abstract

Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking and then matching. Many works have applied deep learning (DL) to matching, but far fewer have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL, and some of them require labeled training data. In this paper, we develop the DeepBlocker framework, which significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformers, self-supervision). We empirically determine which solutions perform best on which kinds of datasets (structured, textual, or dirty). We show that the best of these eight solutions outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution) on dirty and textual data, and are comparable on structured data. Finally, we show that combining the best DL and non-DL solutions can perform even better, suggesting a new avenue for research.

Original language: English
Pages (from-to): 2459-2472
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 14
Issue number: 11
Publication status: Published - 2021
Event: 47th International Conference on Very Large Data Bases, VLDB 2021 - Virtual, Online
Duration: 16 Aug 2021 to 20 Aug 2021
