TY - GEN
T1 - Approximate XML joins
AU - Guha, Sudipto
AU - Jagadish, H. V.
AU - Koudas, Nick
AU - Srivastava, Divesh
AU - Yu, Ting
N1 - Publisher Copyright:
© 2002 ACM.
PY - 2002/6/3
Y1 - 2002/6/3
N2 - XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety sources. Hence, XML is likely to be the format through which data from multiple sources is integrated.In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently approximate match in structure, in addition to, content has to be folded in the join operation. We quantify approximate match in structure and content using well defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.
AB - XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety sources. Hence, XML is likely to be the format through which data from multiple sources is integrated.In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently approximate match in structure, in addition to, content has to be folded in the join operation. We quantify approximate match in structure and content using well defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.
UR - http://www.scopus.com/inward/record.url?scp=85134325897&partnerID=8YFLogxK
U2 - 10.1145/564691.564725
DO - 10.1145/564691.564725
M3 - Conference contribution
AN - SCOPUS:85134325897
T3 - Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002
SP - 287
EP - 298
BT - Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002
PB - Association for Computing Machinery, Inc
T2 - 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002
Y2 - 3 June 2002 through 6 June 2002
ER -