TY - GEN
T1 - CLIMBER
T2 - 40th IEEE International Conference on Data Engineering, ICDE 2024
AU - Zhang, Liang
AU - Eltabakh, Mohamed Y.
AU - Rundensteiner, Elke A.
AU - Alnuaim, Khalid
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/5/16
Y1 - 2024/5/16
N2 - The terabyte-scale of data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building block operation in most analytics services on data series. Unfortunately, these techniques are heavily geared towards achieving scalability at the cost of sacrificing the results' accuracy. State-of-the-art systems DPiSAX and TARDIS report accuracy below 10% and 40%, respectively, which is not practical for many real-world applications. In this paper, we investigate the root problems in these existing techniques that limit their ability to achieve better a trade-off between scalability and accuracy. Then, we propose a framework, called CLIMBER, that encompasses a novel feature extraction mechanism, indexing scheme, and query processing algorithms for supporting approximate similarity search in big data series. For CLIMBER, we propose a new loss-resistant dual representation composed of rank-sensitive and ranking-insensitive signatures capturing data series objects. Based on this representation, we devise a distributed two-level index structure supported by an efficient data partitioning scheme. Our similarity metrics tailored for this dual representation enables meaningful comparison and distance evaluation between the rank-sensitive and ranking-insensitive signatures. Finally, we propose two efficient query processing algorithms, CLIMBER-kNN and CLIMBER-kNN-Adaptive, for answering approximate kNN similarity queries. Our experimental study on real-world and benchmark datasets demonstrates that CLIMBER, unlike existing techniques, features results' accuracy above 80% while retaining the desired scalability to terabytes of data.
AB - The terabyte-scale of data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building block operation in most analytics services on data series. Unfortunately, these techniques are heavily geared towards achieving scalability at the cost of sacrificing the results' accuracy. State-of-the-art systems DPiSAX and TARDIS report accuracy below 10% and 40%, respectively, which is not practical for many real-world applications. In this paper, we investigate the root problems in these existing techniques that limit their ability to achieve better a trade-off between scalability and accuracy. Then, we propose a framework, called CLIMBER, that encompasses a novel feature extraction mechanism, indexing scheme, and query processing algorithms for supporting approximate similarity search in big data series. For CLIMBER, we propose a new loss-resistant dual representation composed of rank-sensitive and ranking-insensitive signatures capturing data series objects. Based on this representation, we devise a distributed two-level index structure supported by an efficient data partitioning scheme. Our similarity metrics tailored for this dual representation enables meaningful comparison and distance evaluation between the rank-sensitive and ranking-insensitive signatures. Finally, we propose two efficient query processing algorithms, CLIMBER-kNN and CLIMBER-kNN-Adaptive, for answering approximate kNN similarity queries. Our experimental study on real-world and benchmark datasets demonstrates that CLIMBER, unlike existing techniques, features results' accuracy above 80% while retaining the desired scalability to terabytes of data.
KW - data series
KW - distributed processing
KW - indexing framework
KW - similarity search
UR - http://www.scopus.com/inward/record.url?scp=85200488527&partnerID=8YFLogxK
U2 - 10.1109/ICDE60146.2024.00301
DO - 10.1109/ICDE60146.2024.00301
M3 - Conference contribution
AN - SCOPUS:85200488527
T3 - Proceedings - International Conference on Data Engineering
SP - 3933
EP - 3946
BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
PB - IEEE Computer Society
Y2 - 13 May 2024 through 17 May 2024
ER -