TY - JOUR
T1 - Adaptive correlation exploitation in big data query optimization
AU - Liu, Yuchen
AU - Liu, Hai
AU - Xiao, Dongqing
AU - Eltabakh, Mohamed Y.
N1 - Publisher Copyright:
© 2018, Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2018/12/1
Y1 - 2018/12/1
N2 - Correlations among the data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, the state-of-art techniques are all heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to the modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD+ system for exploiting the data’s correlations in big data query optimization. EXORD+ supports two types of correlations; hard (which does not allow for exceptions) and soft (which allows for exceptions). We introduce a three-phase approach for managing soft correlations including: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD+ introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show the complexity of this problem (NP-Hard) and propose a heuristic to efficiently solve it in a polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. EXORD+ prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD+ in achieving more than 10x speedup while introducing minimal storage overheads.
AB - Correlations among the data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, the state-of-art techniques are all heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to the modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD+ system for exploiting the data’s correlations in big data query optimization. EXORD+ supports two types of correlations; hard (which does not allow for exceptions) and soft (which allows for exceptions). We introduce a three-phase approach for managing soft correlations including: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD+ introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show the complexity of this problem (NP-Hard) and propose a heuristic to efficiently solve it in a polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. EXORD+ prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD+ in achieving more than 10x speedup while introducing minimal storage overheads.
KW - Big data
KW - Data correlations
KW - Incremental maintenance
KW - Query optimization
KW - Soft and hard correlations
UR - http://www.scopus.com/inward/record.url?scp=85050820054&partnerID=8YFLogxK
U2 - 10.1007/s00778-018-0515-8
DO - 10.1007/s00778-018-0515-8
M3 - Article
AN - SCOPUS:85050820054
SN - 1066-8888
VL - 27
SP - 873
EP - 898
JO - VLDB Journal
JF - VLDB Journal
IS - 6
ER -