Adaptive correlation exploitation in big data query optimization

Yuchen Liu, Hai Liu, Dongqing Xiao, Mohamed Y. Eltabakh*

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

8 Citations (Scopus)

Abstract

Correlations among the data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, the state-of-art techniques are all heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to the modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD+ system for exploiting the data’s correlations in big data query optimization. EXORD+ supports two types of correlations; hard (which does not allow for exceptions) and soft (which allows for exceptions). We introduce a three-phase approach for managing soft correlations including: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD+ introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show the complexity of this problem (NP-Hard) and propose a heuristic to efficiently solve it in a polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. EXORD+ prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD+ in achieving more than 10x speedup while introducing minimal storage overheads.

Original languageEnglish
Pages (from-to)873-898
Number of pages26
JournalVLDB Journal
Volume27
Issue number6
DOIs
Publication statusPublished - 1 Dec 2018
Externally publishedYes

Keywords

  • Big data
  • Data correlations
  • Incremental maintenance
  • Query optimization
  • Soft and hard correlations

Fingerprint

Dive into the research topics of 'Adaptive correlation exploitation in big data query optimization'. Together they form a unique fingerprint.

Cite this