TY - GEN
T1 - Cost estimation across heterogeneous SQL-based big data infrastructures in Teradata IntelliSphere®
AU - Awada, Kassem
AU - Eltabakh, Mohamed Y.
AU - Tang, Conrad
AU - Al-Kateb, Mohammed
AU - Nair, Sanjay
AU - Au, Grace
N1 - Publisher Copyright:
© 2020 Copyright held by the owner/author(s).
PY - 2020
Y1 - 2020
N2 - In big data ecosystems, it is becoming inevitable to query data that span multiple heterogeneous data sources (remote systems) to build meaningful querying and analytical workflows. Existing work that aims at unifying heterogeneous systems into a single architecture lacks the fundamental aspect of efficient cost estimation of SQL-based operators over remote systems. The problem is fundamental because all modern optimizers are cost-based, and without accurate cost estimation for each query operator, the generated plans can be way off the optimal plan. Nevertheless, the problem is mostly overlooked by existing systems because the focus is either on homogeneous distributed RDBMSs in which cost estimation is already extensively studied, or on fully heterogeneous engines in which SQL querying and SQL query optimization are not applicable (or at least are not the core problem). In this paper, we propose a comprehensive remote-system cost estimation module for SQL operators, which is a core module within the Teradata IntelliSphere architecture. The proposed module encompasses three costing approaches, namely logical-operator, sub-operator, and hybrid approaches, which are suitable for black box, open box, and a mix of black and open box systems, respectively. The cost estimation module leverages analytical and deep learning models with novel techniques for efficient extrapolation when needed. The techniques presented in this paper are modular and can be adopted by other systems. Extensive experimental evaluation shows the practicality and efficiency of the proposed system.
UR - http://www.scopus.com/inward/record.url?scp=85084192817&partnerID=8YFLogxK
U2 - 10.5441/002/edbt.2020.64
DO - 10.5441/002/edbt.2020.64
M3 - Conference contribution
AN - SCOPUS:85084192817
T3 - Advances in Database Technology - EDBT
SP - 534
EP - 545
BT - Advances in Database Technology - EDBT 2020
A2 - Bonifati, Angela
A2 - Zhou, Yongluan
A2 - Vaz Salles, Marcos Antonio
A2 - Böhm, Alexander
A2 - Olteanu, Dan
A2 - Fletcher, George
A2 - Khan, Arijit
A2 - Yang, Bin
PB - OpenProceedings.org
T2 - 23rd International Conference on Extending Database Technology, EDBT 2020
Y2 - 30 March 2020 through 2 April 2020
ER -