Eagle-eyed elephant: Split-oriented indexing in Hadoop

Mohamed Y. Eltabakh, Fatma Özcan, Yannis Sismanis, Peter J. Haas, Hamid Pirahesh, Jan Vondrak

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

29 Citations (Scopus)

Abstract

An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been increasingly recognized that, as in classical databases, techniques for avoiding access to irrelevant data can dramatically improve query performance. Prior work on Hadoop, however, has simply ported classical techniques to the MapReduce setting, focusing on record-level indexing and key-based partition elimination. Unfortunately, record-level indexing only slightly improves overall query performance, because it does not minimize the number of mapper "waves", which is determined by the number of processed splits. Moreover, key-based partitioning requires data reorganization, which is usually impractical in Hadoop settings. We therefore need to re-envision how data access mechanisms are defined and implemented. To this end, we introduce the Eagle-Eyed Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data. Our experiments show that E3 can achieve up to 20x cost savings with small to moderate storage overheads.
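The split-elimination idea in the abstract can be illustrated with a minimal sketch. It is not the E3 implementation: the function names, the toy data, and the in-memory dictionary are all hypothetical, and the wave count uses the simplifying assumption that waves equal the number of processed splits divided by the available map slots, rounded up.

```python
import math
from collections import defaultdict


def build_split_index(splits):
    """Build a toy inverted index: (field, value) -> set of split IDs.

    `splits` maps a split ID to the list of records (dicts) stored in it.
    Hypothetical structure for illustration only.
    """
    index = defaultdict(set)
    for split_id, records in splits.items():
        for record in records:
            for field, value in record.items():
                index[(field, value)].add(split_id)
    return index


def relevant_splits(index, field, value):
    """Return the IDs of splits that contain at least one matching record."""
    return index.get((field, value), set())


def mapper_waves(num_splits, map_slots):
    """Assumed model: each wave processes at most `map_slots` splits."""
    return math.ceil(num_splits / map_slots)


# Toy dataset: four splits, each holding a few records.
splits = {
    0: [{"city": "Genoa", "sales": 10}, {"city": "Rome", "sales": 7}],
    1: [{"city": "Rome", "sales": 3}],
    2: [{"city": "Milan", "sales": 5}],
    3: [{"city": "Rome", "sales": 8}, {"city": "Milan", "sales": 2}],
}

index = build_split_index(splits)
genoa_splits = relevant_splits(index, "city", "Genoa")

# Without split elimination, every split is processed.
waves_full_scan = mapper_waves(len(splits), map_slots=2)
# With split elimination, only splits containing a match are processed.
waves_with_index = mapper_waves(len(genoa_splits), map_slots=2)
```

In this toy setting the predicate `city = "Genoa"` matches records in a single split, so only that split needs a mapper. A record-level index would find the matching record faster inside each split, but it would still launch a mapper for every split, which is the limitation the abstract points out.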

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2013
Subtitle of host publication: 16th International Conference on Extending Database Technology, Proceedings
Pages: 89-100
Number of pages: 12
Publication status: Published - 2013
Externally published: Yes
Event: 16th International Conference on Extending Database Technology, EDBT 2013 - Genoa, Italy
Duration: 18 Mar 2013 to 22 Mar 2013

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 16th International Conference on Extending Database Technology, EDBT 2013
Country/Territory: Italy
City: Genoa
Period: 18/03/13 to 22/03/13
