Abstract
With the increasing complexity of data-intensive MapReduce workloads, Hadoop must often accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated datasets, e.g., latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. The recurring nature of these emerging workloads, combined with their SLA constraints, makes it challenging to share and optimize their execution. While some recent efforts on multi-job optimization in MapReduce have emerged, they focus only on sharing work among ad-hoc jobs on static datasets. Unfortunately, these sharing techniques neither take the recurring nature of the queries into account nor guarantee the satisfaction of the SLA requirements. In this work, we propose the first scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure, called "Helix". Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Helix then introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them to maximize SLA satisfaction. Our experimental results over real-world datasets confirm that Helix significantly outperforms the state-of-the-art techniques by an order of magnitude.
Original language | English |
---|---|
Pages (from-to) | 714-725 |
Number of pages | 12 |
Journal | Proceedings of the VLDB Endowment |
Volume | 8 |
Issue number | 7 |
DOIs | |
Publication status | Published - 2015 |
Externally published | Yes |
Event | 41st International Conference on Very Large Data Bases, VLDB 2015 - Kohala Coast, United States. Duration: 31 Aug 2015 → 4 Sept 2015 |