Employing checkpoint to improve job scheduling in large-scale systems

Shuangcheng Niu*, Jidong Zhai, Xiaosong Ma, Mingliang Liu, Yan Zhai, Wenguang Chen, Weimin Zheng

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Citations (Scopus)

Abstract

The FCFS-based backfill algorithm is widely used to schedule jobs on high-performance computer systems. The algorithm relies on job runtime estimates provided by users. However, statistics show that these user-provided estimates are inaccurate: users tend to give a runtime estimate much longer than the job's real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint-based preemption to address the inaccuracy of user-provided runtime estimates. The approach is evaluated with real workload traces. The results show that, compared with the FCFS-based backfill algorithm, our scheme improves job scheduling performance in waiting time, slowdown, and mean queue length by up to 40%. Meanwhile, only 4% of the jobs need to perform checkpoints.

Original language: English
Title of host publication: Job Scheduling Strategies for Parallel Processing - 16th International Workshop, JSSPP 2012, Revised Selected Papers
Publisher: Springer Verlag
Pages: 36-55
Number of pages: 20
ISBN (Print): 9783642358661
Publication status: Published - 2013
Externally published: Yes
Event: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012 - Shanghai, China
Duration: 25 May 2012 → 25 May 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7698 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2012
Country/Territory: China
City: Shanghai
Period: 25/05/12 → 25/05/12

Keywords

  • backfill algorithm
  • check-point/restart
  • job scheduling
  • runtime estimate
