TY - JOUR
T1 - Improving the availability of supercomputer job input data using temporal replication
AU - Wang, Chao
AU - Zhang, Zhe
AU - Ma, Xiaosong
AU - Vazhkudai, Sudharshan S.
AU - Mueller, Frank
PY - 2009/6
Y1 - 2009/6
N2 - Storage systems in supercomputers are a major reason for service interruptions. RAID solutions alone cannot provide sufficient protection as 1) growing average disk recovery times make RAID groups increasingly vulnerable to disk failures during reconstruction, and 2) RAID does not help with higher-level faults such as failed I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate "active" job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with real-cluster experiments. Our results show that the scheme allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.
AB - Storage systems in supercomputers are a major reason for service interruptions. RAID solutions alone cannot provide sufficient protection as 1) growing average disk recovery times make RAID groups increasingly vulnerable to disk failures during reconstruction, and 2) RAID does not help with higher-level faults such as failed I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate "active" job input data by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with real-cluster experiments. Our results show that the scheme allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.
KW - Batch job scheduler
KW - Parallel file system
KW - Reliability
KW - Supercomputer
KW - Temporal replication
UR - http://www.scopus.com/inward/record.url?scp=67349212880&partnerID=8YFLogxK
U2 - 10.1007/s00450-009-0082-8
DO - 10.1007/s00450-009-0082-8
M3 - Article
AN - SCOPUS:67349212880
SN - 1865-2034
VL - 23
SP - 149
EP - 157
JO - Computer Science - Research and Development
JF - Computer Science - Research and Development
IS - 3-4
ER -