TY - GEN
T1 - Semantics-based distributed I/O for mpiBLAST
AU - Balaji, P.
AU - Feng, W.
AU - Archuleta, J.
AU - Lin, H.
AU - Kettimuthu, R.
AU - Thakur, R.
AU - Ma, X.
PY - 2008
Y1 - 2008
N2 - BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search (in parallel) unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as distributed filesystems spread across a wide-area network. Thus, we present ParaMEDIC - a novel environment that uses applicationspecific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers. Compute workers, instead of directly writing the output to the filesystem, the workers process the output using semantic knowledge about the application to generate metadata and write the metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, thus accelerating mpiBLAST by as much as 25-fold.
AB - BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search (in parallel) unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as distributed filesystems spread across a wide-area network. Thus, we present ParaMEDIC - a novel environment that uses applicationspecific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers. Compute workers, instead of directly writing the output to the filesystem, the workers process the output using semantic knowledge about the application to generate metadata and write the metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, thus accelerating mpiBLAST by as much as 25-fold.
KW - Distributed filesystem
KW - I/O
KW - MpiBLAST
UR - http://www.scopus.com/inward/record.url?scp=69249246712&partnerID=8YFLogxK
U2 - 10.1145/1345206.1345262
DO - 10.1145/1345206.1345262
M3 - Conference contribution
AN - SCOPUS:69249246712
SN - 9781595939609
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 293
EP - 294
BT - PPoPP'08 - Proceedings of the 2008 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
ER -