TY - GEN
T1 - Optimization of data-intensive next generation sequencing in high performance computing
AU - Kathiresan, Nagarajan
AU - Al-Ali, Rashid
AU - Jithesh, Puthen V.
AU - AbuZaid, Tariq
AU - Temanni, Ramzi
AU - Ptitsyn, Andrey
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/28
Y1 - 2015/12/28
N2 - Advances in Next Generation Sequencing (NGS) technology produce an ever-increasing volume of genomic data every year. These genomic data are processed efficiently by empirical parallelism using High Performance Computing (HPC). The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process large volumes of genomic data, referred to as the "NGS workflow at SIDRA". We used BWAKIT for genome alignment and GATK for variant discovery, which require heavy computation and large amounts of memory, respectively. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of the number of threads or cores) during NGS workflow automation. We analyzed the performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of the NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed 40%, 65%, 71% and 76% improvements in performance while processing 2, 4, 8 and 16 samples concurrently using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared to workflows based on application scalability.
KW - BWA
KW - Data-Intensive Workload and Concurrent Parallelization
KW - High Performance Computing
KW - Human Genome Sequence
KW - Next Generation Sequencing
KW - Thread Scalability
UR - http://www.scopus.com/inward/record.url?scp=84962844476&partnerID=8YFLogxK
U2 - 10.1109/BIBE.2015.7367654
DO - 10.1109/BIBE.2015.7367654
M3 - Conference contribution
AN - SCOPUS:84962844476
T3 - 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015
BT - 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2015
Y2 - 2 November 2015 through 4 November 2015
ER -