Data-driven Variable-length Segmentation of Biological Sequences: Applications in Metagenomics and Proteomics

Ehsaneddin Asgari, Philipp C. Münch, Till R. Lesker, Alice C. McHardy, Mohammad R.K. Mofrad

Research output: Contribution to conference › Paper › peer-review

Abstract

In this paper, we propose a data-driven segmentation approach that divides biological sequences into frequent variable-length sub-sequences, inspired by the Byte-Pair Encoding (BPE) text compression algorithm. In contrast to the recent use of BPE in natural language processing for vocabulary size reduction, we use this idea to increase the size of the symbols in biological sequences, replacing k-mer representations. We investigate the use of this segmentation in 16S rRNA gene processing (Asgari et al., 2019b) and show that this representation can improve the performance of biomarker detection. Furthermore, we extend BPE to perform a probabilistic segmentation of protein sequences and show that it can be used for motif discovery and protein sequence embedding (Asgari et al., 2019a).
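
The BPE idea the abstract builds on can be illustrated with a minimal sketch of the plain greedy algorithm: repeatedly merge the most frequent adjacent symbol pair, so that single nucleotides grow into variable-length units. This is not the authors' implementation and does not cover the probabilistic segmentation extension; the function names (`learn_merges`, `merge_pair`, `segment`), the lexicographic tie-breaking rule, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` merge operations from raw sequences (hypothetical helper)."""
    # Start from single-character symbols (nucleotides or amino acids).
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption)
        # so that the result is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment a new sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy example: learn three merges, then segment an unseen sequence.
toy_merges = learn_merges(["ACGTACGT", "ACGTTT"], num_merges=3)
toy_segments = segment("ACGTTACG", toy_merges)
```

Unlike a fixed k-mer window, the resulting segments differ in length depending on which sub-sequences are frequent in the training data, and concatenating them always recovers the original sequence.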
Original language: English
Publication status: Published - 2020
Externally published: Yes
Event: ICML 2020 Workshop on Computational Biology (WCB)
Duration: 17 Jul 2020 → 17 Jul 2020

