Abstract
In this paper, we propose a data-driven segmentation approach that divides biological sequences into frequent variable-length subsequences, inspired by the Byte-Pair Encoding (BPE) text compression algorithm. In contrast to the recent use of BPE in natural language processing to reduce vocabulary size, we use the idea to increase the size of the symbols in biological sequences, replacing k-mer representations. We investigate this segmentation in 16S rRNA gene processing (Asgari et al., 2019b) and show that the resulting representation can improve the performance of biomarker detection. Furthermore, we extend BPE to perform a probabilistic segmentation of protein sequences and show that it can be used for motif discovery and protein sequence embedding (Asgari et al., 2019a).
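
The BPE-style segmentation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `learn_merges`, `merge_pair`, and `segment` are hypothetical, the toy sequences are invented for illustration, and the probabilistic extension for protein sequences is not covered. The sketch assumes the standard greedy BPE procedure, in which the most frequent pair of adjacent symbols is merged repeatedly, so that frequent subsequences become single variable-length symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Greedily learn up to `num_merges` merge operations from the sequences."""
    # Start from single-character symbols (nucleotides or amino acids).
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # Every sequence has collapsed to a single symbol.
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        corpus = [merge_pair(symbols, best_pair) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment a new sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy example: learn three merges from two short DNA-like strings.
merges = learn_merges(["ACGTACGT", "ACGTTT"], num_merges=3)
print(segment("TTACGTA", merges))
```

The number of merges plays the role that k plays for k-mers: more merges yield longer and more specific symbols, but the segment boundaries are driven by subsequence frequency in the training data rather than by a fixed window length.
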
Original language | English |
---|---|
Publication status | Published - 2020 |
Externally published | Yes |
Event | ICML 2020 Workshop on Computational Biology (WCB) - Duration: 17 Jul 2020 → 17 Jul 2020 |
Workshop
Workshop | ICML 2020 Workshop on Computational Biology (WCB) |
---|---|
Period | 17/07/20 → 17/07/20 |