TY - JOUR
T1 - An extended de Bruijn graph for feature engineering over biological sequential data
AU - Cakiroglu, Mert Onur
AU - Kurban, Hasan
AU - Sharma, Parichit
AU - Kulekci, M. Oguzhan
AU - Buxton, Elham Khorasani
AU - Raeeszadeh-Sarmazdeh, Maryam
AU - Dalkilic, Mehmet M.
N1 - Publisher Copyright:
© 2024 The Author(s). Published by IOP Publishing Ltd.
PY - 2024/9/1
Y1 - 2024/9/1
N2 - In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering in biological sequential data such as proteins. This framework simplifies feature extraction by dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms. Our framework accounts for amino acid substitutions by efficiently adjusting the edge weights in the dBG using a secondary trie structure. We extract motifs from the dBG by traversing the heavy edges, and then incorporate alignment algorithms like BLAST and Smith-Waterman to generate features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline, state-of-the-art PLM models, and those from the popular GLAM2 tool. Furthermore, our framework successfully identified Glycine and Arginine-rich motifs with high coverage, highlighting its potential in general pattern discovery.
AB - In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering in biological sequential data such as proteins. This framework simplifies feature extraction by dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms. Our framework accounts for amino acid substitutions by efficiently adjusting the edge weights in the dBG using a secondary trie structure. We extract motifs from the dBG by traversing the heavy edges, and then incorporate alignment algorithms like BLAST and Smith-Waterman to generate features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline, state-of-the-art PLM models, and those from the popular GLAM2 tool. Furthermore, our framework successfully identified Glycine and Arginine-rich motifs with high coverage, highlighting its potential in general pattern discovery.
KW - Bioinformatics
KW - Machine learning
KW - de Bruijn graph
UR - http://www.scopus.com/inward/record.url?scp=85199370057&partnerID=8YFLogxK
U2 - 10.1088/2632-2153/ad5fde
DO - 10.1088/2632-2153/ad5fde
M3 - Article
AN - SCOPUS:85199370057
SN - 2632-2153
VL - 5
JO - Machine Learning: Science and Technology
JF - Machine Learning: Science and Technology
IS - 3
M1 - 035020
ER -