An extended de Bruijn graph for feature engineering over biological sequential data

Mert Onur Cakiroglu, Hasan Kurban*, Parichit Sharma, M. Oguzhan Kulekci, Elham Khorasani Buxton, Maryam Raeeszadeh-Sarmazdeh, Mehmet M. Dalkilic

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

3 Citations (Scopus)

Abstract

In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering in biological sequential data such as proteins. This framework simplifies feature extraction by dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms. Our framework accounts for amino acid substitutions by efficiently adjusting the edge weights in the dBG using a secondary trie structure. We extract motifs from the dBG by traversing the heavy edges, and then incorporate alignment algorithms like BLAST and Smith-Waterman to generate features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline, state-of-the-art PLM models, and those from the popular GLAM2 tool. Furthermore, our framework successfully identified Glycine and Arginine-rich motifs with high coverage, highlighting it is potential in general pattern discovery.

Original languageEnglish
Article number035020
Number of pages14
JournalMachine Learning: Science and Technology
Volume5
Issue number3
DOIs
Publication statusPublished - 1 Sept 2024

Keywords

  • Bioinformatics
  • Machine learning
  • de Bruijn graph

Fingerprint

Dive into the research topics of 'An extended de Bruijn graph for feature engineering over biological sequential data'. Together they form a unique fingerprint.

Cite this