TY - JOUR
T1 - Digitizing DNA Sequences Using Multiset-Based Nucleotide Frequencies for Machine Learning-Based Mutation Detection
AU - Anjum, Sanaa
AU - Kousar, Sajida
AU - Kausar, Nasreen
AU - Aydin, Nezir
AU - Olanrewaju, Oludolapo Akanni
AU - Mncwango, Bongumenzi
N1 - Publisher Copyright:
© 2024 Regional Association for Security and crisis management. All rights reserved.
PY - 2024/10/1
Y1 - 2024/10/1
N2 - Investigating algebraic structures in a non-conventional framework extends mathematics toward practical applications in theoretical biology and computer science. One such algebraic structure is the multigroup, whose underlying set is a multiset. The genome is the entire set of DNA instructions found within a cell and contains all the information needed for an individual to develop and function. DNA and RNA are the hereditary materials that play a vital role in the metabolic processes of living things, especially protein synthesis. In genomic databases, DNA sequences are stored as string or text data, whereas machine learning algorithms operate only on numerical data. It is therefore necessary to transform DNA sequence strings into numerical values. This article is organized as follows. First, we prove that the standard genetic code is a multigroup and that the genome architecture of a whole population can be interpreted as a sum of multisets. Next, we describe how the numerical representation of DNA bases relates to their algebraic representation. We then employ Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) networks to identify changes between DNA sequences. Experimental results show that the GRU with multiset-based numerical values for DNA bases achieves 98% accuracy on the test data. This novel technique will aid in the detection of mutations in complex diseases.
AB - Investigating algebraic structures in a non-conventional framework extends mathematics toward practical applications in theoretical biology and computer science. One such algebraic structure is the multigroup, whose underlying set is a multiset. The genome is the entire set of DNA instructions found within a cell and contains all the information needed for an individual to develop and function. DNA and RNA are the hereditary materials that play a vital role in the metabolic processes of living things, especially protein synthesis. In genomic databases, DNA sequences are stored as string or text data, whereas machine learning algorithms operate only on numerical data. It is therefore necessary to transform DNA sequence strings into numerical values. This article is organized as follows. First, we prove that the standard genetic code is a multigroup and that the genome architecture of a whole population can be interpreted as a sum of multisets. Next, we describe how the numerical representation of DNA bases relates to their algebraic representation. We then employ Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) networks to identify changes between DNA sequences. Experimental results show that the GRU with multiset-based numerical values for DNA bases achieves 98% accuracy on the test data. This novel technique will aid in the detection of mutations in complex diseases.
KW - Gene mutations
KW - Multiset average frequency
KW - Multiset DNA structure
KW - Recurrent neural network
UR - http://www.scopus.com/inward/record.url?scp=85206338773&partnerID=8YFLogxK
U2 - 10.31181/dmame7220241213
DO - 10.31181/dmame7220241213
M3 - Article
AN - SCOPUS:85206338773
SN - 2560-6018
VL - 7
SP - 516
EP - 529
JO - Decision Making: Applications in Management and Engineering
JF - Decision Making: Applications in Management and Engineering
IS - 2
ER -