Life language processing: deep learning-based language-agnostic processing of proteomics, genomics/metagenomics, and human languages

Research output: Types of Thesis › Doctoral thesis

Abstract

A broad and simple definition of ‘language’ is a set of sequences constructed from a finite set of symbols. By this definition, biological sequences, human languages, and many other sequential phenomena can be viewed as languages. Although the definition is simple, it covers languages whose sequences are generated by very complicated grammars. Examples are the biophysical principles governing biological sequences (e.g., DNA, RNA, and protein sequences) and the grammars of human languages that determine the structure of clauses and sentences. This dissertation takes a language-agnostic point of view in processing both biological sequences and human languages. Two main strategies are adopted for this purpose: (i) character-level, or more accurately subsequence-level, processing of languages, which allows simple modeling of sequence similarities based on local information, i.e., a bag of subsequences; and (ii) language-model-based representation learning, which encodes contextual information about sequence elements using neural network language models. Using these strategies, I propose language-agnostic, subsequence-based language processing to address three main research problems in proteomics, genomics/metagenomics, and natural languages from the same point of view.
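
As a rough illustration of strategy (i), the sketch below builds a bag-of-subsequences representation by counting overlapping k-mers and compares two sequences by cosine similarity. It is a minimal example under stated assumptions, not the method used in the thesis: the function names `kmer_counts` and `cosine_similarity`, the choice of k = 3, and the sample sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence, k=3):
    """Count overlapping length-k subsequences (k-mers) of a sequence.

    Works on any string, so DNA, protein, and natural-language text are
    all treated the same way (the language-agnostic view).
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical example inputs: two DNA fragments and one English phrase.
dna_a = kmer_counts("ATGCGATGCA")
dna_b = kmer_counts("ATGCGTTGCA")
text = kmer_counts("the cat sat")

print(cosine_similarity(dna_a, dna_b))
print(cosine_similarity(dna_a, text))
```

Because the representation depends only on local subsequences, the same two functions apply unchanged to any symbol sequence; the choice of k controls how much local context each feature captures.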
Original language: English
Publication status: Published - 2019
Externally published: Yes
