Reading the book of life: the language of proteins






MRB 202


Malay K Basu (Associate Professor, Department of Pathology and Lab Medicine, University of Kansas Medical Center)



Genomes are remarkably similar to natural language texts. From an information-theoretic perspective, we can think of amino acid residues as letters, protein domains as words, and proteins as sentences consisting of ordered arrangements of protein domains (domain architectures). This work describes our recent efforts toward understanding the linguistic properties of genomes.


Our earlier work showed that the complexity of the “grammars” in all major branches of life is close to a universal constant of ~1.2 bits. This is remarkably similar to natural languages, where a comparable and as yet unexplained universal information gain has been observed and used to determine whether a series of symbols represents a language. Here, we describe the implications of that finding and its extension to various areas, with particular emphasis on measuring proteome complexity in human tissues.
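
As a rough illustration of what an "information gain" in bits can mean for domain architectures, the sketch below computes the mutual information between adjacent domains in a set of proteins: how much knowing one domain reduces uncertainty about the next. This is only one possible definition, assumed here for illustration, and is not necessarily the measure used in the work described. The function names and the toy domain labels are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def domain_information_gain(architectures):
    """Mutual information (bits) between adjacent domains.

    `architectures` is a list of proteins, each an ordered list of
    domain names. Assumption: "information gain" is taken to be
    I(prev; next) = H(prev) + H(next) - H(prev, next),
    estimated from adjacent-domain (bigram) counts.
    """
    bigrams = Counter(
        pair for arch in architectures for pair in zip(arch, arch[1:])
    )
    prev_counts = Counter()
    next_counts = Counter()
    for (prev, nxt), count in bigrams.items():
        prev_counts[prev] += count
        next_counts[nxt] += count
    return entropy(prev_counts) + entropy(next_counts) - entropy(bigrams)


# Toy proteomes with made-up domain labels.
strict_order = [["A", "B"], ["A", "B"], ["C", "D"], ["C", "D"]]
free_order = [["A", "B"], ["A", "C"], ["D", "B"], ["D", "C"]]

print(domain_information_gain(strict_order))
print(domain_information_gain(free_order))
```

In the first toy proteome each domain fully determines its neighbor, so the gain is high; in the second, neighbors are independent and the gain vanishes. Real estimates would need far larger samples and a correction for sampling bias.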


Our work established the similarity between natural languages and genomes, showed for the first time that a “quasi-universal grammar” of protein domains exists, and measured the minimal proteome complexity required for a functional cell. We also describe proteome complexity in human tissues and its functional and evolutionary implications.