Protein evolution and Astrobiology
The set of proteins in an organism is modified by three distinct processes:
de novo birth of new genes from previously non-coding sequences
gene duplication and loss
descent with modification
All three of these processes are subject to biases from both natural selection and mutation.
De novo gene birth
Protein-coding genes are sometimes born de novo from non-coding sequences (either intergenic or in an alternative reading frame). This includes both ancient birth events that predate our Last Universal Common Ancestor and very recent birth events. A major focus of our current research is the search for signals of proteins that were born before the genetic code included all 20 of its current amino acids. From these signals we are working to deduce the nature of very early protein-based life, and to connect this to measurements from the asteroid Bennu (Wehbi et al. 2024).
This builds on our previous work studying recent cases of de novo gene birth. A newborn protein must avoid causing harm, e.g. through aggregation, while providing some benefit to the organism. High levels of intrinsic structural disorder help newborn genes strike this balance between avoiding harm and providing benefit (Wilson et al. 2017, Willis & Masel 2018, James et al. 2021). An amino acid composition that promotes intrinsic structural disorder is correlated with more benign fitness effects within random peptides (Kosinski et al. 2022).
Our theories of evolvability, specifically pre-adapting selection (Masel 2006, Rajon & Masel 2011), help explain why the overwhelmingly high likelihood that a random peptide is harmful does not prevent de novo gene birth. If non-coding sequences are translated at low levels, this provides an opportunity for the most deleterious amino acid sequences to be eliminated by selection (Wilson & Masel 2011). This pre-screened set of sequences provides the raw material from which de novo protein-coding genes can be co-opted. Consistent with this possibility, many “non-coding” sequences are found in association with ribosomes in S. cerevisiae, meaning they are likely translated at low levels. Our use of riboprofiling data as the first demonstration of this pervasive translation (Wilson & Masel 2011) was also the first study to use riboprofiling data for the purpose of gene annotation.
Sometimes only part of a protein is recently born de novo. We therefore often focus on the age of a protein domain, rather than the age of a whole gene, because different parts of the gene have different ages (James et al. 2021, 2023). One way a protein can grow via partial de novo birth is when a stop codon is lost, and the 3′UTR, up to a backup stop codon, becomes part of the protein C-terminus (Giacomelli et al. 2007, Andreatta et al. 2015). Low levels of stop codon readthrough prescreen the genetic variation beyond stop codons, raising its intrinsic structural disorder (Kosinski & Masel 2020).
In yeast, elevated levels of stop codon readthrough can be caused by the [PSI+] prion, an epigenetically inherited aggregate of the Sup35 protein, which is a release factor required for translation to terminate at stop codons. When [PSI+] appears, elevated readthrough occurs at every gene in the genome, and a range of pre-existing cryptic genetic variation is phenotypically revealed. As an epigenetically inherited protein aggregate, [PSI+] can easily be lost after some generations. This returns the lineage to its normal [psi-] state and restores translation fidelity. If a subset of revealed phenotypic variation is adaptive, it may have lost its dependence on [PSI+] by this time. This process of genetic assimilation may, for example, involve one or more point mutations in stop codons. This leaves the yeast with a new adaptive trait and with no permanent load of other, deleterious variation. The yeast prion [PSI+] is a wonderful model system for studying evolutionary capacitance, because the relevant molecular biology is well understood. In Saccharomyces, a high proportion of 3′UTR incorporation events involve the inclusion of in-frame 3′UTR through precise mutation of the stop codon, rather than frameshifts. This is compatible with the genetic assimilation of in-frame readthrough products produced by [PSI+] (Giacomelli et al. 2007).
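The C-terminal extension produced when a stop codon is lost or read through can be illustrated with a minimal Python sketch. It assumes the input is a DNA string that begins immediately after the original stop codon, so that reading in steps of three stays in frame. The function name `readthrough_extension` and the example sequences are hypothetical, chosen only for illustration; they are not taken from the cited studies.

```python
# Standard DNA stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_extension(utr3):
    """Return the in-frame 3'UTR codons that precede the first backup stop.

    Assumes utr3 starts right after the original stop codon (frame 0).
    Returns None when no in-frame backup stop codon exists, in which case
    translation would run to the end of the transcript.
    """
    utr3 = utr3.upper()
    codons = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(utr3) - 2, 3):
        codon = utr3[i:i + 3]
        if codon in STOP_CODONS:
            return codons
        codons.append(codon)
    return None


# Hypothetical 3'UTR: two sense codons, then an in-frame TAA backup stop.
extension = readthrough_extension("gctggataaacc")
# Hypothetical 3'UTR with no in-frame stop codon at all.
no_backup = readthrough_extension("GCTGGA")
```

Here `extension` holds the two codons that would be appended to the C-terminus, and `no_backup` is `None`. A frameshift rather than a precise stop codon mutation would correspond to starting the scan at offset 1 or 2, which this sketch deliberately does not model.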
Differential duplication and loss
Like most species, most protein domains that ever existed are now extinct. Each cohort or "phylostratum" of protein domains used to be more diverse, but domains with some properties have been differentially lost (James et al. 2023). This tendency is of the right (small) magnitude to be responsible for patterns by which young animal domains have higher intrinsic structural disorder, and older domains across the tree of life have greater interspersion of their hydrophobic amino acids (James et al. 2023). Measuring and correcting for the ways in which ancient proteins that survive today might be unrepresentative is important for accurately inferring the nature of ancient life.
Descent with modification
Descent with modification is important for explaining protein differences between species with high vs. low "effective population size" (Weibel et al. 2024, McShea et al. 2024).
Phylogenetic inference and ancestral sequence reconstruction for proteins rely on mathematical models of the relative rates of amino acid substitutions, usually under the assumption that amino acid frequencies stay constant over time (stationarity) and that fluxes obey detailed balance (time reversibility). We have contributed to the development of time non-reversible amino acid substitution models that not only fit the data better, but also allow phylogenetic trees to be rooted without using an outgroup (Dang et al. 2022). We are continuing to improve substitution models by filtering alignment errors out of training data. We are excited to use non-stationary models to estimate amino acid frequencies at ancestral nodes. We hope to characterize the proteome and amino acid frequencies of ancient life on Earth all the way back to the last universal common ancestor (LUCA), and indeed, even before. In doing so, we hope to learn more about the environmental conditions of ancient life, and the origins of the genetic code.
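The difference between stationarity and detailed balance can be made concrete with a toy example. The Python sketch below uses a hypothetical three-state alphabet instead of 20 amino acids, with made-up frequencies and rates. It is not the nQMaker implementation; it only illustrates the two conditions. A rate matrix Q with stationary frequencies pi is stationary when the sum over i of pi_i Q_ij is zero for every j, and time reversible when pi_i Q_ij = pi_j Q_ji for every pair of states.

```python
from itertools import product


def fill_diagonal(rates):
    """Return a rate matrix whose rows sum to zero, given off-diagonal rates."""
    n = len(rates)
    q = [row[:] for row in rates]
    for i in range(n):
        q[i][i] = -sum(rates[i][j] for j in range(n) if j != i)
    return q


def is_stationary(pi, q, tol=1e-12):
    """Stationarity: sum_i pi_i * Q_ij == 0 for every state j."""
    n = len(pi)
    return all(abs(sum(pi[i] * q[i][j] for i in range(n))) < tol
               for j in range(n))


def is_reversible(pi, q, tol=1e-12):
    """Detailed balance: pi_i * Q_ij == pi_j * Q_ji for every pair (i, j)."""
    n = len(pi)
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < tol
               for i, j in product(range(n), repeat=2))


# Hypothetical stationary frequencies for a three-letter alphabet.
pi = [0.5, 0.3, 0.2]

# Reversible model: Q_ij = s_ij * pi_j with symmetric exchangeabilities s.
s = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 0.5],
     [2.0, 0.5, 0.0]]
reversible_q = fill_diagonal(
    [[s[i][j] * pi[j] for j in range(3)] for i in range(3)])

# Non-reversible model: add a net cyclic flux c around 0 -> 1 -> 2 -> 0.
# Every state gains c of inflow and c of outflow, so pi stays stationary,
# but the fluxes between each pair of states are no longer equal.
c = 0.05
cyclic_rates = [[s[i][j] * pi[j] for j in range(3)] for i in range(3)]
for i, j in [(0, 1), (1, 2), (2, 0)]:
    cyclic_rates[i][j] += c / pi[i]
cyclic_q = fill_diagonal(cyclic_rates)
```

Both matrices leave the frequencies unchanged over time, but only the first satisfies detailed balance. The net circulation in the second gives the process a direction in time, which is the kind of asymmetry that a non-reversible model can exploit to place the root of a tree.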
Adaptive paths through descent with modification face "frustration" because the same hydrophobic amino acids that promote functional folding also promote harmful aggregation. This creates a special kind of adaptive landscape (Bertram & Masel 2020).
Publications:
Wehbi S, Wheeler A, Morel B, Minh BQ, Lauretta DS, Masel J (2024). Identifying the Last Universal Common Ancestor’s protein domains resolves the order in which the amino acids were recruited into the genetic code. bioRxiv.
McShea H, Weibel C, Wehbi S, Goodman P, James JE, Wheeler A, Masel J (2024). The effectiveness of selection in a species affects the direction of amino acid frequency evolution. bioRxiv.
Weibel C, Wheeler A, James JE, Willis SM, McShea H, Masel J (2024). The protein domains of vertebrate species in which selection is more effective have greater intrinsic structural disorder. eLife, 12, RP87335.
James JE, Nelson PG, Masel J (2023). Differential retention of Pfam domains contributes to long-term evolutionary trends. Molecular Biology & Evolution, 40, msad073.
Dang CC, Minh BQ, McShea H, Masel J, James JE, Vinh LS, Lanfear R (2022). nQMaker: estimating time non-reversible amino acid substitution models. Systematic Biology, 71, 1110–1123.
Kosinski L, Aviles NR, Gomez K, Masel J (2022). Random peptides rich in small and disorder-promoting amino acids are less likely to be harmful. Genome Biology & Evolution, 14, evac085.
James JE, Willis SM, Nelson PG, Weibel C, Kosinski L, Masel J (2021). Universal and taxon-specific trends in protein sequences as a function of age. eLife, 10, e57347.
Zile K, Dessimoz C, Wurm Y, Masel J (2020). Only a single taxonomically restricted gene family in the Drosophila melanogaster subgroup can be identified with high confidence. Genome Biology & Evolution, 12(8), 1355–1366.
Bertram J, Masel J (2020). Evolution rapidly optimizes stability and aggregation in lattice proteins despite pervasive landscape valleys and mazes. Genetics, 214(4), 1047–1057.
Kosinski LJ, Masel J (2020). Readthrough errors purge deleterious cryptic sequences, facilitating the birth of coding sequences. Molecular Biology & Evolution, 37(6), 1761–1774.
Foy SG, Wilson BA, Bertram J, Cordes MHJ, Masel J (2019). A shift in aggregation avoidance strategy marks a long-term direction to protein evolution. Genetics, 211, 1345–1355.
Willis S, Masel J (2018). Gene birth contributes to structural disorder encoded by overlapping genes. Genetics, 210(1), 303–313.
Bungard D, Copple JS, Yan J, Chhun J, Kumirov VK, Foy SG, Masel J, Wysocki VH, Cordes MHJ (2017). Foldability of a natural de novo evolved protein. Structure, 25, 1–10.
Wilson BA, Foy SG, Neme R, Masel J (2017). Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nature Ecology & Evolution, 1, 0146.
Andreatta ME, Levine JA, Foy SG, Guzman L, Kosinski L, Cordes MHJ, Masel J (2015). The recent de novo origin of protein C-termini. Genome Biology & Evolution, 7(6), 1686–1701.
Wilson BA, Masel J (2011). Putatively noncoding transcripts show extensive association with ribosomes. Genome Biology & Evolution, 3, 1245–1252.
Giacomelli MG, Hancock AS, Masel J (2007). The conversion of 3′ UTRs into coding regions. Molecular Biology & Evolution, 24(2), 457–464.