Thanks to the growth of genomics, proteomics, and metabolomics, it is possible to investigate properties of the Last Universal Common Ancestor (LUCA) and its predecessors in detail. LUCApedia was established to aggregate and unify the results of studies aimed at describing early life through a variety of bioinformatics approaches and pair them with a number of enzymological characteristics predicted in previous studies to reflect catalysts important in the early evolution of life. Users may query the webserver for individual proteins to rapidly identify evidence of deep ancestry. Advanced users may download the database as a series of flat files and use it to discover trends in early enzymatic and metabolic evolution and to test hypotheses related to early life.
Datasets corresponding to studies predicting characteristics of LUCA consist of different data types: protein structures, protein domain folds, clusters of orthologous genes, and others. To use these data in concert, they must be organized into a common framework. We achieve this unification by mapping the datasets to Uniprot IDs [1] (also called “entry names”), KEGG IDs [2], and Biocyc IDs [3]. The three implementations are separate; users may choose one for a study or compare the results of all three to gain greater confidence in their findings. Methods of mapping each dataset to Uniprot, KEGG, and Biocyc IDs are described in Section V.
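The unification step can be illustrated with a minimal Python sketch. It assumes each study's native identifiers (COGs, Pfam motifs, and so on) have already been associated with sets of common protein IDs. The identifiers, the mapping tables, and the helper names `map_to_common_ids` and `evidence_by_protein` are all hypothetical and are not taken from the LUCApedia flat files.

```python
from collections import defaultdict

# Hypothetical mapping tables: dataset-native identifier -> common protein IDs.
# In practice these would be built from the downloaded flat files.
COG_TO_PROTEIN = {"COG_A": {"PROT_1", "PROT_2"}, "COG_B": {"PROT_3"}}
MOTIF_TO_PROTEIN = {"MOTIF_X": {"PROT_2"}, "MOTIF_Y": {"PROT_4"}}


def map_to_common_ids(native_ids, mapping):
    """Return the set of common protein IDs reachable from native IDs."""
    common = set()
    for native_id in native_ids:
        # Native IDs with no known mapping contribute nothing.
        common |= mapping.get(native_id, set())
    return common


def evidence_by_protein(mapped_datasets):
    """Invert {dataset name: protein IDs} into {protein ID: dataset names}."""
    evidence = defaultdict(list)
    for name in sorted(mapped_datasets):
        for protein in mapped_datasets[name]:
            evidence[protein].append(name)
    return dict(evidence)


mapped = {
    "cog_study": map_to_common_ids(["COG_A", "COG_B"], COG_TO_PROTEIN),
    "motif_study": map_to_common_ids(["MOTIF_X"], MOTIF_TO_PROTEIN),
}
evidence = evidence_by_protein(mapped)
print(evidence["PROT_2"])  # ['cog_study', 'motif_study']
```

Once every dataset is expressed in the same ID space, the number of independent studies supporting a protein can be read directly from the inverted table.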
Dataset of ribozyme functions — 32 EC codes
The RNA world hypothesis predicts that the original genetic system involved RNA genes encoding RNA enzymes (also called ribozymes) [4]. This dataset represents enzymatic functions (by Enzyme Commission [5] code) carried out by ribozymes that have been observed in vivo or synthesized in vitro.
Dataset from Harris et al., 2003 [6] — 80 COGs
This study attempted to identify the minimal gene set of LUCA by identifying Clusters of Orthologous Groups of genes [7] (COGs) that were present in every genome available at the time.
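The "present in every genome" criterion amounts to a set intersection over the COG content of each genome. The sketch below illustrates the idea only; the genome names, COG labels, and the function `universal_cogs` are hypothetical and do not reproduce the original study's pipeline.

```python
# Hypothetical COG content per genome (illustrative labels only).
genome_cogs = {
    "genome_1": {"COG_A", "COG_B", "COG_C"},
    "genome_2": {"COG_A", "COG_C"},
    "genome_3": {"COG_A", "COG_C", "COG_D"},
}


def universal_cogs(genome_to_cogs):
    """Return the COGs found in every genome (empty set if no genomes)."""
    cog_sets = list(genome_to_cogs.values())
    if not cog_sets:
        return set()
    return set.intersection(*cog_sets)


print(sorted(universal_cogs(genome_cogs)))  # ['COG_A', 'COG_C']
```

Because a single genome lacking a COG removes it from the result, the criterion is stringent and the resulting set shrinks as more genomes are added.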
Dataset from Mirkin et al., 2003 [8] — 571 COGs
This study applied a less stringent requirement for the gene set of LUCA by adding COGs that appear to be ancient but do not appear in every genome because they have been replaced by functional analogs through non-orthologous gene displacement. LUCApedia 1.0 uses data from this study corresponding to a gain penalty of 1.0.
Dataset from Delaye et al., 2005 [9] — 115 Pfam motifs
This study attempted to model the functional repertoire of LUCA through all-against-all BLAST [10] searches of twenty taxonomically diverse organisms. The result is a series of Pfam [11] motifs that are predicted to have been present in LUCA’s proteome.
Dataset from Yang et al., 2005 [12] — 66 SCOP superfamilies
This study attempted to identify the minimal proteome of LUCA by creating a phylogeny of 174 taxonomically diverse organisms using a quantitative classification system based on protein domain content. This method identified universal domains, defined at the level of SCOP [13] superfamilies.
Dataset from Wang et al., 2007 [14] — 165 SCOP folds
This study attempted to identify the minimal proteome of LUCA by creating a phylogeny of 185 taxonomically diverse organisms using a quantitative classification system based on genomic surveys of protein domain content. A branch of this phylogeny was identified as the point at which LUCA diverged into the three domains of life. All terminal nodes deeper than this branch are considered to represent protein domains present in LUCA.
Dataset from Srinivasan and Morowitz, 2009 [15] — 286 EC codes
This study attempted to identify the set of metabolic reactions present in LUCA. Complete metabolomes of five autotrophic bacteria and one autotrophic archaeon were compared, and reactant-product pairs present in all six organismal datasets were predicted to have been present in LUCA.
Nucleotide cofactor usage
Enzyme functions that employ nucleotide-derived cofactors are predicted to reflect a prior state in which the same reaction was catalyzed by ribozymes [16]. Cofactors derived from nucleotides were identified through literature review from the complete pool of cofactors used in Uniprot annotations.
Amino acid cofactor usage
Enzyme functions that employ amino acid-derived cofactors are predicted to reflect the transition from ribozymes to protein enzymes as the primary catalytic molecules of life [16]. Cofactors derived from amino acids were identified through literature review from the complete pool of cofactors used in Uniprot annotations.
Iron-sulfur cofactor usage
Enzyme functions that employ iron-sulfur cofactors are predicted to reflect protobiological chemistry taking place on the surface of pyrite minerals [17]. Iron-sulfur cofactors were identified through literature review from the complete pool of cofactors used in Uniprot annotations.
Zinc cofactor usage
Enzyme functions that employ zinc cofactors are predicted to reflect protobiological chemistry catalyzed by zinc ions [18]. Zinc cofactors were identified through literature review from the complete pool of cofactors used in Uniprot annotations.
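The cofactor-based characteristics can be thought of as labels assigned to a protein whenever its annotated cofactors overlap a curated class. The following Python sketch illustrates that lookup under stated assumptions: the class membership lists are a few well-known examples chosen for illustration, not the curated lists produced by the literature review, and the protein IDs, annotations, and the function `classify` are hypothetical. The amino acid-derived class is omitted here and would be handled the same way with its own list.

```python
# Illustrative cofactor classes (assumed examples, NOT the curated lists).
COFACTOR_CLASSES = {
    "nucleotide": {"NAD", "FAD", "coenzyme A"},
    "iron_sulfur": {"[4Fe-4S] cluster", "[2Fe-2S] cluster"},
    "zinc": {"Zn(2+)"},
}

# Hypothetical annotations: protein ID -> cofactor names from its annotation.
annotations = {
    "PROT_1": ["FAD", "[4Fe-4S] cluster"],
    "PROT_2": ["Zn(2+)"],
    "PROT_3": ["Mg(2+)"],
}


def classify(cofactors, classes=COFACTOR_CLASSES):
    """Return the sorted class labels whose members overlap the cofactors."""
    names = set(cofactors)
    return sorted(label for label, members in classes.items() if names & members)


flags = {protein: classify(cofs) for protein, cofs in annotations.items()}
print(flags["PROT_1"])  # ['iron_sulfur', 'nucleotide']
```

A protein may carry several labels at once, and one whose cofactors fall outside every class receives none.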