Introduction

Living and non-living molecular phenomena (or those of ‘prelife’ (Nowak and Ohtsuki 2008)) are only separated chemically by complexity and the unique metabolic, replicative and evolvable abilities endowed by the physico-chemical properties of biological systems. Many attempts at comprehensively defining life have been made with an aim towards universal applicability (Benner 2010; Gayon 2010; McKay 2004; Oliver and Perry 2006). Such efforts indirectly ask whether there are underlying features of life which conform to universal laws of ‘general biology’, in the same manner as can be said for chemistry and physics (Benner et al. 2004; Blumberg 2011). A widely applied and respected definition invokes replication and Darwinian evolution as key characteristics which encompass the most essential aspects of life (Cleland and Chyba 2002). In common with other property-based definitions, this ‘substrate independent’ formula does not specify underlying chemical processes, and how they might be generalized for biosystems anywhere.

It is an often-noted fact that our experience and familiarity with the fundamentals of living biosystems is limited to a sample size of one: existing life on Earth. The evolutionary linkage of known life to a Last Universal Common Ancestor is clearly demonstrable at the molecular level (Goldman et al. 2013), where the most basic systems have been evolutionarily maintained. While this deep conservation might in principle derive from absolute necessity, it is equally well interpreted as resulting from basic processes (such as translation via ribosomes and the genetic code) becoming evolutionarily ‘locked’ in specific configurations. In this view, once fundamental biosystems (and the molecules mediating them) have become established, they cannot significantly diversify owing to the impracticality of selecting alternatives of competitive fitness. Since comparative studies with a completely independent biology are not currently possible, it follows that the ubiquity of familiar biomolecules on Earth is a poor basis for the inference that such molecules themselves are necessarily universal. Yet beyond known molecular detail, there are higher-order principles of chemical organization and molecular function where generalizations may carry logical force. In this respect, it has been repeatedly acknowledged that life is universally likely to require macromolecules composed of specific building blocks (Bains 2004; Pace 2001). These nucleic acid or protein subunits are frequently and conveniently referred to as ‘molecular alphabets’ (Philip and Freeland 2011; Szathmáry 1992), reflecting the potential for their combinatorial assortment and concatenation into a vast number of possibilities. (‘Concatenation’ is used in this context to refer to the formation of either short oligomers or long polymers from alphabetic monomeric subunits.)

Here the association of molecular alphabets and biology is emphasized through a detailed examination of the postulate that any form of complex chemically-based life cannot exist or evolve by any means other than at least one molecular alphabet, whose members link into linear concatemers with functional properties. To properly appraise this proposal, it is important to note how macromolecules derived from molecular alphabets can be grouped into distinct classes primarily based on how they replicate, irrespective of the specific chemical natures of their subunits. Although in known biosystems the importance of alphabets has been frequently noted, their role as a potentially universal aspect of complex life appears to have received relatively little detailed attention.

In this context, it is necessary to highlight the precedents provided by life on Earth, but from this springboard it is possible to infer which aspects of molecular alphabets have universal applicability. It should be emphasized that here the fundamental requirements for biological system complexity are under scrutiny, rather than the actual mechanisms for the transition from pre-life to living systems. Most significantly, functional repertoires required for the operation of increasingly complex biosystems necessitate escalating molecular size and complexity. It is apparent that these conditions are only feasibly satisfied through the evolution of molecular alphabetic concatemers, arguing in turn for the fundamental requirement for molecular alphabets in the progression and evolution of any chemically-based complex life.

Nature of Molecular Alphabets

In order to evaluate the role of molecular alphabets in biosystems as potentially universal phenomena, it is useful to initially review how such alphabets are defined in general terms, and to consider how they may be grouped into different functional classes.

Alphabetic Definitions

An ‘alphabet’ in the molecular sense generally refers to the defined set of monomers from which functional oligomers or polymers are derived, in an analogous fashion to stringing individual letters together in the correct sequence to form meaningful words. But unlike the latter, the chemical manner in which molecular alphabetic monomers are joined together is also significant, as well as the sequence of a string itself. Consequently, it is always necessary to compare a set of alphabetic monomers with their concatenated functional forms. For the present purposes, analyses of ‘molecular alphabets’ will thus include both the monomeric subunits directly constituting an alphabet, and their joined concatenated products.

The following statement serves as a practical generalizable definition: ‘A biological molecular alphabet consists of a specific set of relatively small distinct molecules which by means of a templating process can covalently concatenate into a large number of combinatorial alternative oligomers or polymers with specified sequences, of essential informational and/or functional significance for biosystem operations’. Generalizable properties inferred from these known biological molecular alphabets are listed in Table 1, as opposed to ‘non-alphabetic’ molecules of any size and complexity.

Table 1 General properties of oligomers or polymers (concatenates) derived from known natural molecular alphabets (free monomers), as contrasted with ‘non-alphabetic’ molecules (any other molecules not derived from specific monomeric ‘alphabetic’ subunits)

Replicative Classes of Alphabets

The molecular alphabet of nucleic acids enables replication of all life on this planet, by virtue of direct nucleobase complementarity, while the entirely different alphabet of amino acids which constitute proteins is encoded by nucleic acid digital linear sequences. Transformation of this information into folded proteins, whose functional interactions have analog character, can be viewed as a digital-to-analog conversion process (Goodwin et al. 2012). The replication of functional forms specified by the protein alphabet is thus inextricably linked to the copying of the nucleic acid strands which specify their polypeptide sequences. For the purpose of making generalizations it is therefore useful to distinguish between ‘primary’ or replicative (information-transmitting) molecular alphabets, and secondary (functional) alphabets encoded by molecular templates of the replicative alphabet itself.

Nucleic acids in existing biosystems cannot replicate without catalytic feed-back from proteins, and this ‘entanglement’ is another feature of known biological molecular alphabets. But it is widely accepted that a precursor to the familiar DNA-RNA-protein biology was once based on RNA alone, referred to as the RNA World (Joyce 2002). Prompted by initial observations of RNA catalysis (Gilbert 1986), the RNA World model was considerably supported by the finding of ribozyme-mediated peptidyl transferase activity in all ribosomes (Steitz and Moore 2003). Biosystems where RNA molecules combine the tasks of informational storage, replication, and functional catalysis can be considered monoalphabetical in their fundamental constitution. Despite the entanglement of nucleic acids and proteins in modern biosystems (including viruses with RNA genomes and the simplest RNA viroids), the putative existence of the RNA World emphasizes the primacy of the nucleic acid alphabet for terrestrial biology.

Since proteins replicate only through nucleic acid encoding, the familiar protein alphabet can be regarded as secondary in a replicative sense to the primary digital alphabet of nucleic acids (Fig. 1A). The schematic generalizations of Fig. 1 use a binary alphabet for simplicity, but a real binary nucleic acid alphabet exhibiting template and secondary structure-enabling complementarity has been shown to be compatible with specific catalysis (Reader and Joyce 2002). While a primary alphabet must accommodate both function and replication, the products of secondary alphabets (Fig. 1-B1 and -B2) are free of replicative constraints, and can explore the complete range of functional space permitted by the collective properties of alphabet members (Footnote 1).

Fig. 1

Schematic illustration of generalizable classes of alphabets and their inter-relationships. A: An alphabet with primary status owing to its direct replicability, here depicted with a binary pair of members acting as digital self-complementary units for templated replication. Specific concatenated strings of such an alphabet can also fold into structures with catalytic or ligand-binding properties. B1: A secondary alphabet (dependent on the primary alphabet for replication) depicted with three distinct members, which are templated through linkage with specific triplet blocks, in turn assembled on a primary alphabet complementary template. (This simplified depiction does not include splitting of the primary alphabet into informational and functional/templating subalphabets). B2: As for B1, but representing a different secondary alphabet, with distinct members and encoding, but still using an equivalent digital adaptational templating mechanism. C: A distinct secondary alphabet assembled through primary alphabet templates, but where the ligands are not encoded through linkage with self-complementary blocks, but rather directly recognized through structural folds of the primary template

In principle, the use of digital adapters (as with aminoacylated tRNA molecules for protein synthesis) could enable encoding of more than one entirely discrete secondary alphabet, using distinct codons and a completely separate set of alphabet members (depicted in Fig. 1-B2). Although there is no natural precedent for this, an artificial parallel can be found in the use of arbitrary ‘codons’ to encode specific chemical groups for directed assemblies of libraries for functional selection with linked encoding DNA strands (Halpin and Harbury 2004).
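The adapter-based encoding described above can be illustrated with a minimal Python sketch. It assumes a binary primary alphabet (as in Fig. 1) and two hypothetical adapter tables, `ADAPTERS_B1` and `ADAPTERS_B2`, whose triplet blocks and member symbols (X, Y, Z and P, Q, R) are invented purely for illustration. Each adapter carries a triplet block that pairs with a complementary triplet on the primary template, and the two adapter sets use non-overlapping triplets, so a single template can specify two discrete secondary products.

```python
from typing import Mapping

# Binary complementarity of the primary alphabet (0 pairs with 1), as in Fig. 1A.
_COMPLEMENT = str.maketrans("01", "10")


def complement(strand: str) -> str:
    """Return the complementary strand of a binary primary-alphabet string."""
    return strand.translate(_COMPLEMENT)


# Hypothetical adapters: key = triplet block carried by the adapter,
# value = secondary-alphabet member linked to it. Symbols are illustrative only.
ADAPTERS_B1 = {"111": "X", "100": "Y", "010": "Z"}
ADAPTERS_B2 = {"110": "P", "001": "Q", "000": "R"}


def decode(template: str, adapters: Mapping[str, str]) -> str:
    """Read a primary template in non-overlapping triplets.

    A secondary-alphabet member is added whenever an adapter block is
    complementary to the template triplet; triplets with no matching
    adapter in this set are skipped.
    """
    product = []
    for i in range(0, len(template) - 2, 3):
        block = complement(template[i:i + 3])
        if block in adapters:
            product.append(adapters[block])
    return "".join(product)


# One primary template carrying interleaved triplets for both secondary alphabets.
TEMPLATE = "000001011110101111"
print(decode(TEMPLATE, ADAPTERS_B1))
print(decode(TEMPLATE, ADAPTERS_B2))
```

In this toy arrangement the separation of the two secondary alphabets rests entirely on the adapter tables sharing no triplet blocks, which mirrors the requirement for distinct codons noted in the text.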

Alternative Modes of Replicative Templating

The templating arrangement in Fig. 1A depicts complementarity based on weak interactions between simple subunits of a binary alphabet, easily digitally represented. Yet the underlying units comprising a digital code can be of an arbitrary nature, provided they are used as discrete blocks. Structural recognition units could accordingly be used to encode a replicable template, as depicted in Fig. 2A. Self-replication via structural recognition has in fact been demonstrated for short block sequences of peptides (Ghosh and Chmielewski 2004; Lee et al. 1996) based on interhelical interactions. Although each interactive pair in such arrangements can be considered to have analog character (where folded structures form shape-based recognition surfaces) the templating process itself nevertheless can be equally well represented digitally as with simple monomers (as in Fig. 2B).

Fig. 2

Replication models of concatemers of a simple binary complementary alphabet. A: Structural (‘analog’) templating with folded motifs. Structure 0 recognizes a structurally complementary motif 1, which then behaves as a ‘shape mimic’, analogously to antibody anti-idiotype recognition. Both structures are composed of smaller interactive subunits, the sequence of which determines strand folding and the mutual recognition process. If a string of 1 and 0 motifs is concatenated in such a way that their recognition properties are preserved, then a complementary string can be produced by binding and joining of the matching motif sequence. L = ‘leftward’ ends of monomer strings. B: Templating through binary alphabet self-complementarity. Both alphabet members can be treated as direct digital units and encoded accordingly (0/1), and replication occurs by single-member alignment and linking. Both A and B can thus be digitally encoded (1/0), but differ in the relative complexities of each digital unit
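As a rough illustration of the two templating modes in Fig. 2, the following Python sketch treats both as the same digital copying operation and differs only in how many underlying monomers each digital unit contains. The motif compositions in `MOTIFS` are hypothetical placeholders, and strand orientation is ignored for simplicity.

```python
# Pairing rule for a binary self-complementary alphabet (Fig. 2B).
PAIRING = {"0": "1", "1": "0"}


def template_copy(strand: str) -> str:
    """Produce the complementary strand by unit-by-unit alignment and linking."""
    return "".join(PAIRING[unit] for unit in strand)


# Hypothetical structural motifs (Fig. 2A): each digital unit is itself a
# folded block built from simpler subunits. The compositions are illustrative.
MOTIFS = {"0": "abba", "1": "baab"}


def monomer_count(strand: str, motifs=None) -> int:
    """Count underlying monomers needed to build a strand.

    With motifs=None each digital unit is a single monomer (Fig. 2B);
    otherwise each unit expands to the subunits of its motif (Fig. 2A).
    """
    if motifs is None:
        return len(strand)
    return sum(len(motifs[unit]) for unit in strand)


strand = "0110"
copy = template_copy(strand)
print(copy, template_copy(copy))  # two rounds of copying regenerate the original
print(monomer_count(strand), monomer_count(strand, MOTIFS))
```

The digital description of the two schemes is identical, while the structural scheme requires more monomers per replicated unit, which is the sense in which the two models differ in the complexity of each digital unit.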

But both real and hypothetical structural replication scenarios (Fig. 2A) carry the restriction that their basic replicative units are themselves relatively complex and composed of underlying simpler monomers. For efficient replication to occur, evolution of synthetic machinery is required to create the ‘starting blocks’ in either case, on top of synthesis of monomers themselves. Such complexity constraints indicate that even if a folded ‘structural alphabet’ (with functional units based on structural complementarities) is feasible, a primary replicative alphabet based on simple and linear complementarity between monomers would have a strong selective advantage through reduced energetic demands. A binary self-complementary alphabet can form functional structures (Reader and Joyce 2002) that could potentially be used for replicative templating based on folded structural motifs (Fig. 2A), but such self-complementarity itself can be used for simpler and linear replication (Fig. 2B).

A primary replicative alphabet could in principle encode a secondary alphabet not via digital adapters, but by means of direct structure-based recognition through ordered concatenation of specific folding motifs (Fig. 1C). Although such interactions might possess the ‘shape mimic’ analog character of Fig. 2A, it should be noted that this hypothetical arrangement is a secondary templating effect, distinct from the role of a primary replicative molecular alphabet. No precedent for this with natural RNA molecules has been defined, although analogous processes have been postulated in biological circumstances such as the templating of complex carbohydrate syntheses (Dunn 2011).

Subalphabets

With the exception of certain RNA virus families, life on Earth has divided primary functional and informational templating between the two distinct, but closely related polymers of RNA and DNA. Relative to the original RNA, this splitting has involved changes in corresponding alphabetic monomers such that an altered DNA backbone is produced (loss of 2′-OH groups), and methylation of a sole complementarity-determining group (uracils becoming thymines). This transition can be viewed as the formation of a ‘subalphabet’, since RNA and DNA sequences can still hybridize based on conventional hydrogen bonding. Were this not the case, two distinct digital alphabets would then exist.

In the RNA World hypothesis, the emergence of DNA necessarily post-dates the existence of biosystems made possible by both the informational and functional potential of RNA molecules. This position is consistent with the observation that while specific single-stranded DNA molecules with catalytic or binding functions can be artificially selected (Breaker 1997), natural precedents for this have never been found. Most proposals for the origin of DNA concern specific chemical advantages relative to RNA (Butzow and Eichhorn 1975), or the improved replicative fidelity of DNA as an informational storage polymer (Lazcano et al. 1988). Also, double-stranded RNA genomes may be practically limited in size, for fidelity-related reasons (Poole et al. 2000).

Regardless of the cogency of these arguments, they cannot generalize evolutionary constraints forcing the subdivision of replicative templating and function by any primary alphabet, since the underlying premises are wholly based on specific properties of DNA vs. RNA. But more basic selective forces promoting the formation of subalphabets may exist. It has been proposed that the original DNA replicators were viral, whose parasitic advantage drove their host cellular RNA genomes towards a DNA transition themselves through acquisition of viral sequences (Forterre 2005, 2006). Genomes which serve as both informational templates and the source of catalytic function may have an inherent inefficiency in the need for replicases to act in dual replicative/template roles, providing ‘template-only’ parasites with an intrinsic advantage (Takeuchi et al. 2011). Splitting of the original dual-aspect replicative/functional primary alphabet then removes this parasitic niche. Since the development of parasitic replicators is an inherent feature of modeled replicative ecosystems (Bresch et al. 1980) and is certainly ubiquitous in terrestrial biology, arguments founded on this premise (Forterre 2006; Takeuchi et al. 2011) clearly have a greater potential general applicability than ones based purely on specific chemistries.

Molecular Complexity and Alphabets

To appraise the association of molecular alphabets and life, it is necessary to briefly consider how biological molecular requirements can be satisfied. A major inherent requirement in life systems is high interaction specificity on the part of participating biomolecules of great diversity and very high local concentrations. It is also obvious that a variety of functional activities must be performed to provide metabolic energy, molecular building blocks, and replication. To attain workable efficiency, these functions must be co-operatively integrated as dynamic systems, necessitating regulation at multiple levels. In known biosystems, system complexity is associated with underlying molecular complexity, which correlates in turn with increasing molecular size.

Biological catalysis, for example, is generally associated with macromolecules. The smallest natural monomeric enzymes contain approximately 100 amino acid residues (human acylphosphatase has 99 residues (Stefani et al. 1997); the E. coli equivalent 92 residues (Ramazzotti et al. 2006)), or sizes of 10 kD to a first approximation (Footnote 2). With respect to small nucleic acid catalysis, a ribozyme of only 29 bases with combined aminoacyl- and peptidyl-RNA synthetic activity (Illangasekare and Yarus 1999), and a DNAzyme with ribonuclease activity of only 28 bases (Santoro and Joyce 1997) have been artificially selected. While remarkably small, these single-stranded catalysts still have molecular weights of approximately 9 kD.
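The approximate sizes quoted above can be checked with a short Python calculation. The average residue masses used here (about 110 Da per amino acid residue and about 330 Da per nucleotide residue) are common rules of thumb assumed for this sketch; they are not values given in the text, and real masses depend on composition.

```python
# Rule-of-thumb average residue masses in daltons (assumptions for this sketch).
AVG_AMINO_ACID_RESIDUE_DA = 110.0
AVG_NUCLEOTIDE_RESIDUE_DA = 330.0


def approx_mass_kd(n_residues: int, avg_residue_da: float) -> float:
    """Estimate macromolecular mass in kD from residue count."""
    return n_residues * avg_residue_da / 1000.0


examples = {
    "human acylphosphatase (99 residues)": (99, AVG_AMINO_ACID_RESIDUE_DA),
    "E. coli acylphosphatase (92 residues)": (92, AVG_AMINO_ACID_RESIDUE_DA),
    "29-base ribozyme": (29, AVG_NUCLEOTIDE_RESIDUE_DA),
    "28-base DNAzyme": (28, AVG_NUCLEOTIDE_RESIDUE_DA),
}

for name, (n, avg) in examples.items():
    print(f"{name}: ~{approx_mass_kd(n, avg):.1f} kD")
```

Under these assumed averages the small enzymes come out near 10 to 11 kD and the small nucleic acid catalysts between 9 and 10 kD, consistent with the first-approximation figures cited.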

On the other hand, much smaller artificial peptides (<500 D) have been shown to catalyze specific reactions (Revell and Wennemers 2007), and recent evidence has indicated unanticipated catalytic abilities of certain small organic compounds, including some amino acids (proline in particular) (MacMillan 2008; Movassaghi and Jacobsen 2002). Certain biological activities catalyzed by well-known enzymes (such as the aldol reaction (MacMillan 2008)) may thus also be driven by small molecules, and it is indeed an important question as to the extent to which low molecular weight organocatalytic mediators (including ‘aminozymes’) functionally operate in existing biosystems (Barbas 2008). But even if certain ‘orphan’ biocatalyses (activities for which no matching genomic sequences have been assigned (Lespinet and Labedan 2006; Pouliot and Karp 2007)) prove to be mediated by small molecules rather than enzymes, and aminozymes or equivalents contribute to catalyses already known to be driven enzymatically, the reliance of life processes on macromolecular complexity is not diminished.

Several observations contribute to this viewpoint. While some ‘model’ enzymatic reactions are relatively easy to catalyze, even by promiscuous reaction sites found in initially unexpected protein sources such as albumins, many others are far more demanding (Hollfelder et al. 1996; James and Tawfik 2001). Some enzymatic functions (such as polymerases, ligases or certain DNA sequence-specific binding proteins) necessitate the binding and juxtaposing of multiple substrates and cofactors over distances of many tens of angstroms, an improbable feat for low molecular weight compounds. Even if these issues are disregarded, other aspects of macromolecular catalysis are highly significant. Domains of a large protein enzyme remote from the relatively small catalytic center may substantially contribute towards catalytic performance (Benkovic and Hammes-Schiffer 2003; Eisenmesser et al. 2005), sometimes even promoting ‘superefficiency’, or catalysis accelerated beyond the limiting diffusion rate (Stroppolo et al. 2001). Protein regions not directly associated with catalysis itself may nonetheless have crucial roles in catalytic regulation through signaling sites for secondary modifications, or allostery mediated by ligand binding (Fenton 2008). Multiple layers of regulation allow single catalysts to be deployed in many different contexts in complex organisms, a regularly observed manifestation of the many-layered modularity of genomic information packing (Weiss 2005). A multi-functional molecule is accordingly hard to design without recourse to increasing structural complexity. Although these considerations have protein enzymes in mind, they logically apply to ribozymes as well, even though the RNA World can only be approached by experimental attempts to recapitulate the appropriate RNA-based catalysts. In this regard, it can be noted that recent progress towards the development of ribozyme polymerases would suggest that a molecule of substantial size will be required to provide the necessary efficiency (at present 189 bases; Wochner et al. 2011).

It must be accordingly concluded that although numerous types of catalyses can certainly be effected by small organic molecules, any form of chemically-based life of ascending complexity will inevitably become critically dependent on catalytically demanding activities, complex catalytic regulation, and multi-layered intermolecular signaling, which only large molecules can deliver.

At the same time, known biosystems are also critically dependent on molecules which are small in comparison to most proteins. Though of relatively low molecular weights, many of these require complex syntheses, as exemplified by chlorophylls, heme, cobalamin, antibiotics, and numerous others. Some such molecules may have structural similarities (for example, heme and chlorophyll), but in the main, they are very structurally distinct. The syntheses which serve to create them are also diverse, and require highly specialized catalytic tools, sometimes in the form of a concerted set of catalyzed reactions in a series. These processes are fulfilled through the agency of protein enzymes, derived of course from a specific molecular alphabet. And the largest known natural molecule which is not a repetitive polymer or derived from an alphabet (maitotoxin; 3,425 Da (Nicolaou and Aversa 2011)) is much lower in molecular weight than even a modest 10 kD protein. In contradistinction to the logical conundrums posed by the synthesis of uniquely complex molecules, large molecules based on molecular alphabets offer solutions.

Attaining Molecular Complexity

Measures of complexity can be difficult to formalize, and are dependent on the system level under analysis (ranging from molecules to biosystems, organisms, and societies) (Adami 2002). For present purposes, it will be useful to focus on formal complexity at the molecular level, here considered in terms of the number of distinguishable catalysts required for synthesis of discrete covalently bonded molecules from simple precursors. ‘Distinguishable catalysts’ in this context are defined as biological molecular agents (usually protein enzymes, but including ribozymes and in principle small organocatalysts) performing defined catalytic tasks which can be demarcated by their specificities of substrate recognition and product formation. For the present purposes of complexity estimation, such discrete tasks can be mediated by either separate molecules, distinct sites on the same macromolecule, or via the concerted agency of multi-subunit complexes. In addition, given that enzymatic promiscuity is a widespread phenomenon (Khersonsky and Tawfik 2010), some catalysts will overlap in this categorization, but a biomolecular synthesis will nonetheless require a minimal number of discrete catalytic entities.

For example, biosynthesis of a linear or branched polymer of a monosaccharide carbohydrate (such as glucose-derived glycogen or starch) by a specific sequential pathway (P) requires a definable number of catalysts (CP) to generate the sugar monomer, and then a limited number of catalysts (k) for polymerization, where k includes catalysts necessary for the synthesis of small-molecule cofactors and polymerization itself. If Npol is the number of distinct catalysts involved in total to produce the polymer from simple precursors, then an obvious conclusion is that Npol = CP + k. In principle, and certainly in practice in complex biological systems, alternate routes can exist for both monomer synthesis (such as photosynthetic vs. neoglucogenic pathways for glucose monosaccharides) and polymerization itself (as exemplified by starch biosynthesis; (Kaminski et al. 2012)). Likewise, use of alternate cofactors (along with their attendant syntheses) by a polymerizing catalyst could change the k term, but in all such alternate pathways, polymer production must be based on a definable number of catalytic agents.

The synthetic requirements for simple repetitive polymers can then be compared with polymers derived from molecular alphabets. Although synthesis of known alphabetic biopolymers is based on digital templates and requires highly complex biocatalyses, the same relationship exists between a core set of catalysts (Footnote 3) required for activated monomer and template syntheses, followed by repeated rounds of catalysis required for polymerization. In both cases, then, the number of required distinguishable catalysts is independent of the final product molecular weight, although for alphabetic molecules the value of CP (the production of suitably activated or adapted monomers and templates by a specific pathway) is much higher than for a simple polymer.
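The relationship Npol = CP + k, and its independence from product size, can be sketched in Python as follows. The catalyst labels and set sizes are hypothetical placeholders chosen only to make the counting concrete; representing the catalysts as sets is an assumption of this sketch that lets any catalyst shared between the monomer pathway and polymerization be counted once.

```python
def n_pol(monomer_catalysts: set, polymerization_catalysts: set) -> int:
    """Number of distinguishable catalysts for a polymer from simple precursors.

    Equals C_P + k when the two sets are disjoint; a catalyst appearing in
    both roles is counted only once.
    """
    return len(monomer_catalysts | polymerization_catalysts)


def bonds_formed(polymer_length: int) -> int:
    """Inter-monomer bonds in a linear polymer of the given length."""
    return max(polymer_length - 1, 0)


# Hypothetical pathway: C_P = 3 monomer-synthesis catalysts, k = 2 for polymerization.
monomer_pathway = {"m1", "m2", "m3"}
polymerization = {"p1", "p2"}

for length in (10, 1_000, 100_000):
    # The catalyst count is the same at every length; only the number of
    # repeated catalytic rounds (bonds formed) grows with product size.
    print(length, n_pol(monomer_pathway, polymerization), bonds_formed(length))
```

The same counting applies to a templated alphabetic polymer; only the size of the monomer and template pathway set would be larger.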

For non-alphabetic molecules, since synthesis is not directed by the use of repetitive subunits or members of an alphabetic set, there is a high probability that the necessary number of distinguishable catalysts will rise with molecular weight of the product. In this respect, it must be noted that the huge diversity of even small organics (estimated at 10^180 for those <1,000 Da alone; cited in Gorse 2006 and Warr 2011 as communicated from D. Weininger) indicates that no simple relationship between synthetic catalytic requirement and molecular weight should exist in the conventional low-molecular weight size range. In addition, biosynthesis of some relatively large natural products may require a lower number of distinguishable catalyzed reactions than initially apparent, through the agency of reaction cascades (such as conversion of multiple epoxy groups into cyclic ether rings, as proposed for a range of natural polyether toxins (Vilotijevic and Jamison 2009)). Nevertheless, a broad positive correlation is predictable between the size of a non-alphabetic (non-repetitive) molecular mediator and the catalytic diversity required for its synthesis. Attainment of the molecular complexity required by biosystems through non-alphabetic processes will thus be associated with an escalating cost in providing for increasing synthetic diversity, applicable in turn to the synthesis of each catalyst itself.

Complexity Regression and Biosystems

Can the necessary large-scale molecular complexity required for fully operational life processes be achieved by some other means than molecular alphabets? Acknowledging that functional and regulatory demands will inevitably require large molecules does not in itself force the molecular alphabetic concept. In principle, ‘non-alphabetic’ molecules (any molecules synthesized from non-repetitive components) could be used, provided these were of sufficient complexity to perform the required tasks. As noted above, this stipulation would inevitably require that at least a subset of these molecules would become very large relative to the usual boundaries of ‘small’ molecules. Attendant with this complexity is the associated growth in synthetic complexity. If synthesis of a complex molecule N required a sequential series of modifications to increasingly complex precursors A ➔ B ➔ C ➔ … ➔ N, by a series of catalysts iA through iN-1, then as each precursor product increases in complexity (especially with increasing numbers of chiral centers), so too must increase the total number of reaction steps and the general catalytic demands on the complete synthesis, in order to ensure specificity. Such problems might be solved with increasingly complex catalysts, but this in turn places the same demands on the synthesis of each catalytic agent itself, and so on for the catalysts required to create the secondary catalysts, leading to an indefinite regression of complexity. (This is similar to the limitation faced by structurally-based replication schemes, as depicted in Fig. 2A). Low-molecular weight organocatalysts might enable some reaction steps (especially during early stages in abiogenesis), but there is no evidence to suggest that this could be the basis of a highly complex biosystem.

Alphabets, Information Content, and Evolvability

Though both structurally repetitive molecules such as glycogen and any biologically functional polypeptide are definable as biopolymers, they are profoundly different in their respective information contents. In this context functional information (I; as binary bits) is given by I = -log2(P), where P is the probability that a random sequence will encode a molecule with greater than any given degree of sequence-defined function (Szostak 2003). For a specific population of molecules, the probability of finding (selecting) function can be expressed as P = Nf/Nt, where Nf is the number of molecules above a functional threshold, and Nt is the total population size. The biological role of glycogen as energy storage relates to its polymerization itself, rather than to any specific arrangement of its periodic monomers. Accordingly, polydispersed populations of glycogen molecules are not functionally distinguishable (Nf = Nt); therefore by this criterion the glycogen ‘information content’ I = 0. But for an aperiodic alphabetic polymer, evolutionary selection for a threshold function is followed by cumulative selection for increasingly refined functional properties, where the functional information term is indeed greater than zero.
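The functional information measure above can be written as a small Python function. This is a minimal sketch of I = -log2(Nf/Nt); the population sizes in the example are arbitrary illustrative numbers, not data from the cited work.

```python
import math


def functional_information(n_functional: int, n_total: int) -> float:
    """Functional information in bits: I = -log2(Nf / Nt).

    Computed as log2(Nt) - log2(Nf), which is the same quantity and also
    works for very large integer population sizes.
    """
    if not 0 < n_functional <= n_total:
        raise ValueError("require 0 < Nf <= Nt")
    return math.log2(n_total) - math.log2(n_functional)


# Glycogen-like case: every molecule in the population is functionally
# equivalent (Nf = Nt), so the information content is zero.
print(functional_information(500, 500))

# Aperiodic alphabetic polymer, illustrative numbers: one sequence in 1024
# exceeds the functional threshold.
print(functional_information(1, 1024))
```

Raising the functional threshold lowers Nf for a fixed population, so the value of I increases as selection refines function.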

Once the baseline ‘investment’ in the production of monomers and biosynthetic machinery is made, the simplest form of molecular complementarity enabling regular templating will be energetically favored for replication of alphabetic molecules. A replicative code based on folded structural recognition (as depicted in Fig. 2A) will thus require more information to specify than a simple complementarity-based code between individual members of the same alphabet (Fig. 2B). This can be reframed by proposing that for any complex system progressing towards true self-sustaining life capable of evolution, the most algorithmically simple solution will prevail, and this mandates the use of primary molecular alphabets with the simplest workable inter-subunit complementarity. Only the latter can provide the means for feasibly accessing selectable functions for cumulative improvements by successive reiterations.

Experimentation has clearly shown that selecting for function from combinatorial variants of relatively short folded RNA strands is feasible, since functions above a defined threshold of activity will tend to be associated with a set of related sequences rather than a unique combination (Szostak 2003). Combinatorial assortment is an inherent possibility arising during the replication of templated biological molecular alphabets, allowing in turn the possibility of natural selection for molecular variants with improved fitness. Recapitulation of this facility with non-alphabetic systems leads to the complexity regression trap outlined above. By way of comparison, defined functions can indeed be obtained by screening large artificial libraries of low-molecular weight non-alphabetic compounds, often with diverse scaffold structures (Golebiowski et al. 2003; Haggerty et al. 2011). However, each such compound requires its own synthetic pathway, requiring far greater complexity in total than that needed for the generation of a combinatorial library of templated alphabetic molecules.

The cumulative selection of functionally informative molecules is an essential requirement for evolvability. It is necessarily a generalizable observation that any molecular alphabet enabling the formation of functional polymers must equally well be capable of forming a much larger range of non-functional macromolecular concatenates. This is reflected by the often-noted fact that the usual protein alphabet allows the potential formation of hyperastronomical numbers of combinatorial forms even for relatively small polypeptides (for example, a small protein of 100 residues could have 20^100 potential sequences). Of these, only a minute fraction will be soluble, acquire stable folds, and exhibit functional properties. Even for a much smaller alphabet of 4 members (as with nucleic acids), the combinatorial explosion as the total size of a concatenate increases results in a huge number of alternative forms, the vast majority of which will lack any known function. The information implicit in specific protein or nucleic acid sequences via their folded states is then realized through evolutionary selection for function.
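The scale of the combinatorial explosion described above (20 to the power 100 sequences for a 100-residue protein, and 4 to the power 100 for a nucleic acid of the same length) can be checked directly with exact integer arithmetic. The following is a minimal sketch; the helper names sequence_space and bits_to_specify are hypothetical and introduced here only for illustration.

```python
import math


def sequence_space(alphabet_size: int, length: int) -> int:
    """Number of distinct linear concatenates of a given length."""
    if alphabet_size < 1 or length < 0:
        raise ValueError("alphabet_size must be >= 1 and length >= 0")
    return alphabet_size ** length


def bits_to_specify(alphabet_size: int, length: int) -> float:
    """Bits needed to single out one sequence: length * log2(alphabet size)."""
    return length * math.log2(alphabet_size)


protein_space = sequence_space(20, 100)   # 20-member protein alphabet
nucleic_space = sequence_space(4, 100)    # 4-member nucleic acid alphabet

# Python integers are arbitrary precision, so both counts are exact.
print(len(str(protein_space)), "decimal digits for the protein case")
print(len(str(nucleic_space)), "decimal digits for the nucleic acid case")
print(bits_to_specify(20, 100), bits_to_specify(4, 100))
```

For 100-member concatenates this gives roughly 1.27 × 10^130 protein sequences and 1.61 × 10^60 nucleic acid sequences. Even the smaller alphabet therefore yields a space far too large to sample exhaustively.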

Metabolic Origins, ‘Non-Alphabetic’ Life, and Self-Organization

For relatively simple systems, complexity regression problems may be escapable by cycling of an autocatalytic network, as postulated and modeled in ‘metabolic origin’ theories of abiogenesis (Kauffman 2007; Shapiro 2006). But there are limits to what levels of complexity might be achievable in this manner with real-world chemistry (Orgel 2008). If even the simplest monomeric natural enzymes require molecular weights on the order of 10 kD to function in integrated biosystems (as noted above), then the prospect of creating complex ‘non-alphabetic’ molecules of anything approaching this size is daunting.

In any case, the proposal that molecular alphabets should be a universal requirement for complex life is technically neutral with respect to either genetic or metabolic origin theories for life on Earth itself, or elsewhere. With the former stance, self-replicating genetic systems (necessarily involving at least simple alphabets) have temporal priority in abiogenesis, but even metabolic-origin theories are ultimately the means for initiating complex life based on alphabets, not the basis for development of complex life itself. (This is necessarily the case, since both theories attempt to explain the origin of the existing ‘sample’ of terrestrial life, which is observably alphabet-based). Likewise, whatever the mechanisms and pathways of abiogenesis on Earth, the outcome has converged on alphabetic biosystems. A key difference between the genetic and metabolic stances is thus whether molecular alphabets were an early or relatively late event in molecular evolution.

While autocatalytic self-replication of complementary blocks such as nucleotides may have been an important event in the origin of life (Sievers and von Kiedrowski 1994), an underlying issue for the development of any complex biosystem still remains. In essence, this could be framed as the question, “What are the limits of complex functions of autocatalytic chemical systems based on small molecules, especially given a requirement for self-replication?” Models for replication of autocatalytic sets as ‘compositional genomes’ have been devised (Segre et al. 2000), although these have been refuted on theoretical grounds (Vasas et al. 2010). The (usually) implicit assumption of metabolic-origin theories is that autocatalytic chemical cycles, however sophisticated, provide a necessary foundation upon which polymeric molecules mediating more complex molecular systems can develop. In some models, autocatalytic sets of monomers directly lead to the generation of functional oligomeric forms (as with amino acids to functional peptides (Kauffman 1986)). The transition from ‘prelife’ (Nowak and Ohtsuki 2008) to living systems can accordingly be envisaged as a transition from non-alphabetic processes to systems using (replicating) alphabets. A viable model for ‘non-alphabetic’ life of a comparable organizational order to observable cellular biology would require a radical extension of current theories of metabolic replicative systems.

Self-assembly and self-organization are key concepts in ‘metabolic’ models of abiogenesis, and self-assembling structures such as lipid membranes or supramolecular protein complexes are biologically ubiquitous. Yet these existing phenomena are only rendered possible by the fundamental alphabetic biology at the core of the systems of which they are components. Protein catalysis mediates the formation of self-assembling subunits, and proteins are intimately involved with self-organization of structures such as the Golgi apparatus (Kuhnle et al. 2010) or membranes (Simons and Sampaio 2011). Macromolecular folding itself exemplifies self-assembly programmed by interactions within a linear sequence string of alphabetic members, where function has been selected and cumulatively refined by natural evolution. Accordingly, there is no contradiction to the postulated universality of molecular alphabets in presuming that the capacity for self-assembly or dynamic self-organization manifested by certain cellular structures is also an essential aspect of biological systems.

Alphabetic Constraints

Concatenation of alphabetic monomers in specific sequences can create considerable folded functional diversity, but this process is not unlimited in its scope. Such functional range is constrained in part by the collective properties of the relevant set of alphabetic monomers themselves. An evaluation of the functional constraints on specific known molecular alphabets is useful in considering to what extent generic classes of alphabets may allow the development of complex biosystems.

If the primary requirements of biological molecular alphabets can be encapsulated as replication plus function, then the precedents of this planet demonstrate that a single alphabet can perform both, but within limits (Doudna and Lorsch 2005). Simple alphabets that facilitate replication by digital complementarity will tend to suffer in functional range through lack of diverse monomeric physico-chemical properties. Three logical alternative pathways to overcome this hurdle exist: (1) increase the size of the primary replicative/functional alphabet, (2) co-opt an entirely separate alphabetic system, or (3) delegate most functions to a subsidiary alphabet.

The first of these alternatives is the simplest in principle, and is certainly possible experimentally (Henry and Romesberg 2003; Piccirilli et al. 1990). Even if evolution retained a 4-member nucleic acid alphabet on this planet, would it not be reasonable to expect that entirely novel biosystems could use a single replicative/functional alphabet with an expanded repertoire for promoting diverse functions? Computational modeling has provided evidence that for low-fidelity replication systems (associated with the RNA World), a 2-base pair system is optimal, but larger (3-base pair) alphabets would be theoretically favored after the advent of the RNA-DNA-Protein World (Gardner et al. 2003; Szathmáry 1992, 2003). The retention of a 4-base nucleic acid alphabet is then seen as a classic example of a ‘frozen’ mechanistic state (dictated by former conditions) which is so integral to subsequent biosystem operation that it is not amenable to evolutionary change. Since proteins functionally supersede folded RNA molecules, there would be no selective pressure for such a change in the post-RNA World in any case (Szathmáry 1992).

A particular constraint for primary replicative alphabets with implications for fidelity stems from the supposition that increasing numbers of base pairs would inevitably be accompanied by increasing molecular similarity between each interactive pair (Szathmáry 1992), a trend which would impose an escalating cost towards replicative accuracy. This point was made with H-bonding partner moieties in mind, but recent experimental evidence suggests alternative (and more diverse) arrangements could be made, including hydrophobic complementarity (Hirao et al. 2006) and metal-ion stabilized interactions (Takezawa and Shionoya 2012). If such findings are extended towards any kind of informational digital macromolecular system, the issue of diversity between separate molecular pairs may thus not be a fatal limitation. But on the other hand, any form of inter-subunit complementarity must limit the possible diversity of the alphabet as a whole (Szathmáry 1992), since each molecular pair must necessarily exhibit a compatible non-covalent bonding match of some kind to enable complementarity in the first place. Even if inter-pair diversity is not a fundamental hindrance, inescapable intra-pair structural constraints then place functional limits on dual replicative/functional roles of any monoalphabetic system.

The second of the above possibilities for functional diversification from a primary alphabet (co-opting an entirely separate alphabetic system) suggests co-operation between two entirely distinct digitally-templated alphabets within a single biosystem. Apart from the sheer complexities of such an arrangement, both individual alphabets would still be subject to the same universal diversification strictures. Therefore, the third possibility of a subsidiary (secondary) alphabet (Fig. 1-B1-C) becomes the most logical proposal for functional diversification. Members of secondary alphabets escape the restrictions dictated by direct complementarity, and their concatenates are then free to explore a much wider range of structural alternatives.

Complexity Barriers and Alphabets

The grouping of alphabets into broad classes based on their replicative and functional properties has implications for the generalizable requirements for attaining increasingly complex biological organization. If primary alphabets replicating by digital complementarity are indeed always constrained in terms of their achievable catalytic diversities, then it follows that splitting of informational replication and biosystem functions will be selected for, provided relevant pathways are evolutionarily feasible. At the same time, these transitions imply that the evolutionary progression of molecular alphabets is of central significance in the enablement of biosystems to proceed to higher levels of complexity. The emergence of primordial replicative alphabets in the first place, whatever their molecular antecedents, is a logical starting point for the origin of Darwinian evolutionary processes. Subsequent developments in alphabetic division of labor and functional capacity were then crucial in permitting the evolution of more complex biosystems (Table 2). From this point of view, the RNA World itself was a complexity bottleneck from which an evolutionary escape route was provided through the formation of a subalphabet (DNA) and a secondary alphabet (proteins). It is important to note that citing complexity barriers of increasing ascendancy does not imply any direct evolutionary inevitability; only that on either side of a given barrier different levels of complexity are evident. If an evolutionary ‘escape route’ does become available, though, systems using such pathways acquire improved fitness and thereby predominate over competitors. Even if a trend towards increasing complexity accompanies heritable biological variation (for example, as hypothesized in the Zero-Force Evolutionary Law (Fleming and McShea 2013)), initial molecular evolution of replicable alphabets is a necessary precondition before higher-level complexity can be progressively attained.

Table 2 Generalized major complexity barriers associated with molecular alphabets in biosystem origins and evolution

Several key evolutionary transitions have been listed (Szathmáry and Smith 1995), one of which is the progression beyond the RNA World, but it is unlikely that all such transitions are subject to equal probabilities of occurrence. For example, the arrival of eukaryotic cells through symbiosis between an archaeal cell (as the most probable candidate) and α-proteobacterial ancestors of mitochondria and chloroplasts has been noted as a fundamental restriction in the enablement of the evolution of complex multicellular life (Lane and Martin 2010). The immensely long time lag between the origins of the first prokaryotic cells and the arrival of eukaryotes suggests that this kind of event was indeed a rate-limiting step in effect, heavily dependent on chance-based interactions (Lane and Martin 2010). In contrast, the comparatively rapid origin of basic cellular life could be interpreted as evidence for the relative ease of traversing alphabet-related complexity barriers (Table 2), despite their profoundly fundamental importance for progression of all terrestrial biosystem complexity.

While this view might indeed be correct, alphabetic evolutionary progression is likely to be linked with specific conditions in early terrestrial environments which favored not only the emergence of alphabetic systems per se, but also permitted their development across successive complexity hurdles. In the absence of further information, it cannot be ruled out that a relatively effective primary replicative alphabetic system could arise whose molecular nature rendered it refractory to further functional modifications (Table 2), at least through feasible natural evolutionary pathways. Such ‘blocked’ but effective replicators might in turn suppress (through resource competition or other means) the nascent development of potentially superior systems for the enablement of complexity. Obviously this did not occur on Earth, but it is another conceivable limitation on alternative life system development elsewhere in the universe, if transitions between primary and secondary alphabets are evolutionarily difficult for certain specific chemistries.

On the other hand, with present information it can be equally well argued that once a primary molecular alphabet of virtually any chemical composition has been established, pathways will exist for its subsequent progression and diversification (Table 2), where the ‘dead-end dominant replicator’ scenario is at best highly improbable. Even if different alphabetic chemistries result in varied probabilities for evolutionary escape from each complexity barrier, in this viewpoint all would progress eventually, and none would be blocked. Here the major complexity barrier falls back to the transition from ‘prelife’ to life (Nowak and Ohtsuki 2008).

Even if intricate self-sustaining autocatalytic metabolic nets are possible, and can naturally arise in the right environmental conditions, there is no prospect that these could evolve into increasing levels of complexity, as considered above, and amplified by theoretical studies (Vasas et al. 2010). A stable self-sustaining metabolic network would thus be ‘stuck’ at a limited complexity level unless it could progress towards a system employing a digitally-templated primary molecular alphabet. Finally, it may be noted that despite the versatility of the secondary protein alphabet, it is not without functional limitations. To a certain extent, organic or inorganic cofactors can be co-opted for functional augmentation (Table 2), in a manner analogous to that presumed to have occurred for nucleic acid functional molecules within the RNA World.

Universal Alphabets and Evolutionary Convergence

Most concepts of extraterrestrial life, and especially extrasolar manifestations of it, are based on the assumption that such biology must represent a completely independent design scheme from that known on Earth. From this, it must be surmised that if functionally analogous design features were to be found between terrestrial and non-terrestrial life, they would necessarily represent true evolutionary convergence in the strongest sense.

In contrast to such completely hypothetical ‘pure’ convergence, terrestrial evolutionary convergence has many more nuances. Some apparent cases of convergence have been challenged by molecular evidence showing underlying morphogenetic homologies. For example, eye evolution was until relatively recently ascribed as convergent in as many as 60 independent lineages, but now strong evidence for its essential monophyletic nature has emerged via the ancient master control element Pax6 and its target genes (Gehring 2010). Similar ‘deep homologies’ in other genetic regulatory circuitry provide a common tool-box for widely distinct lineages to ‘converge’ on analogous functional solutions (Shubin et al. 2009).

If truly independent terrestrial evolutionary convergence is rarer than previously thought, the probability of extrasolar life convergences (where no common genetic underpinnings are possible) might seem to accordingly diminish. Yet some kinds of convergence may be dictated through physical necessity, possibly including biological scaling ‘laws’ operating over both cellular and macroscopic ranges (West and Brown 2005). Apart from such postulated inescapable design influences, long-term evolutionary pathways are inherently subject to variable environmental contingencies which render them essentially unpredictable (Grant and Grant 2002), even in principle (Day 2012). Even with precisely controlled environments, specific mutational pathways which may confer beneficial phenotypes are not a priori predictable (Stanek et al. 2009). These caveats are obviously reinforced in the case of unknown extraterrestrial biospheres.

At a much more fundamental level, mathematical modeling has nevertheless indicated that certain features (including hypercycles and self-maintenance) will reproducibly emerge in self-organizing systems as abstract chemical principles (Fontana and Buss 1994). Chemical reactions are obviously influenced by physical factors (temperature, pressure, reactant concentrations) and the presence of chemical modulators (catalysts, alternative reaction participants), but arguments based on chemical valency and bonding architectures must be universal. Conditions deviating in the extreme from those on Earth (and during its early history) are far more likely to limit the types of polymers which can be formed than permit the formation of radically new ones with suitable stability.

It is thus argued here that certain types of molecular organization are logically necessary for any form of chemically-based life, in an analogous manner to physical constraints which enforce certain kinds of evolutionary convergence. Any emerging biosystem, by this reasoning, will converge upon evolved molecular alphabets and their concatenated derivatives as a universal pattern for attaining higher complexity. Even if unsuspected chemical configurations under novel conditions exist, it is hard to escape the conclusion that the concatenation of molecular alphabets in specific replicable sequences must be a defining feature of life anywhere in the universe.

Convergence at the fundamental level of molecular organization represented by molecular alphabets can be taken further to argue that digital primary alphabets are also necessary for both replication and function, and the enablement of effective molecular evolution itself, as noted above. Secondary encoded alphabets also appear to offer the most parsimonious route towards solving inherent efficiency problems of monoalphabetic systems. Subject to their solvent environments, secondary alphabets would likely possess members covering a range of different properties for charge, hydrophilicity, and hydrophobicity, in order to allow stable folding and packing of residues into functional forms.

The ‘choice’ of alphabet members based on such properties in general is likely to be as non-random as is found with the conventional protein alphabet (Philip and Freeland 2011). With respect to the protein alphabet, many alternative amino acids could potentially have been included, and some non-canonical amino acids in particular may have been present in early proteins (Alvarez-Carreño et al. 2013). Eventual stabilization of an evolving molecular alphabet is likely to depend on both environmental and metabolic accessibility, and functional advantages conferred by acquisition of specific alphabet members (Cleaves 2010). Beyond these kinds of generalizations, convergence at any level of specific chemistry for independently evolving alphabets is possible, but not predictable from first principles.

Conclusions

In summary, available information suggests that complex living systems based on chemical bonding universally require molecular alphabets for their realization, concatenation of which can yield functional molecules through self-folding into stable higher-order structures. At the core of this proposal is the requirement within any biosystem for a digital self-complementary primary replicative alphabet, which must also be capable of giving rise to functional concatenates at least during an early stage of molecular evolution. Limitations exist on the functional range of primary replicative alphabets through the need to preserve digital complementarity, suggesting that the evolution of unconstrained secondary alphabets is an additional co-requirement for the development of more complex life.

Apart from in silico modeling and biosystem simulation, the universality of molecular alphabets can be potentially physically studied and tested in two broad ways. The first approach could be pursued only if samples of life with a completely different evolutionary provenance become obtainable. This would be satisfied by finding life elsewhere in the solar system, or if non-standard terrestrial ‘shadow’ life was identified (Davies et al. 2009), provided that in either case the novel biosystem was demonstrably of uniquely different origins to familiar life. In the absence of such samples, the second alternative is already being pursued, by experimental modifications, extensions, and minimizations of the familiar nucleic acid and protein alphabets (Henry and Romesberg 2003; Liu and Schultz 2010; Pinheiro and Holliger 2012). Ultimately, a series of falsifiable propositions relating to the role of molecular alphabets in the evolution of complex biosystems can be put to the test through increasingly sophisticated levels of synthetic biology.

An analogy made elsewhere (Dunn 2009) compared known alphabetic and hypothetical ‘non-alphabetic’ life with writing systems based on phonetic alphabets vs. purely logographic scripts. While ‘alphabetic life’ is clearly possible from the precedent set on this planet, the above arguments suggest the impossibility of analogous ‘logographic life’, where each biomolecule is a separately constructed entity. As an echo of this, it has been noted that pure logographic scripts for natural languages such as Chinese actually do not exist (written Chinese shows phonetic character aspects when needed, as for example when representing foreign words (Unger 2004)). It is in fact the relative simplicity of alphabetic generation of function (whether molecules or words) upon which succeeding layers of complexity can be built, as realized on Earth. Although the word ‘alphabet’ is not seen in lists of words most frequently used in life definitions (Trifonov 2011), these considerations suggest that it could usefully become a part of them.