Annotating 1000 genomes

A manifesto written in early 2004.

The Project to Annotate the First 1000 Sequenced Genomes, Develop Detailed Metabolic Reconstructions, and Construct the Corresponding Stoichiometric Matrices

by Ross Overbeek

Introduction

In December 2003, the Fellowship for Interpretation of Genomes (FIG) initiated The Project to Annotate 1000 Genomes (P1K). The explicit goal was to develop a technology for more accurate, high-volume annotation of genomes and to use this technology to provide superior annotations for the first 1000 sequenced genomes. Members of FIG were convinced that the current approaches for high-throughput annotation, based on protein families and automated pipelines that processed genomes sequentially, would ultimately fail to produce annotations of the desired accuracy. We believe that the key to development of high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes. The existing annotation approaches, in which teams analyze a whole genome at a time, ensure that annotators have no special expertise relating to the vast majority of genes they annotate. By having individuals annotate single subsystems over a large collection of genomes, we allow individuals with expertise in specific pathways (or, more generally, subsystems) to perform their task with relatively high accuracy.

The early stages of the effort began at FIG, but quickly spread to a number of cooperating institutions, most notably Argonne National Lab. During the first year of the project, we have developed detailed encodings of subsystems that include a majority of the genes from subsystems that make up the core cellular machinery. More importantly, we have developed the initial versions of technology needed to support the project.

The Project to Annotate 1000 Genomes has reached the stage where it is clear that it will very shortly produce what we call informal metabolic reconstructions that cover the majority of central metabolism as it is implemented in the nearly 300 more-or-less complete genomes that are now available. We think of an informal metabolic reconstruction as a partitioning of the cellular machinery into subsystems, the specification of the functional roles that make up each subsystem, and the inventory of which genes in a specific organism implement the functional roles. What is needed to support both qualitative analysis and effective quantitative modeling is to convert these informal metabolic reconstructions into formal metabolic reconstructions. By a formal reconstruction, we mean an accurate encoding of the metabolic network. The goal of such an encoding is to construct a list of metabolites and a detailed reaction network that is internally consistent (in the sense that metabolites that are produced by reactions are connected as substrates to other reactions or to specific transporters, and that all metabolites that act as substrates are produced by other reactions or provided by transporters). Perhaps a better way to put this is that all apparent anomalies are highlighted as such, and the essential components of the metabolic network are accurately encoded. The output of such an effort is normally what is termed a stoichiometric matrix, the basic resource required to support stoichiometric modeling. One of the central goals of this enlarged effort is to develop accurate stoichiometric matrices for each of the 1000 genomes; we refer to this component of the effort as The Project to Produce 1000 Stoichiometric Matrices.
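To make the notions of a stoichiometric matrix and of internal consistency concrete, the following is a minimal sketch, not the project's actual tooling. The reaction names, the metabolites A, B and C, and both function names are hypothetical, and every reaction is treated as irreversible for simplicity. Rows of the matrix are metabolites, columns are reactions, negative entries mark consumption and positive entries mark production. The check flags any metabolite that is never produced or never consumed and is not handled by a transporter.

```python
from typing import Dict, List, Set, Tuple

# A reaction maps metabolite name -> stoichiometric coefficient
# (negative = consumed, positive = produced). Toy data, purely illustrative.
Reaction = Dict[str, float]


def build_stoichiometric_matrix(
    reactions: Dict[str, Reaction],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Return (metabolites, reaction ids, matrix); rows are metabolites."""
    metabolites = sorted({m for rxn in reactions.values() for m in rxn})
    reaction_ids = list(reactions)
    matrix = [
        [reactions[r].get(m, 0.0) for r in reaction_ids] for m in metabolites
    ]
    return metabolites, reaction_ids, matrix


def find_anomalies(
    metabolites: List[str],
    matrix: List[List[float]],
    transported: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (never produced, never consumed), skipping transported metabolites."""
    never_produced: List[str] = []
    never_consumed: List[str] = []
    for metabolite, row in zip(metabolites, matrix):
        if metabolite in transported:
            continue  # supplied or removed by a transporter (an assumption here)
        if not any(coeff > 0 for coeff in row):
            never_produced.append(metabolite)
        if not any(coeff < 0 for coeff in row):
            never_consumed.append(metabolite)
    return never_produced, never_consumed


# Hypothetical two-step pathway: A -> B -> C, with A taken up by a transporter.
toy_reactions = {
    "R1": {"A": -1.0, "B": 1.0},
    "R2": {"B": -1.0, "C": 1.0},
}
metabolites, reaction_ids, S = build_stoichiometric_matrix(toy_reactions)
never_produced, never_consumed = find_anomalies(metabolites, S, transported={"A"})
```

In this toy network C is produced but never consumed, so it is reported as an apparent anomaly to be highlighted rather than silently accepted.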

It is our belief that the development of the technology required to mass-produce accurate genome annotations will ultimately allow fully automated annotation pipelines to achieve relatively high accuracy. Similarly, the existence of 1000 accurate formal metabolic reconstructions would constitute a resource that would allow rapid and accurate development of stoichiometric matrices for newly-sequenced genomes. That is, besides producing accurate annotations, informal metabolic reconstructions, formal metabolic reconstructions, and stoichiometric matrices for a large collection of diverse genomes, we believe that the expanded project will produce technology that will support nearly automatic, very rapid characterization of new genomes.

All of the encoded subsystems, metabolic reconstructions and stoichiometric matrices will be made freely available on open web sites. In addition, the software environments used to develop the encoded subsystems and stoichiometric matrices will be developed and supported as open source software. By making the fundamental data items, the encoded subsystems and stoichiometric matrices, freely available to the community, we expect to stimulate development of alternative software systems to support curation and maintenance of these items.

The Project to Annotate 1000 Genomes

We have chosen to conceptually break the Project to Annotate 1000 Genomes into three stages. We discuss these stages as if they will occur sequentially; in fact, all three stages are now in progress. To understand the three stages, the reader must have at least a rudimentary grasp of what we mean by an encoded subsystem and an informal metabolic reconstruction. When we speak of a subsystem, we think of a set of related functional roles. In a specific organism, a set of genes implements these roles, and we think of those genes as constituting the subsystem in that organism. That is, we are really dealing with an abstract notion of subsystem (in which the subsystem is a set of functional roles) and instances of the subsystem in a specific organism (in which a set of genes implements the abstract functional roles). Precisely the same subsystem and functional roles exist in distinct organisms, although obviously the genes are unique to each organism.

Subsystems are thought of as possibly having multiple variants. Organisms that have operational versions of a subsystem may well have genes that implement slightly different subsets of the functional roles that make up the subsystem. Each subset of functional roles that exists in at least one organism with an operational version of the subsystem constitutes an operational variant.

We think of an informal metabolic reconstruction for an organism as a set of operational variants of subsystems that are believed to exist for the organism. In this conceptualization, one does not have a meaningful functional hierarchy or DAG; rather, we simply have an inventory of functional roles that are implemented in the organism, along with the variants of subsystems that they implement. We do believe that the task of imposing an actual hierarchy is relatively straightforward in comparison with the effort required to construct the set of operational variants. In some contexts, we have included a functional overview in which the subsystems are embedded at the lowest levels. It is clear that, given a diverse collection of informal metabolic reconstructions, appropriate functional hierarchies can be developed with relatively few resources.

Our encoding of a subsystem can now be reduced to (1) a specification of a set of functional roles (this amounts to the abstract subsystem) and (2) sets of genes which implement the operational variants in a number of genomes. These genes are given as a subsystem spreadsheet in which each row corresponds to a single genome, each column corresponds to a single functional role, and each cell contains the set of genes in that genome that are believed to implement the given functional role.
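The subsystem spreadsheet described here can be sketched as a nested mapping. This is only an illustration under stated assumptions: the genome names, role names and gene identifiers are all hypothetical, and every row is assumed to represent an operational version of the subsystem. Grouping rows by the subset of roles that have at least one gene recovers the operational variants.

```python
from typing import Dict, FrozenSet, List, Set

# genome -> functional role -> set of gene ids believed to implement that role.
# All names below are made up for illustration.
Spreadsheet = Dict[str, Dict[str, Set[str]]]

spreadsheet: Spreadsheet = {
    "genome_1": {
        "role_A": {"gene_1_10"},
        "role_B": {"gene_1_11"},
        "role_C": {"gene_1_12"},
    },
    "genome_2": {
        "role_A": {"gene_2_05"},
        "role_B": set(),  # empty cell: no gene found for this role
        "role_C": {"gene_2_07", "gene_2_08"},  # two genes share one role
    },
    "genome_3": {
        "role_A": {"gene_3_01"},
        "role_B": {"gene_3_02"},
        "role_C": {"gene_3_03"},
    },
}


def roles_present(row: Dict[str, Set[str]]) -> FrozenSet[str]:
    """The subset of functional roles that have at least one gene in this genome."""
    return frozenset(role for role, genes in row.items() if genes)


def group_by_variant(sheet: Spreadsheet) -> Dict[FrozenSet[str], List[str]]:
    """Group genomes by the subset of roles they implement (one group per variant)."""
    variants: Dict[FrozenSet[str], List[str]] = {}
    for genome in sorted(sheet):
        variants.setdefault(roles_present(sheet[genome]), []).append(genome)
    return variants


variants = group_by_variant(spreadsheet)
```

Here genome_1 and genome_3 share one variant (all three roles), while genome_2 represents a second variant that lacks role_B.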

The Project to Annotate 1000 Genomes amounts to an effort to produce detailed and comprehensive encodings of several hundred subsystems, which will impose assigned functions on genes in each of the genomes. The total percentage of genes that can be assigned functions this way is probably on the order of 50-70% in most genomes (in large eukaryotic genomes the total is obviously substantially lower). The percentage will grow as our understanding grows. What should be noted is that the accuracy of these assignments will be substantially better than that of current assignments, and the conserved cellular machinery almost all falls within the projected subsystems.
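As a rough illustration of how encoded subsystems impose functions on genes, and of how the percentage of assigned genes might be measured, consider the sketch below. The data layout (subsystem name, then genome, then role, then genes), all identifiers, and both function names are assumptions made for this example only.

```python
from typing import Dict, List, Set

# subsystem -> genome -> functional role -> gene ids (hypothetical toy data).
Encodings = Dict[str, Dict[str, Dict[str, Set[str]]]]

encodings: Encodings = {
    "subsystem_X": {"genome_1": {"role_A": {"gene_1"}, "role_B": {"gene_2"}}},
    "subsystem_Y": {"genome_1": {"role_C": {"gene_2"}}},
}

# Every gene called in the genome, whether or not it has a function yet.
all_genes_genome_1: Set[str] = {"gene_1", "gene_2", "gene_3", "gene_4"}


def assign_functions(enc: Encodings, genome: str) -> Dict[str, List[str]]:
    """Map each gene of one genome to the sorted functional roles it implements."""
    assigned: Dict[str, Set[str]] = {}
    for sheet in enc.values():
        for role, genes in sheet.get(genome, {}).items():
            for gene in genes:
                assigned.setdefault(gene, set()).add(role)
    return {gene: sorted(roles) for gene, roles in assigned.items()}


def fraction_assigned(all_genes: Set[str], assigned: Dict[str, List[str]]) -> float:
    """Fraction of the genome's genes that received at least one function."""
    if not all_genes:
        return 0.0
    return len(all_genes & set(assigned)) / len(all_genes)


functions = assign_functions(encodings, "genome_1")
coverage = fraction_assigned(all_genes_genome_1, functions)
```

In this toy genome two of four genes receive functions, so the coverage is 0.5; gene_2 carries two roles because it appears in two subsystems.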

Once we have produced our initial set of annotations, we believe that automated pipelines and protein families are excellent tools for propagating them. Protein families are, in fact, a key component of annotation and provide the fundamental mechanism for projection of function between genes. The added dimension provided by subsystems, along with the manual curation required to develop accurate initial encodings of subsystems, is an essential technology for increasing the accuracy and effectiveness of protein families. Ultimately the encoded subsystems will be used to make incremental, essential corrections to collections of protein families (like those supported by UniProt and COGs), and a basis for much more accurate annotation will emerge.

We now proceed to describe the details of the three stages.

Stage 1: Development of Initial Encodings of Subsystems

The initial stage of the project will involve development of approximately 100-150 subsystems that will cover most of the conserved cellular machinery in prokaryotes (and all of the central metabolic machinery in eukaryotes). This work will be done largely by trained annotators who achieve a limited mastery of specific subsystems via review articles and detailed analysis of the collection of genomes. These individuals can define the abstract subsystems and add most genomes to the emerging spreadsheets, but not without error. They are necessarily far less skilled than experts who have invested tens of years in study of specific subsystems.

These initial subsystems will have many uses. They can be used to enhance sets of curated protein families, to clarify identification of gene starts, and to develop a consistent set of annotations. They will form the basis of informal metabolic reconstructions, and will be used to support the development of formal metabolic reconstructions. However, given the relative lack of expertise of these initial annotators and the fact that they will seldom have access to the wet lab facilities needed to remove ambiguities in assignments, errors will inevitably remain.

Stage 2: The Use of True Experts and the Wet Lab to Refine the Encodings

The second stage will involve the gradual refinement and enhancement of the original subsystem encodings by domain experts. Almost every subsystem spreadsheet makes it clear that numerous detailed questions remain to be answered. These questions relate to correcting gene calls, correcting frameshifts, refining function assignments, and removing ambiguities (either via bioinformatics-based analysis or through actual wet lab efforts).

The participation of domain experts will be critical, but it seems most likely that a relatively small set will choose to get involved until the utility of the approach becomes obvious. We already have some domain experts (in translation, transcription, and a limited number of metabolic subsystems) participating in the effort. We believe that this number will grow rapidly over the next 2-3 years.

It should be emphasized that upon completion of Stage 2 we will have accurate annotations and a solid foundation for the construction of stoichiometric matrices.

Stage 3: Understanding the Evolutionary History of the Genes within the Subsystem

The third stage involves determination of the evolutionary history of the genes within the subsystem. To understand what this involves and the utility of this type of analysis, we simply recommend two papers by the team led by Roy Jensen:

Ancient origin of the tryptophan operon and the dynamics of evolutionary change, by Xie, Keyhani, Bonner, Jensen, Microbiol Mol Biol Rev. 2003 Sep;67(3):303-42

Inter-genomic displacement via lateral transfer of bacterial trp operons in an overall context of vertical genealogy, by Xie, Song, Keyhani, Bonner, Jensen, BMC Biology, 2004, 2:15

These papers elegantly display the exact style of analysis required to uncover and clarify the evolutionary history of the relevant genes. Essentially, trees must be built containing all of the genes implementing each specific functional role (multiple trees may be needed for distinct forms). Those trees that display a common topology indicate which columns in the spreadsheet can be used to infer the most probable vertical history of the subsystem. Once the overall history has been clarified, it becomes possible to attempt clarification of horizontal transfers, to reconstruct the history of clusters on the chromosome, and in some cases to tie the analysis to regulatory issues.

The effort required to do this style of analysis well is high. While we expect the initial efforts to go slowly, we also expect experience and advances in tools to dramatically reduce the required effort. In any event, it is clear that this stage will not be completed in the next few years, but will undoubtedly stimulate large amounts of related research.

Filling in the Missing Pieces

The encoded subsystems produced by the Project to Annotate 1000 Genomes offer a detailed picture of exactly what components have been identified and are present in each genome. Perhaps as significant, they vividly display exactly what is missing or ambiguous, allowing one to arrive at an accurate inventory of gaps in our understanding. The issue of how best to address these gaps is an integral part of the project. The technology that is emerging is what we refer to as the bioinformatics-driven wet lab. This concept refers to the development of a wet lab that utilizes conventional biochemical and genetic techniques in a framework designed to maximize the overall number of confirmations. It is driven by predictions arising from the analysis of subsystems, and it targets a prioritized list of conjectures. That is, the explicit goal is to fill in as many gaps and remove as many ambiguities as possible for resources consumed.
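One way to picture a prioritized list of conjectures is the following sketch. It is not a description of how the wet lab actually sets priorities. It assumes, hypothetically, that each conjecture carries an estimated probability of confirmation and an estimated cost, ranks conjectures by expected confirmations per unit of cost, and selects greedily within a fixed budget. All names and numbers are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Conjecture:
    name: str
    p_confirm: float  # estimated probability that the wet lab confirms it
    cost: float       # estimated effort, in arbitrary units (must be > 0)


def prioritize(conjectures: List[Conjecture], budget: float) -> List[Conjecture]:
    """Greedy selection by expected confirmations per unit cost, within budget."""
    ranked = sorted(conjectures, key=lambda c: c.p_confirm / c.cost, reverse=True)
    chosen: List[Conjecture] = []
    remaining = budget
    for conjecture in ranked:
        if conjecture.cost <= remaining:
            chosen.append(conjecture)
            remaining -= conjecture.cost
    return chosen


# Hypothetical candidates arising from gaps in subsystem spreadsheets.
candidates = [
    Conjecture("conjecture_A", p_confirm=0.9, cost=1.0),
    Conjecture("conjecture_B", p_confirm=0.8, cost=4.0),
    Conjecture("conjecture_C", p_confirm=0.6, cost=2.0),
]
selected = prioritize(candidates, budget=3.0)
expected_confirmations = sum(c.p_confirm for c in selected)
```

A greedy ranking is a simplification chosen for readability; it does not guarantee the best possible selection for every budget.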

Although it is inconceivable that one experimental group would be able to assess all of the functional predictions, we believe that integrating an experimental component into our annotation/modeling effort will directly support our main goal. In addition to verification of key predictions and removal of central ambiguities, it will validate the overall approach and set an example for other groups worldwide.

The Project to Develop 1000 Stoichiometric Matrices

We believe that the informal metabolic reconstructions are of substantial value by themselves. Indeed, numerous applications are quite obvious. However, they are not enough to support quantitative modeling. Whole genome modeling will require development of stoichiometric matrices, an effort that will pay many dividends. The most immediate payout is as quality control on the informal metabolic reconstruction. Just as the use of subsystems imposes a critical set of consistency checks on the assignment of function to genes, an attempt to develop an internally consistent reaction network imposes a strong consistency check on both the annotations and assertions of the presence of specific subsystems.

Over the last 4-5 years, the success of stoichiometric modeling has set the stage for large-scale employment of the technology. The key limiting factor is the development of the stoichiometric matrix itself. This is a time-consuming task that frequently requires on the order of a year for a skilled practitioner. Many actual modeling efforts have foundered on just the technical difficulties in producing this basic datum. Bernhard Palsson has pioneered much of the key research that has led to the recent successes. Spending large amounts of effort, his team has built a very few of these stoichiometric matrices, iteratively improving their accuracy. They have successfully used these matrices to support initial modeling efforts on the organisms, and the results have gained international recognition.

Palsson's team originated The Project to Produce 1000 Stoichiometric Matrices, and they will play the lead role in converting the informal metabolic reconstructions into formal reconstructions and producing the matrices. The team at FIG and Argonne National Laboratory will participate in the effort, coordinating closely with Palsson's team. At this point, the Palsson team and the teams at FIG, ANL, and The Burnham Institute are all working on issues relating to tools to automate the generation of matrices from informal metabolic reconstructions.

The Participants

We expect participants in both projects from many institutions worldwide, probably with both academic and commercial interests. Initially, it is likely that the effort will be led from FIG, ANL and Palsson's team at UCSD. We are planning on Roy Jensen playing a role relating to quality control and development of tools to support Stage 3 analysis. Andrei Osterman from the Burnham Institute will lead wet lab efforts to challenge in silico predictions.

We welcome broad participation. If the effort is successful, we would hope to stimulate numerous research efforts worldwide, and leadership and participation will ultimately broaden rapidly.

A Proposed Schedule

Let us begin by estimating the point at which 1000 genomes will become available. One simple approach would go as follows:

The number of genomes will double approximately every 18 months. We now have about 300 more-or-less complete genomes. Therefore, we should have approximately 1000 genomes in just a bit under 3 years (by sometime in 2007).
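The arithmetic behind this estimate can be checked with a short calculation. The sketch below assumes pure exponential growth at the stated doubling time; the function name is ours.

```python
import math


def years_to_reach(current: float, target: float, doubling_months: float = 18.0) -> float:
    """Years needed to grow from current to target at a fixed doubling time."""
    doublings = math.log2(target / current)
    return doublings * doubling_months / 12.0


years = years_to_reach(current=300, target=1000)
```

Going from 300 to 1000 genomes takes about 1.74 doublings, which at 18 months each is roughly 2.6 years, consistent with "just a bit under 3 years".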

There is a great deal in this analysis that is far from certain. However, let us use this estimate as a working hypothesis.

2005

During 2005, Stage 1 will be completed for the vast majority of subsystems. Stage 2 will be initiated for 30-50 subsystems. Fewer than 10 will move deeply into Stage 3.

We will actively attempt to produce 10-15 stoichiometric matrices. We will focus on diverse organisms of interest to DOE and a set of gram-positive pathogens.

We will begin a detailed review for quality assurance by a small number of expert biochemists and microbiologists.

We expect wet lab confirmations to begin, but this is one area in which funding plays an essential role. We expect funding to support targeted confirmation/rejection of the numerous conjectures arising from the bioinformatics to begin in 2005-2006. It is possible to fairly accurately predict the potential flow of confirmations, but we cannot predict available funding. We believe that the bioinformatics-driven wet lab, in which conjectures are prioritized and grouped, would allow a relatively small group (of 3-4 postdocs and technicians) to characterize up to 50 novel gene families encoding the most important functional roles in central metabolic subsystems of diverse organisms per year.

2006

During 2006, the vast majority of subsystems will enter Stage 2. We will attempt to move a large number into Stage 3 (this is truly difficult to predict; it depends hugely on success with the early attempts, our ability to reduce the required effort, and the research aims of the participants).

We would plan on completing at least 200 more stoichiometric matrices.

If the wet lab component of the effort is fully functional, we would expect a steady stream of confirmations, and (based on our past experience) we would project roughly that 75-90% of the tested conjectures will be validated.

2007

During 2007 we would plan on pushing Stage 2 and 3 analysis as far as possible. We believe that we will have the subsystems needed to cover the vast majority of well understood subsystems and many that are not well understood.

We would plan on completing initial stoichiometric matrices for several hundred more genomes. Since the majority of the genomes will not become available until this year, of necessity many of the stoichiometric matrices will not be reasonably complete before sometime in 2008 or 2009.

If the wet lab component of the effort is fully functional, we would expect the stream of successful conjectures to stimulate numerous labs to join the effort. Ultimately, the role of the wet lab component that is tightly-coupled to the project is to demonstrate the huge improvement in efficiency that can be attained by coupling the wet lab effort to well-chosen, targeted conjectures generated from the subsystems.

A Short Note on the Analysis of Environmental Samples

It is becoming clear that analysis of environmental samples will become increasingly significant. Consider a framework in which we have 1000 genomes and detailed informal metabolic reconstructions for all of them. We believe that, given a substantial environmental sample,

it will be possible to produce accurate estimates of which organisms are present (where an "organism" in this context should probably be viewed as "some organism within a very constrained phylogenetic neighborhood"); it will be possible to produce fairly precise estimates of the metabolism of the organisms believed to be present; and it will be possible to compare the predicted metabolism with the actual enzymes detected in the environmental sample.
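The last of these comparisons can be sketched with plain set operations. The example assumes, hypothetically, that each informal metabolic reconstruction is reduced to a set of functional roles and that the sample analysis yields a set of detected roles. How organisms are identified in the sample is outside the scope of this sketch, and all names are invented.

```python
from typing import Dict, Iterable, Set


def compare_predicted_with_detected(
    reconstructions: Dict[str, Set[str]],
    organisms_present: Iterable[str],
    detected_roles: Set[str],
) -> Dict[str, Set[str]]:
    """Compare roles predicted from reconstructions with roles seen in a sample."""
    predicted: Set[str] = set()
    for organism in organisms_present:
        predicted |= reconstructions[organism]
    return {
        "confirmed": predicted & detected_roles,
        "predicted_not_detected": predicted - detected_roles,
        "detected_not_predicted": detected_roles - predicted,
    }


# Hypothetical reconstructions: organism -> functional roles it implements.
reconstructions = {
    "organism_1": {"role_A", "role_B"},
    "organism_2": {"role_B", "role_C"},
    "organism_3": {"role_D"},
}
report = compare_predicted_with_detected(
    reconstructions,
    organisms_present=["organism_1", "organism_2"],
    detected_roles={"role_A", "role_C", "role_D"},
)
```

Roles that are detected but not predicted (role_D here) would suggest either an organism missing from the estimate or a gap in a reconstruction.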

The hope is clearly that we will be able to make accurate estimates, given 1000 well-annotated genomes.

Summary

The value of a collection of 1000 genomes depends directly on the quality of the annotations, the corresponding metabolic reconstructions, and the extent to which the foundations of modeling have been established.

The Project to Annotate 1000 Genomes is based directly on the notion of building a collection of carefully created and curated subsystems. The fact that the individuals who encode these subsystems annotate the same subsystem over a broad collection of genomes allows them to gain an understanding of detailed variation and at least a minimal grasp of the review literature. They will be annotating genes for which they develop some detailed familiarity. We place this technology in direct opposition to the existing approaches in which individuals annotate complete genomes (assuring an almost complete lack of familiarity with the majority of genes being annotated), and automated pipelines are badly limited by the ambiguities and errors in existing annotations.

The Project to Produce 1000 Stoichiometric Matrices has the potential of laying the foundations for quantitative modeling. Many, if not most, existing modeling efforts are dramatically hampered by the fact that very, very few stoichiometric matrices now exist, and the cost of developing more using existing approaches is quite high.

The development of a wet lab component that challenges a carefully prioritized set of conjectures flowing from both the subsystems analysis and the initial quantitative modeling is essential. It will confirm the relative efficiency of this approach (which might reasonably be characterized as "picking the low-hanging fruit"), and in the process establish a paradigm that directly challenges the more common approach to establishing priorities.

We claim to understand the key technology needed for high-throughput development of annotations, metabolic reconstructions, and stoichiometric matrices. By the summer of 2005, this should be completely obvious.