Glossary
Annotators SEED
The master copy of all data in the SEED environment. Users cannot access this password-protected site.
Annotation
Annotation can be defined as assigning a gene function to a specific sequence. Traditionally, annotation has been performed gene by gene, in each genome separately. The SEED approach is instead based on annotating many genomes at a time. It requires subsystems to be created; see Subsystem and Subsystem Clearing House.
Assigning a gene function and annotation
Annotators assign gene functions to genes, and we call this process annotation. In most contexts, people use the term annotation to refer to assignments of function to the genes within a single organism. We certainly use the term in this sense, but we also use it to describe the process of assigning functions to corresponding genes from numerous genomes. Our basic approach to annotation is to ask our annotators to annotate the genes included in a subsystem (e.g., glycolysis) across all genomes. Annotating the genes within a subsystem across a set of genomes, rather than the genes within a single genome, allows our annotators to focus on a constrained set of functional roles and to identify accurately which variant, if any, of a subsystem exists in each of the genomes.
We use the term annotation to refer to assigning functions to genes (either within a single organism or to a constrained set of gene/protein families across a set of organisms). This activity certainly is closely related to the construction of subsystems and protein families (which we call FIGfams), but we will describe those activities elsewhere.
Assignment
Please see Assigning a gene function and annotation.
FIGfam
FIGfams are protein families generated by the Fellowship for Interpretation of Genomes (FIG). These families are based on the collection of subsystems, as well as correspondences between genes in closely related strains (we describe the construction of FIGfams in a separate SOP). The important properties of these families are as follows:
- Two PEGs which both occur within a single FIGfam are believed to have the same function.
- There is a decision procedure associated with the family which can be invoked to determine whether or not a gene can be “safely” assigned the function associated with the family.
Functional role
The concept of functional role is both basic and primitive in the sense that we will not pretend to offer a precise definition. It corresponds roughly to a single logical role that a gene or gene product may play in the operation of a cell.
Gene function
The function of a protein-encoding gene (PEG) is the functional role played by the product of the gene, or an expression describing a set of roles played by the encoded protein. The operators used to construct expressions, and the meanings associated with those operators, are described at:
http://www.nmpdr.org/FIG/Html/SEED_functions.html
Genes other than PEGs can also be assigned functions (e.g., SSU rRNA). However, in most cases the functions assigned to genes other than PEGs tend not to be problematic. This document will focus solely on annotation of PEGs.
Metabolic Reconstruction
By the metabolic reconstruction of a given genome we simply mean the set of populated subsystems that contain the genome, the PEGs (and their assigned functions) that are connected to functional roles in these populated subsystems, and the specific variant code associated with the genome in each of the populated subsystems.
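The definition above can be read as a simple lookup over populated subsystems. The following is a minimal sketch in Python, not SEED code: the dictionary layout, the function name `metabolic_reconstruction`, and all identifiers and variant codes in the example are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical layout (an assumption, not the SEED data model):
#   subsystem name -> {"rows":     {genome: {functional role: [PEG ids]}},
#                      "variants": {genome: variant code}}
Subsystems = Dict[str, dict]


def metabolic_reconstruction(genome: str, subsystems: Subsystems) -> Dict[str, dict]:
    """Collect, for one genome, every populated subsystem that contains it,
    the PEGs connected to functional roles there, and the variant code."""
    reconstruction = {}
    for name, subsystem in subsystems.items():
        row = subsystem["rows"].get(genome)
        if row is None:
            continue  # this populated subsystem does not contain the genome
        # Keep only the roles that actually have PEGs attached in this genome.
        connected: Dict[str, List[str]] = {
            role: list(pegs) for role, pegs in row.items() if pegs
        }
        reconstruction[name] = {
            "variant": subsystem["variants"][genome],
            "roles": connected,
        }
    return reconstruction


# Made-up example data: names, PEG ids and variant codes are placeholders.
example = {
    "subsystem_X": {
        "rows": {
            "genome_1": {"role_A": ["genome_1.peg.10"], "role_B": []},
            "genome_2": {"role_A": ["genome_2.peg.7"], "role_B": ["genome_2.peg.8"]},
        },
        "variants": {"genome_1": "2", "genome_2": "1"},
    },
    "subsystem_Y": {
        "rows": {"genome_2": {"role_C": ["genome_2.peg.99"]}},
        "variants": {"genome_2": "1"},
    },
}

recon_1 = metabolic_reconstruction("genome_1", example)
recon_2 = metabolic_reconstruction("genome_2", example)
```

Here `recon_1` contains only `subsystem_X`, because `genome_1` has no row in `subsystem_Y`, while `recon_2` covers both subsystems.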
NMPDR pathogen genome
The NMPDR is responsible for five classes of genomes:
- Campylobacter jejuni
- Listeria monocytogenes
- Staphylococcus aureus
- Streptococcus pneumoniae and Streptococcus pyogenes
- Pathogenic Vibrio
PEG
A Protein Encoding Gene (PEG) is equivalent to a CDS (Coding Sequence).
Subsystem
A subsystem is a set of functional roles that an annotator has decided should be thought of as related. Frequently, subsystems represent the collection of functional roles that make up a metabolic pathway, a complex (e.g., the ribosome), or a class of proteins (e.g., two-component signal-transduction proteins within Staphylococcus aureus). A populated subsystem is a subsystem with an attached spreadsheet. The rows of the spreadsheet represent genomes and the columns represent the functional roles of the subsystem. Each cell contains the identifiers of the genes from the corresponding genome that implement the specific functional role. That is, a populated subsystem specifies which genes implement instances of the subsystem in each of the genomes. The rows of a populated subsystem are assigned variant codes, which describe which of a set of possible variants of the subsystem exists within each genome (special codes exist to express a total absence of the subsystem or remaining uncertainty). Construction of a large set of curated populated subsystems is at the center of the NMPDR and SEED annotation efforts.
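The spreadsheet described above can be sketched as a small data structure. This is an illustration under stated assumptions rather than the SEED implementation: the class name `PopulatedSubsystem`, its methods, the role names, the gene identifiers, and the variant code values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PopulatedSubsystem:
    """Hypothetical model of a subsystem with its attached spreadsheet."""
    name: str
    roles: List[str]  # columns: the functional roles of the subsystem
    # rows: genome -> {functional role -> identifiers of implementing genes}
    rows: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)
    # one variant code per row (genome)
    variant_codes: Dict[str, str] = field(default_factory=dict)

    def add_genome(self, genome: str, variant_code: str,
                   cells: Dict[str, List[str]]) -> None:
        """Add one spreadsheet row; roles not mentioned get an empty cell."""
        unknown = set(cells) - set(self.roles)
        if unknown:
            raise ValueError(f"roles not in subsystem: {sorted(unknown)}")
        self.rows[genome] = {role: list(cells.get(role, [])) for role in self.roles}
        self.variant_codes[genome] = variant_code

    def genes_for(self, genome: str, role: str) -> List[str]:
        """Return the content of one cell of the spreadsheet."""
        return self.rows[genome][role]


# Placeholder data: every name and code below is made up for illustration.
sheet = PopulatedSubsystem("subsystem_X", ["role_A", "role_B"])
sheet.add_genome("genome_1", "1", {
    "role_A": ["genome_1.peg.10"],
    "role_B": ["genome_1.peg.11", "genome_1.peg.42"],
})
sheet.add_genome("genome_2", "2", {"role_A": ["genome_2.peg.7"]})
```

A cell may hold several gene identifiers or none, which is why each cell is modeled as a list; `sheet.genes_for("genome_2", "role_B")` is an empty list because no gene was recorded for that role.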
Subsystem Clearing House
Because annotators can work on any machine (including the public SEED), subsystems are propagated via the Subsystem Clearing House web page.