Difference between revisions of "Glossary"
Latest revision as of 11:50, 3 December 2008
Aliases
Aliases are database cross-references, usually used in the context of feature IDs.
Annotators SEED
The master copy for all data in the SEED environment. Users cannot access this password-protected site. All annotations are made available via the SEED-Viewer and the Trial-SEED.
Annotation
Please see Assigning a gene function and annotation.
Assigning a gene function and annotation
Annotators assign gene functions to genes, and we call this process annotation. In most contexts, people use the term annotation to refer to assignments of function to the genes within a single organism. We certainly use the term in this sense, but we also use it to describe the process of assigning functions to corresponding genes from numerous genomes. Our basic approach to annotation is to ask our annotators to annotate the genes included in a Subsystem (e.g., glycolysis) across all genomes. This process of annotation of the genes within a subsystem across a set of genomes, rather than annotation of genes within a single genome, allows our annotators to focus on a constrained set of functional roles and attempt to accurately identify exactly what variant, if any, of a subsystem exists in each of the genomes.
We use the term annotation to refer to assigning functions to genes (either within a single organism or to a constrained set of gene/protein families across a set of organisms). This activity certainly is closely related to the construction of subsystems and protein families (which we call FIGfams), but we will describe those activities elsewhere.
Assignment
Please see Assigning a gene function and annotation.
Bidirectional Best Hit (BBH)
The paper "The use of gene clusters to infer functional coupling" (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=EBI&pubmedid=10077608) defines a Bidirectional Best Hit (BBH) as follows:
Given two genes Xa and Xb from two genomes Ga and Gb, Xa and Xb are called a “bidirectional best hit (BBH)” if and only if recognizable similarity exists between them (in our case, we required fasta3 scores lower than 1.0 × 10⁻⁵), there is no gene Zb in Gb that is more similar than Xb is to Xa, and there is no gene Za in Ga that is more similar than Xa is to Xb. Genes (Xa, Ya) from Ga and (Xb, Yb) from Gb form a “pair of close bidirectional best hits (PCBBH)” if and only if Xa and Ya are close, Xb and Yb are close, Xa and Xb are a BBH, and Ya and Yb are a BBH.
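The BBH condition quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SEED implementation: the `scores` dictionary, the function name, and the gene names are hypothetical, and ties between equally good hits are resolved by keeping the first one seen.

```python
from typing import Dict, List, Tuple

# Hypothetical input: scores[(a, b)] is the similarity score between gene a
# from genome Ga and gene b from genome Gb. As with the fasta3 scores in the
# quoted definition, a lower score means a more similar pair.
Scores = Dict[Tuple[str, str], float]


def bidirectional_best_hits(scores: Scores, threshold: float = 1.0e-5) -> List[Tuple[str, str]]:
    """Return the (a, b) pairs that are each other's best hit."""
    best_for_a: Dict[str, Tuple[float, str]] = {}  # gene in Ga -> (score, best gene in Gb)
    best_for_b: Dict[str, Tuple[float, str]] = {}  # gene in Gb -> (score, best gene in Ga)

    for (a, b), score in scores.items():
        if score >= threshold:
            continue  # no recognizable similarity between a and b
        if a not in best_for_a or score < best_for_a[a][0]:
            best_for_a[a] = (score, b)
        if b not in best_for_b or score < best_for_b[b][0]:
            best_for_b[b] = (score, a)

    # Keep a pair only if the best hit of a is b AND the best hit of b is a.
    return sorted(
        (a, b) for a, (_, b) in best_for_a.items() if best_for_b[b][1] == a
    )


if __name__ == "__main__":
    example = {
        ("a1", "b1"): 1e-30,
        ("a1", "b2"): 1e-10,
        ("a2", "b1"): 1e-20,
        ("a2", "b2"): 1e-8,
        ("a3", "b3"): 0.5,  # above the threshold, so ignored
    }
    print(bidirectional_best_hits(example))
```

In the example, a2's best hit is b1, but b1's best hit is a1, so only a1 and b1 form a BBH.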
Clearinghouse
Please see Subsystem Clearing House.
Feature
A feature is a defined region of the DNA. A PEG is the most prevalent feature type in the SEED. Other feature types include RNA, prophage, and pathogenicity islands. The format for a feature ID is fig|genome_id.feature_abbreviation.feature_number (e.g., fig|83333.1.peg.100).
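As a rough illustration of the feature ID format, the following Python sketch splits an ID into its parts. The names `FeatureId` and `parse_feature_id` are hypothetical, and the pattern assumes that a genome ID consists of two dot-separated numbers (as in 83333.1) and that feature abbreviations are alphabetic; neither assumption is stated in this glossary.

```python
import re
from typing import NamedTuple

# Assumed shape: fig|<genome_id>.<feature_abbreviation>.<feature_number>,
# where <genome_id> itself looks like "83333.1".
FEATURE_ID_RE = re.compile(r"^fig\|(\d+\.\d+)\.([A-Za-z_]+)\.(\d+)$")


class FeatureId(NamedTuple):
    genome_id: str       # e.g. "83333.1"
    feature_type: str    # e.g. "peg" or "rna"
    feature_number: int  # e.g. 100


def parse_feature_id(fig_id: str) -> FeatureId:
    """Split a FIG feature ID into genome ID, feature type, and number."""
    match = FEATURE_ID_RE.match(fig_id.strip())
    if match is None:
        raise ValueError(f"not a FIG feature ID: {fig_id!r}")
    genome_id, feature_type, number = match.groups()
    return FeatureId(genome_id, feature_type, int(number))


if __name__ == "__main__":
    print(parse_feature_id("fig|83333.1.peg.100"))
```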
FIGfam
FIGfams are protein families generated by the Fellowship for Interpretation of Genomes (FIG). These families are based on the collection of subsystems, as well as correspondences between genes in closely related strains (we describe the construction of FIGfams in a separate SOP). The important properties of these families are as follows:
- Two PEGs which both occur within a single FIGfam are believed to have the same function.
- There is a decision procedure associated with the family which can be invoked to determine whether or not a gene can be “safely” assigned the function associated with the family.
FIG Identifier / FIG-IDs
We provide identifiers for genome sequences and features in the following form:
| Entity type | key | identifier |
|---|---|---|
| Genome | genome | fig\|83333.1 |
| PEG | id | fig\|83333.1.peg.123 |
| RNA feature | id | fig\|83333.1.rna.1 |
(Please also see below for information on how to link to the SEED.)
Functional coupling
The availability of multiple genomes provides an opportunity to gain new insights into the processes that drive the dispersion and formation of chromosomal gene clusters. The paper "The use of gene clusters to infer functional coupling" (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=EBI&pubmedid=10077608) describes a method to compute the functional coupling of features based on conserved gene clusters.
Functional role
The concept of functional role is both basic and primitive in the sense that we will not pretend to offer a precise definition. It corresponds roughly to a single logical role that a gene or gene product may play in the operation of a cell.
Gene function
The function of a protein-encoding gene (PEG) is the functional role played by the product of the gene or an expression describing a set of roles played by the encoded protein. The operators used to construct expressions and the meanings associated with the operators are described in
http://www.nmpdr.org/FIG/Html/SEED_functions.html
Genes other than PEGs can also be assigned functions (e.g., SSU rRNA). However, in most cases the functions assigned to genes other than PEGs tend not to be problematic.
Linking to the SEED
We support linking to the SEED using a generic mechanism:
Base URL:
http://www.theseed.org/linkin.cgi?
Supported SEED identifiers for external use:
| Entity type | key | identifier | Example |
|---|---|---|---|
| Genome | genome | fig\|83333.1 | http://www.theseed.org/linkin.cgi?genome=fig\|83333.1 |
| PEG | id | fig\|83333.1.peg.123 | http://www.theseed.org/linkin.cgi?id=fig\|83333.1.peg.123 |
| RNA feature | id | fig\|83333.1.rna.1 | http://www.theseed.org/linkin.cgi?id=fig\|83333.1.rna.1 |
SEED identifiers contain the NCBI taxonomy ID; if the taxonomy ID changes, we need to update our internal data accordingly. To provide stable external identifiers, we keep a list of IDs that have changed. When an outdated ID is requested, we display a warning message informing the user of the change and provide a link to the new version of the requested data.
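The linking scheme can be sketched as a small helper that picks the key (`genome` or `id`) from the shape of the identifier and appends it to the base URL. The function name `seed_link` is hypothetical, and telling genomes from features by counting dot-separated fields is an assumption based on the examples above. The identifier is appended unescaped, exactly as in the example links.

```python
BASE_URL = "http://www.theseed.org/linkin.cgi?"


def seed_link(identifier: str) -> str:
    """Build a link into the SEED for a genome or feature identifier."""
    prefix = "fig|"
    if not identifier.startswith(prefix):
        raise ValueError(f"identifier must start with {prefix!r}: {identifier!r}")

    # Assumed shapes: "83333.1" (genome) has two dot-separated fields,
    # "83333.1.peg.123" (feature) has four.
    fields = identifier[len(prefix):].split(".")
    if len(fields) == 2:
        key = "genome"
    elif len(fields) == 4:
        key = "id"
    else:
        raise ValueError(f"unrecognized identifier shape: {identifier!r}")

    return f"{BASE_URL}{key}={identifier}"


if __name__ == "__main__":
    for fig_id in ("fig|83333.1", "fig|83333.1.peg.123", "fig|83333.1.rna.1"):
        print(seed_link(fig_id))
```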
Metabolic Reconstruction
By the metabolic reconstruction of a given genome we simply mean the set of populated subsystems that contain the genome, the PEGs (and their assigned functions) that are connected to functional roles in these populated subsystems, and the specific variant code associated with the genome in each of the populated subsystems.
NMPDR pathogen genome
The NMPDR is responsible for five classes of genomes:
- Campylobacter jejuni
- Listeria monocytogenes
- Staphylococcus aureus
- Streptococcus pneumoniae and Streptococcus pyogenes
- Pathogenic Vibrio
Pair of Close Homologs (PCH)
The paper "The use of gene clusters to infer functional coupling" (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=EBI&pubmedid=10077608) defines a Pair of Close Homologs (PCH) as follows:
We can also define the concept of “pairs of close homologs” (PCHs) as follows: genes (X′a, Y′a) from Ga and (X′b, Y′b) from Gb form a PCH if and only if X′a and Y′a are close, X′b and Y′b are close, X′a and X′b are recognizably similar, and Y′a and Y′b are recognizably similar. Here, we will consider two genes to be recognizably similar if their gene products produce fasta3 scores lower than 1.0 × 10⁻⁵. We use a scoring scheme analogous to the one described for PCBBHs to evaluate the connections between PCHs, except that if Ga and Gb are the same genome, we assign an arbitrary “same-genome score” (“same-genome” pairs cannot occur for PCBBHs by definition, but for PCHs they are possible). Unlike PCBBHs from two very close genomes for which contiguity is completely uninformative in the vast majority of cases, PCHs allow recognition of gene clusters that play similar (but usually not identical) roles (such as two transport cassettes containing pairs of homologs) in the same or similar organisms. The arbitrary “same-genome score” should, we believe, have a value that is high enough to rank such instances as significant.
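The four conditions of the PCH definition can be written as a short Python check. This is a sketch under assumptions: the quoted text does not say what "close" means, so closeness is modelled here as two genes lying on the same contig with start positions at most `max_gap` bases apart, and both `max_gap` and its default value are hypothetical. The `positions` and `scores` inputs and all gene names are likewise illustrative.

```python
from typing import Dict, FrozenSet, Tuple

Positions = Dict[str, Tuple[str, int]]   # gene -> (contig name, start position)
Scores = Dict[FrozenSet[str], float]     # unordered gene pair -> similarity score


def recognizably_similar(g1: str, g2: str, scores: Scores, threshold: float = 1.0e-5) -> bool:
    """True if a score exists for the pair and is lower than the threshold."""
    score = scores.get(frozenset((g1, g2)))
    return score is not None and score < threshold


def close(g1: str, g2: str, positions: Positions, max_gap: int) -> bool:
    """Hypothetical closeness test: same contig, starts within max_gap bases."""
    (contig1, start1), (contig2, start2) = positions[g1], positions[g2]
    return contig1 == contig2 and abs(start1 - start2) <= max_gap


def is_pch(xa: str, ya: str, xb: str, yb: str,
           positions: Positions, scores: Scores, max_gap: int = 300) -> bool:
    """Check the four PCH conditions for (xa, ya) from Ga and (xb, yb) from Gb."""
    return (
        close(xa, ya, positions, max_gap)
        and close(xb, yb, positions, max_gap)
        and recognizably_similar(xa, xb, scores)
        and recognizably_similar(ya, yb, scores)
    )


if __name__ == "__main__":
    positions = {
        "Xa": ("Ga_contig1", 1000), "Ya": ("Ga_contig1", 1200),
        "Xb": ("Gb_contig1", 5000), "Yb": ("Gb_contig1", 5250),
    }
    scores = {
        frozenset(("Xa", "Xb")): 1e-40,
        frozenset(("Ya", "Yb")): 1e-12,
    }
    print(is_pch("Xa", "Ya", "Xb", "Yb", positions, scores))
```

Unlike the BBH case, nothing here requires the similar genes to be each other's best hits, which is why PCHs can also occur within a single genome.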
PEG
A Protein Encoding Gene (PEG) is equivalent to a CDS (Coding Sequence).
Populated Subsystem
Please see Subsystem.
RAST
RAST (Rapid Annotation using Subsystem Technology) is a rapid and very accurate annotation technology. We make a RAST server available for public use at:
http://rast.nmpdr.org
Scenarios
Scenarios represent components of a metabolic reaction network in which specific compounds are labeled as inputs and outputs. The metabolic network is assembled using biochemical reaction information associated with functional roles in subsystems to find paths through scenarios from inputs to outputs. Scenarios that are connected by linked inputs and outputs can be composed to form larger blocks of the metabolic network, spanning processes that convert transported nutrients into biomass components.
SEED-Viewer
The SEED-Viewer is a web-based application for browsing SEED data structures.
We use the SEED-Viewer to provide a public read-only version of the latest SEED data at:
http://seed-viewer.theseed.org
Please note: The data is updated automatically every 24 hours. When citing or linking to the SEED please use this version.
Subsystem
A subsystem is a set of functional roles that an annotator has decided should be thought of as related. Frequently, subsystems represent the collection of functional roles that make up a metabolic pathway, a complex (e.g., the ribosome), or a class of proteins (e.g., two-component signal-transduction proteins within Staphylococcus aureus). A populated subsystem is a subsystem with an attached spreadsheet. The rows of the spreadsheet represent genomes and the columns represent the functional roles of the subsystem. Each cell contains the identifiers of the genes from the corresponding genome that implement the specific functional role. That is, a populated subsystem specifies which genes implement instances of the subsystem in each of the genomes. The rows of a populated subsystem are assigned variant codes, which describe which of a set of possible variants of the subsystem exists within each genome (special codes exist to express the total absence of the subsystem or remaining uncertainty). Construction of a large set of curated populated subsystems is at the center of the NMPDR and SEED annotation efforts.
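One way to picture a populated subsystem is as a small data structure: a list of functional roles (the columns), a mapping from genome to role to gene identifiers (the cells), and one variant code per genome (per row). The Python sketch below is purely illustrative; the class name, field names, role names, and the variant code value are hypothetical and do not describe how the SEED stores subsystems.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PopulatedSubsystem:
    name: str
    roles: List[str]  # spreadsheet columns: the functional roles
    # Spreadsheet cells: genome ID -> functional role -> gene identifiers
    cells: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)
    # One variant code per spreadsheet row (per genome)
    variant_codes: Dict[str, str] = field(default_factory=dict)

    def genes_for(self, genome_id: str, role: str) -> List[str]:
        """Genes of the given genome that implement the given role."""
        return self.cells.get(genome_id, {}).get(role, [])

    def missing_roles(self, genome_id: str) -> List[str]:
        """Roles of the subsystem with no gene assigned in this genome."""
        return [role for role in self.roles if not self.genes_for(genome_id, role)]


if __name__ == "__main__":
    subsystem = PopulatedSubsystem(name="Example subsystem", roles=["Role A", "Role B"])
    subsystem.cells["83333.1"] = {"Role A": ["fig|83333.1.peg.100"]}
    subsystem.variant_codes["83333.1"] = "1"  # hypothetical variant code

    print(subsystem.genes_for("83333.1", "Role A"))
    print(subsystem.missing_roles("83333.1"))
```

A metabolic reconstruction for a genome, as defined in this glossary, would then correspond to collecting the rows for that genome across all populated subsystems that contain it, together with their variant codes.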
Subsystem Clearing House
Since annotators can work on any machine (including the public SEED), subsystems are propagated via the Subsystem Clearing House at:
http://clearinghouse.theseed.org/clearinghouse_browser.cgi?
Trial-SEED
A public, read-write copy of the SEED is available at:
http://theseed.uchicago.edu/FIG/index.cgi
Please note: The data on this server is updated at irregular intervals. Users should not assume that annotations made on this system will persist. Please publish your annotations to the Subsystem Clearing House.
Variant Code
Please see Subsystem.