GeneCalling
Calling Genes:
Standard Operating Procedure
Introduction
This procedure is designed to be used by members of the NMPDR and the SEED developers to call genes in two broad classes of genomes:
• Genomes for NMPDR pathogens, which come from five sets of closely related strains of pathogens, and
• Diverse genomes used to support comparative analysis.
When we integrate diverse complete genomes, we often just take the genomes from RefSeq. Only in cases where the RefSeq gene calls require improvement do we apply similar procedures as for the NMPDR pathogens. This document describes the procedure used to call genes for NMPDR pathogens.
Summary of the procedure
The overall set of steps used to call genes is as follows:
1. Identify tRNAs: When we search for tRNAs in the diverse, non-NMPDR genomes, we use tRNAscan-SE, a tool developed by Sean Eddy, which we believe performs extremely well. However, for NMPDR pathogens, we maintain a library of all located tRNAs; locating the corresponding instances in new genomes is extremely straightforward.
2. Identify rRNAs: Again, for this we use a custom tool constructed by the NMPDR. Since the basic classes of rRNAs for each of the organisms curated by the NMPDR have been identified in several existing genomes, locating these rRNAs in new genomes is again completely straightforward.
3. Identify a set of expected genes: There is a set of “universal” genes and a number of families of highly conserved genes that are expected to be present in each of the NMPDR pathogens. We look for the genes candidates that are the most similar to these “universal” genes or matching the highly conserved gene families in each new genome, recording those instances that are found.
4. Use GLIMMER trained on the detected genes: We use the highly conserved gene candidates found in step (3.) as training set for GLIMMER 2, a tool developed and made available by TIGR. In most cases, we believe that the set of gene candidates identified by GLIMMER given a highly reliable training set (which we feel exist in the cases of NMPDR pathogens) should be a “superset” of the set of real genes (modulo a small rate of “false negatives”).
5. Resolve overlaps: We have developed a custom tool that resolves gene overlaps. Because we can use the gene-calls for a set of phylogenetically close organisms to validate the candidates gene calls in the new genome, we can reliably identify the vast majority of genes in steps 1--3 with high accuracy. Hence, we are confident that this overlap resolution tool performs well.
6. Backfill gaps: We check for potential short genes within any gaps exceeding 90 bp that remain after step 5. We do this by blasting the gaps between gene candidates against a collection of genomes from the immediate phylogenetic neighborhood (i.e., a fixed set of genomes for each of the general classes of NMPDR pathogens). Gene candidates that are detected are added to the calls from steps 1--5. In some cases, this step may add gene candidates that are not real; we therefore periodically review the calls in each set of close genomes for possible “false positives,” but in general, we would prefer to add a few too many genes, rather than leave out real ones.
7. Check Starts: A penultimate pass is made to adjust start positions using an algorithm based on seeking a consistent position in each of the sets of genomes curated by the NMPDR. The algorithm takes into account the start codon, the presence or absence of a ribosome binding site, and the ability to align candidate start positions with those in the existing set of close genomes.
8. Identification of potential frameshifts and pseudo-genes: As a last step, we run an automated tool that attempts to identify potential frameshifts and pseudo-genes. When these are detected, these features are recorded, along with the evidence supporting the assertion.
Each of these steps generates a log entry summarizing the outcome. The final output constitutes a genome in a form that can be added to the SEED, and then into the NMPDR.
Details
Step 1: Identify tRNAs
To locate the tRNAs and produce tbl/fasta entries, invoke
find_instances –n User LibraryOfInstances SkeletalOrganismDirectory
This tool assumes that the library contains instances of tRNAs from closely related genomes. It locates instances matching the tRNAs stored in the indicated library, updates the tbl and fasta file, adds corresponding assigned_functions entries, and logs the results.
Step 2: Identify rRNAs
To locate the tRNAs and produce tbl/fasta entries, invoke
find_instances –n User LibraryOfInstances SkeletalOrganismDirectory
This tool assumes that the library contains instances of rRNAs from closely related genomes. It locates instances matching the rRNAs stored in the indicated library, updates the tbl and fasta file, adds the corresponding assigned_functions entries, and logs the results.
Step 3: Identify a set of expected genes
The searching for and identifying protein-encoding genes is done in two steps:
find_neighbors GenomeDirectory 20 found new > close.neighbors find_genes_based_on_neighbors GenomeDirectory found new < close.neighbors
The first step takes DATA/FigFamsData and a genome directory (of the form xxxx.y) and locates the set of 20 closest genomes, calling a few genes in the process. The second command locates instances of of FIGfams that occur within the closely related genomes and in the new genome. In the case of closely-related NMPDR pathogens, we expect to locate 80-90% of the protein-encoding genes and to inherit annotations from existing genomes.
Step 4: Use GLIMMER trained on existing genes
The command
recall_based_on_found GenomeDirectory close.neighbors > candidate_peg.tbl
recalls the genes using GLIMMER, making sure that there is a gene for every protein-encoding gene (peg) that has already been called (with the same start). That is, this tool forms a superset of the previously called genes by merging the calls from a GLIMMER run trained using the existing set of protein-encoding gene-calls with the existing set.
Step 5: Resolve overlaps
By running
resolve_overlaps GenomeDirectory candidate_peg.tbl
we remove unacceptable overlaps between called genes and those determined by GLIMMER, and add them to the genes in GenomeDirectory.
Step 6: Backfill gaps
We then run
backfill_gaps GenomeDirectory
to search for possible genes in any gaps that appear to be unusually long. Here, we retain only genes that have similarity to genes that exist in other genomes (that are sufficiently distant from the new genome).
Step 7: Check Starts
We use
adjust_starts GenomeDirectory
to attempt to make starts consistent with those of genes in existing genomes.
Step 8: Identification of potential frameshifts and pseudo-genes
Finally, we run
correct_frameshifts GenomeDirectory
if we feel that we should attempt to automatically correct potential frameshifts. This will produce log entries documenting the “corrections”. If we prefer to simply add annotations to what appear to be fragments of genes (i.e., real pseudo-genes due to real frameshifts or partial genes due to assembly errors), we run instead
annotate_potential_frameshifts GenomeDirectory
which adds annotations describing the evidence for a potential frameshift. The decision to use one approach over the other is handled case by case.