Genes in JGI genomes can be manually deleted, extended, truncated, inserted, or tagged as pseudogenes.
Gene deletion corresponds to adding a tag “/dubious” or a tag “/mini” in the public genomes.
Genes should be deleted if they overlap other genes (on the divergent or the same strand) and the overlap cannot be removed by truncating the gene(s). When an overlap is removed by gene deletion, the real gene should be kept and the dubious one deleted. A gene is more likely to be real if it
- encodes an RNA, or
- has a COG, InterPro, or Pfam assignment, or
- has homologs outside the genus.
Genes encoding proteins shorter than 60 aa that have no homologs outside the genus are likely to be dubious and are also deleted.
Rules for gene deletion:
- delete both the gene and the CDS;
- delete the transmembrane region and signal peptide features.
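The criteria above can be sketched as a small check. This is only an illustration: the `GeneEvidence` record, its field names, and both function names are hypothetical, and the evidence is assumed to have been collected beforehand.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical summary of the evidence collected for one gene."""
    encodes_rna: bool = False
    has_cog: bool = False
    has_interpro: bool = False
    has_pfam: bool = False
    has_homolog_outside_genus: bool = False
    protein_length_aa: int = 0


def is_likely_real(ev: GeneEvidence) -> bool:
    # Any single line of evidence makes the gene more likely to be real.
    return (
        ev.encodes_rna
        or ev.has_cog
        or ev.has_interpro
        or ev.has_pfam
        or ev.has_homolog_outside_genus
    )


def is_short_dubious(ev: GeneEvidence) -> bool:
    # Protein-coding gene shorter than 60 aa with no homologs outside the genus.
    return (
        not ev.encodes_rna
        and ev.protein_length_aa < 60
        and not ev.has_homolog_outside_genus
    )
```

For an overlapping pair, a curator would keep the gene for which `is_likely_real` holds and delete the other; pairs where both or neither qualify still need manual judgment.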
Genes can be extended or truncated to match the length of their homologs.
A gene should be extended if
- its N-terminus (5’ end) is significantly shorter (by more than 30 aa) than those of the 10 best homologs,
- it has an incomplete COG hit (less than 80% of the full-length hit), and
- there is an alternative start codon upstream of the current one. ATG is preferred over GTG, which is preferred over TTG.
Rules for gene extension:
- extend both the gene and the CDS;
- after extension, the gene should produce a better alignment with its homologs;
- extension should not create overlapping genes; adjust the gene length accordingly;
- if extending over a stop codon (TGA translated to selenocysteine), adjust translation manually.
If the gene cannot be extended because there is no valid start codon in the upstream region, but it still covers a significant portion of the COG (more than 30% but less than 80% of the full-length hit) and produces a protein longer than 100 aa, the gene should be marked as “short” in the “/note”, with a comment that it cannot be extended. Such genes are assumed to be potentially functional, but their function is most likely different from that of their full-length homologs.
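The extension criteria and the start codon preference can be sketched as follows. The function names, argument names, and return labels are hypothetical. The "review" outcome is an assumption for cases the rules above do not cover, and candidate start codons are assumed to be in frame and listed as (position, codon) pairs.

```python
# Preference order stated in the guidelines: ATG over GTG over TTG.
START_CODON_RANK = {"ATG": 0, "GTG": 1, "TTG": 2}


def pick_start_codon(candidates):
    """Return the preferred (position, codon) pair, or None if none is valid.

    Ties between identical codons go to the candidate listed first.
    """
    valid = [c for c in candidates if c[1].upper() in START_CODON_RANK]
    if not valid:
        return None
    return min(valid, key=lambda c: START_CODON_RANK[c[1].upper()])


def extension_decision(shortfall_aa, cog_coverage, protein_length_aa, upstream_starts):
    """Classify a gene as 'keep', 'extend', 'mark_short', or 'review'.

    shortfall_aa: how many aa shorter the N-terminus is than the best homologs.
    cog_coverage: fraction (0..1) of the full-length COG hit that is covered.
    """
    if not (shortfall_aa > 30 and cog_coverage < 0.80):
        return "keep"
    if pick_start_codon(upstream_starts) is not None:
        return "extend"
    if cog_coverage > 0.30 and protein_length_aa > 100:
        return "mark_short"  # add "short" to the /note
    return "review"  # assumption: not covered by the rules, needs a curator
```

A gene classified as "extend" still has to satisfy the rules above: the extended gene must align better with its homologs and must not overlap a neighbouring gene.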
A gene should be truncated if
- it is a real gene (see the characteristics of a real gene above),
- it overlaps another real gene (on the divergent or the same strand). Note that this overlapping gene could have been missed by the ab initio gene prediction algorithm but picked up later by BLASTx,
- its N-terminus (5’ end) is significantly longer (by more than 30 aa) than those of the best homologs,
- this N-terminal extension does not have a COG, InterPro, or Pfam hit, or a homolog (if it does, the gene is a fusion), and
- there is an alternative start codon downstream of the current one. Rules for start codon selection are the same as for gene extension.
If the gene cannot be truncated because there is no valid start codon downstream of the current one, but it is considered real, the gene should be marked as “long” in the “/note”, with a comment that it cannot be truncated.
Rules for gene truncation:
- truncate both the gene and the CDS;
- truncate the transmembrane region and signal peptide features as well, or delete them altogether.
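The coordinate bookkeeping for a truncation can be sketched as follows. The function names and the feature tuples are hypothetical, and all coordinates are assumed to be 1-based, inclusive nucleotide positions on the forward strand of the contig.

```python
def truncate_gene(start, end, strand, trim_nt):
    """Move the gene start downstream by trim_nt nucleotides.

    On the "+" strand the left coordinate moves; on the "-" strand the
    right coordinate moves, because the start codon sits at the right end.
    """
    if trim_nt % 3 != 0:
        raise ValueError("trim_nt must be a multiple of 3 to stay in frame")
    if not 0 <= trim_nt < end - start + 1:
        raise ValueError("trim_nt must be smaller than the gene length")
    if strand == "+":
        return start + trim_nt, end
    return start, end - trim_nt


def clip_features(features, new_start, new_end):
    """Truncate or drop (name, start, end) features to fit the new gene."""
    clipped = []
    for name, f_start, f_end in features:
        if f_end < new_start or f_start > new_end:
            continue  # feature lies entirely in the removed region: delete it
        clipped.append((name, max(f_start, new_start), min(f_end, new_end)))
    return clipped


features = [("signal_peptide", 100, 160), ("tm_region", 180, 250), ("tm_region", 400, 460)]
new_start, new_end = truncate_gene(100, 999, "+", 90)
kept = clip_features(features, new_start, new_end)
```

In this example the signal peptide falls entirely within the removed region and is deleted, the first transmembrane region is shortened, and the second is unchanged. The same new coordinates apply to both the gene and the CDS.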
Candidates for gene insertion should qualify as real genes (COG, InterPro or Pfam hit, or a homolog outside the genus; length matches that of its homologs and COG hit is >80% of the full-length hit).
Rules for gene insertion:
- add both the gene and the CDS;
- gene insertion should not create overlapping genes; adjust the length of the gene accordingly;
- assign a unique locus_tag to the gene and CDS; acceptable format is “or1989a”, “or1990b”, etc.
- add a “/product” tag to the CDS, but not to the gene. The product description should be in lower case. If it refers to a family (e.g. ATPase), it should say “ATPase-like protein”, not “similar to ATPase”. References to an organism (e.g. “similar to a hypothetical protein RP189 from Rickettsia prowazekii”) should be avoided; use “hypothetical protein” instead and put the reference to the specific organism and protein in the “/note”. Do not put EC numbers or COG numbers in the “/product”; use “/note” instead.
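The naming rules above lend themselves to a simple automated check. The sketch below is only an illustration: the function names are hypothetical, the locus_tag pattern is inferred from the two examples given ("or1989a", "or1990b"), and the capitalisation test is a heuristic that tolerates names such as "ATPase".

```python
import re

# Assumption: lower-case prefix, digits, one lower-case suffix letter.
LOCUS_TAG_RE = re.compile(r"^[a-z]+\d+[a-z]$")
EC_RE = re.compile(r"\bEC[\s:]?\d+\.[\d.\-]+")
COG_RE = re.compile(r"\bCOG\d+")


def is_valid_locus_tag(tag):
    return LOCUS_TAG_RE.match(tag) is not None


def check_product(product):
    """Return a list of problems found in a /product description."""
    problems = []
    words = product.split()
    # Heuristic: flag sentence-style capitalisation ("Hypothetical"),
    # but not mixed-case names such as "ATPase-like".
    if words and words[0][:1].isupper() and words[0][1:].islower():
        problems.append("first word is capitalised")
    if "similar to" in product.lower():
        problems.append("uses 'similar to'; use '-like protein' or move it to /note")
    if EC_RE.search(product):
        problems.append("contains an EC number; move it to /note")
    if COG_RE.search(product):
        problems.append("contains a COG number; move it to /note")
    return problems
```

An empty list means that no problem was detected, not that the description is correct; references to specific organisms, for example, are not detected by this sketch.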
Genes are tagged as pseudogenes if they
- are interrupted by more than one stop codon or frameshift, or
- consist of frameshift fragments separated by another gene, or
- have a truncated COG hit covering less than 30% of the full-length COG.
Rules for assigning a tag “/pseudo”:
- mark both the gene and CDS;
- if there is more than 1 fragment of the same gene, use “join”.
Translations of manually modified genes will be adjusted to match the coordinates.
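The pseudogene criteria and the “join” rule can be sketched as follows. The function names and arguments are hypothetical. The location string follows the GenBank/INSDC feature table syntax, and fragment coordinates are assumed to be 1-based and inclusive.

```python
def is_pseudogene(n_interruptions, fragments_split_by_gene, cog_coverage=None):
    """Apply the three criteria; cog_coverage is a fraction (0..1) or None."""
    if n_interruptions > 1:
        return True  # more than one stop codon or frameshift
    if fragments_split_by_gene:
        return True  # frameshift fragments separated by another gene
    return cog_coverage is not None and cog_coverage < 0.30


def join_location(fragments, strand="+"):
    """Build a feature location string from (start, end) fragments."""
    parts = ",".join(f"{start}..{end}" for start, end in sorted(fragments))
    location = f"join({parts})" if len(fragments) > 1 else parts
    if strand == "-":
        location = f"complement({location})"
    return location


location = join_location([(300, 400), (100, 200)])
```

Here `location` is `join(100..200,300..400)`; the same location would be used for both the gene and the CDS, each carrying the “/pseudo” tag.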
For comments or questions about this web page please contact K.Mavrommatis