Documentation
Introduction
This page describes the structure and content of the database
and provides links to detailed information
about the other tools that are used.
The Description of
fields does not provide a complete listing of all
fields in the database but rather an explanation of the records
that are displayed in the web site. If you want detailed information
about the database tables, see
Database Schema.
The coordinate system used in the databases is based on a
single file which contains the complete genome sequence.
Numbering begins with 1 at the first base in this file. Gene starts
are the coordinate of the first base of the first codon of the coding
sequence; tRNA starts and rRNA starts are the coordinate of the first
base of the individual tRNA or rRNA molecule. Preprocessing or
cotranslational events are not considered. When a database gene record
contains a start coordinate that is greater than
the stop coordinate, the nucleotide sequence presented is the
reverse complement of the sequence contained in the genome
nucleotide file.
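The coordinate convention above can be illustrated with a short Python sketch. It assumes the genome is held in memory as a plain string and that coordinates are 1-based and inclusive, as described; the function names and the toy sequence are hypothetical and are not part of the database.

```python
# Complement table for unambiguous DNA bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(genome: str, start: int, stop: int) -> str:
    """Return a feature's sequence from 1-based, inclusive coordinates.

    When start > stop the feature lies on the reverse strand, so the
    reverse complement of the genome segment is returned.
    """
    low, high = min(start, stop), max(start, stop)
    if low < 1 or high > len(genome):
        raise ValueError("coordinates fall outside the genome sequence")
    segment = genome[low - 1:high]  # 1-based inclusive -> Python slice
    return segment if start <= stop else reverse_complement(segment)


genome = "ATGAAACCCGGGTTT"  # toy sequence, not real data
print(extract_feature(genome, 1, 6))  # forward strand
print(extract_feature(genome, 6, 1))  # same bases, reverse strand
```

Swapping the start and stop of a record is therefore enough to request the opposite strand of the same segment.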
The Analytical Tools are grouped by how they are accessed.
Web-based tools can all be run over the internet; the others are stand-alone
programs. The links here open the help pages for the programs. An identical
list is displayed in the left frame, where the links lead to the
tools themselves.
Functional Class Assignments
is a list of the classes assigned in this database.
The Database Schema has a complete listing of all the fields in each table of
the chosen database.
See the Contact Information
to convey your comments,
suggestions, corrections or concerns.
Back to Table of
Contents.
Description of
Fields
Gene Record
- Gene ID:
- This is a unique, locally assigned ID (identifier) for records in the database, which in
some cases will agree with IDs reported in GenBank
entries. For the most part, the design of the gene
ID follows the standard name feature key (e.g. MG001 for
M. genitalium).
- DNA Molecule Name:
- Name of the molecule (i.e. chromosome or plasmid).
- GenBank ID:
- Unique ID assigned by GenBank when a sequence has been submitted to
the database.
- BGene ID:
- Gene ID that references another database's naming scheme. Not always used.
- Definition:
- Description of the predicted gene's function.
- Gene Name:
- Usually a three or four character name; duplicate and
triplicate names are common but not all genes have
assigned names.
- Gene Start:
- The coordinate of the first nucleotide of the first amino
acid in the predicted protein.
- Gene Stop:
- The coordinate of the last nucleotide of the codon
preceding the predicted stop codon. The stop codon is not the gene stop.
- Gene Length:
- The length of the nucleotide coding sequence, from the first
base of the start codon to the last base of the codon preceding
the stop codon. Calculated as abs(gene stop - gene start) + 1.
The stop codon is not included in the length.
- Molecular Weight:
- The molecular weight of the protein, calculated from the protein sequence.
Molecular weights that have been determined experimentally are noted in the
comments.
- pI:
- The pH at which the net charge of the protein is zero,
calculated from the protein sequence using the isoelectric command of the GCG
package.
- Net Charge:
- The net charge of the protein in a pH 7.0
environment. This is calculated from the protein sequence using the
isoelectric command of the GCG package.
- EC:
- Enzyme Commission (EC) numbers refer to enzymatic steps. These
numbers are determined by the
Nomenclature
Committee of the International Union of Biochemistry and Molecular
Biology (NC-IUBMB). Since they refer to enzymatic steps and not to
proteins per se, one enzyme can be assigned more than one EC number.
- Functional Class:
- Classification of the proposed cellular function. A list of
categories used for the
bacterial genomes is provided. The format for functional class is a broad class followed by a semicolon and then a more specific class (similar to pathway).
- Pathway:
- The name of the pathway in which the protein is
thought to participate. The format of the field is similar to
that of the functional class field. Each pathway field can
contain, for each main category, the main category followed by a
semicolon and then a sub-category followed by a semicolon.
- Primary Laboratory Evidence:
- References to experimental lab work pertaining
to the sequence itself or to orthologs in the same genus.
- Secondary Laboratory Evidence:
- References to experimental lab work pertaining
to sequences highly similar to the sequence but not to
organisms in the same genus.
- Comment:
- A text field where any specific details about the record
can be placed.
- Blast Summary:
- Summary of results from the sequence alignment tool
PSI-BLAST.
- COGs Summary:
- Summary of the results of the COG (Clusters of
Orthologous
Groups) analysis. Scope of relatedness is given in
the phylogenetic pattern; best hits are the subset of
relations that best match this pattern.
- Blocks Summary:
- Summary of results obtained from
Blocks, a sequence analysis tool
for finding ungapped segments corresponding to the most highly conserved
regions of proteins.
- ProDom Summary:
- Summary of results obtained from the protein domain database search tool, ProDom.
- Paralogs:
- A term that has been used equivocally to denote 1)
similar sequences that have arisen through duplication
prior to diversification; different functions are
presupposed; 2) similar sequences in a single organism
that in some instances would be better termed isologs. In
the absence of biochemical information, paralogs in this
database are homologs that are not obviously orthologs.
- Pfam Summary:
- Summary of results obtained from the protein family database, Pfam.
- Structural Feature(s):
- Structural features are predicted using PHD, SignalP,
SEG, and Coils.
These are stored in a custom structure within one field in
the
database.
- PDB Hit:
- Results from BLAST hits to sequences in PDB (Protein Data Bank). Use PDB to search for 3-D macromolecular structures.
- Gene Protein Sequence:
- The amino acid sequence from the first codon to the
codon preceding the stop codon.
- Gene Nucleotide Sequence:
- The nucleotide sequence from the first base of the start
codon to the last base of the codon preceding the stop
codon.
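As a small illustration of two of the fields above, the following Python sketch computes Gene Length from the start and stop coordinates and splits a Functional Class value into its broad and specific parts. The function names and the record values are hypothetical; only the formula abs(gene stop - gene start) + 1 and the semicolon format come from the field descriptions.

```python
def gene_length(start: int, stop: int) -> int:
    """Length of the coding sequence; coordinates are inclusive and
    the stop codon lies outside the start..stop range."""
    return abs(stop - start) + 1


def split_class(value: str):
    """Split 'broad class; specific class' into a (broad, specific) pair."""
    broad, _, specific = value.partition(";")
    return broad.strip(), specific.strip()


# Hypothetical record values, for illustration only.
length = gene_length(1500, 601)  # reverse-strand gene: start > stop
print(length, length // 3)       # nucleotides, then codons
print(split_class("translation; ribosomal proteins"))
```

Because the stop codon is excluded, the length of a complete coding sequence is expected to be a multiple of three, which makes a convenient sanity check on a record.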
Intergenic Space Record
- IGS ID:
- Unique identifier assigned by the database. Always begins with IGR (intergenic region) and is
followed
by a number. The number is a relative position marker, that
is to say
that IGR1 is closer to the first base in the genome
nucleotide file than is
IGR2. IGR numbers do not correspond with gene IDs.
- DNA Molecule Name:
- Name of the molecule (i.e. chromosome or plasmid).
- IGS Start:
- The first base of the intergenic region on the plus or top
strand.
- IGS Stop:
- The last base of the intergenic region on the plus or top
strand.
- Features:
- Description of structural features contained in the IGS, such as tRNA and rRNA genes.
- Comment:
- A text field where any specific detail about the record
can be placed.
- IGS Nucleotide sequence:
- The nucleotide sequence of the intergenic region.
The sequence represents the plus or top strand since there is no
directionality
associated with an intergenic region. Features within an
intergenic
region may have directionality; this information is stored
in another
table. For example the directionality of a tRNA molecule
would be
stored in the tRNA table.
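The relative numbering of IGR identifiers can be sketched as follows. This is only an illustration of the naming rule (IGR1 lies closer to the first base of the genome file than IGR2); how the database actually delimits intergenic regions is not stated here, so the gap-finding logic, the function name, and the treatment of the genome as linear are all assumptions.

```python
def intergenic_regions(genes, genome_length):
    """Return (igs_id, start, stop) for the gaps between genes.

    genes: iterable of (start, stop) pairs, 1-based and inclusive;
    start may exceed stop for reverse-strand genes.
    IGS coordinates are always reported on the plus strand.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e in genes)
    gaps = []
    cursor = 1  # first base not yet covered by any gene
    for low, high in intervals:
        if low > cursor:
            gaps.append((cursor, low - 1))
        cursor = max(cursor, high + 1)  # also handles overlapping genes
    if cursor <= genome_length:
        gaps.append((cursor, genome_length))
    # Number the regions by position along the genome file.
    return [(f"IGR{n}", start, stop) for n, (start, stop) in enumerate(gaps, 1)]


# Toy coordinates, not real data.
for region in intergenic_regions([(11, 40), (90, 61), (35, 50)], 100):
    print(region)
```

Only protein-coding genes would be passed in under this sketch, so that tRNA and rRNA genes end up inside an intergenic region, as the Features field implies.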
tRNA Record
- tRNA ID:
- This is the primary key in the tRNA table. An example
of a tRNA_ID is tRNA-Arg-4, which denotes the fourth arginine
tRNA in the genome. Ordering is by start coordinate, with
tRNA-Arg-1 having the smallest start coordinate of any of the
arginine tRNAs.
- DNA Molecule Name:
- Name of the molecule (i.e. chromosome or plasmid).
- tRNA Start:
- The first base of the predicted mature tRNA
molecule.
If the start is greater than the stop, then the gene is on
the reverse or bottom
strand.
- tRNA Stop:
- The last base of a predicted tRNA molecule.
- IGS ID:
- Unique identifier assigned by the database. Always begins with IGR (intergenic region) and is
followed
by a number. The number is a relative position marker, that
is to say
that IGR1 is closer to the first base in the genome
nucleotide file than is
IGR2.
- Unique ID that corresponds to one entry in the IGS table.
- Anticodon:
- The three letter nucleotide sequence of the tRNA molecule
which acts as the anticodon.
- tRNA Nucleotide Sequence:
- Nucleotide sequence of the tRNA.
- Comment:
- A text field where any specific detail about the record
can be placed.
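The tRNA naming and strand conventions above can be sketched in Python. The input tuples and the function name are hypothetical; the rules applied (numbering per amino acid by increasing start coordinate, and reverse strand when start is greater than stop) are the ones given in the field descriptions.

```python
from collections import defaultdict


def assign_trna_ids(trnas):
    """Assign IDs such as tRNA-Arg-1 to (amino_acid, start, stop) tuples.

    Returns a list of (trna_id, strand, start, stop), ordered by start.
    """
    counters = defaultdict(int)
    records = []
    for amino_acid, start, stop in sorted(trnas, key=lambda t: t[1]):
        counters[amino_acid] += 1
        strand = "-" if start > stop else "+"  # start > stop: bottom strand
        trna_id = f"tRNA-{amino_acid}-{counters[amino_acid]}"
        records.append((trna_id, strand, start, stop))
    return records


# Toy coordinates, not real data.
for record in assign_trna_ids([("Arg", 5000, 5076), ("Leu", 120, 204), ("Arg", 900, 824)]):
    print(record)
```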
rRNA Record
- rRNA ID:
- This is the primary key in the rRNA table. The syntax of
the ID is a number (which is the standard weight in terms of
sedimentation
properties) followed by S followed by rRNA followed by _
followed by a
number (this number is used to distinguish rRNA molecules of
the same
weight). For example, 16SrRNA_1. The last two characters
(_1) may
be omitted if there is only one rRNA operon in the genome, as
is the
case in Mycoplasma genitalium.
- DNA Molecule Name:
- Name of the molecule (i.e. chromosome or plasmid).
- rRNA Start:
- The first base of the rRNA gene. If the start
coordinate is less than stop coordinate then the gene is
coded for on the
plus or top strand. The plus or top strand is defined by the primary sequence of
the
genome as submitted to GenBank.
- rRNA Stop:
- The last base of the rRNA gene.
- IGS ID:
- Unique identifier assigned by the database. Always begins with IGR (intergenic region) and is
followed
by a number. The number is a relative position marker, that
is to say
that IGR1 is closer to the first base in the genome
nucleotide file than is
IGR2.
- rRNA Nucleotide Sequence:
- Nucleotide sequence of the rRNA.
- Comment:
- A text field where any specific details about the record
can be placed.
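The rRNA ID syntax described above lends itself to a regular expression. The following is a minimal sketch; the pattern and the function name are assumptions for illustration, and allowing a decimal size such as 5.8 is a guess that goes beyond the single example given.

```python
import re

# <size>SrRNA, optionally followed by _<copy number>, e.g. 16SrRNA_1.
_RRNA_ID = re.compile(r"^(\d+(?:\.\d+)?)SrRNA(?:_(\d+))?$")


def parse_rrna_id(rrna_id: str):
    """Return (size, copy) for an rRNA ID; copy is None when omitted."""
    match = _RRNA_ID.match(rrna_id)
    if match is None:
        raise ValueError(f"not a valid rRNA ID: {rrna_id!r}")
    size, copy = match.groups()
    return size + "S", (int(copy) if copy else None)


print(parse_rrna_id("16SrRNA_1"))
print(parse_rrna_id("23SrRNA"))  # single-operon genome: suffix omitted
```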
Repeat Record
- Repeat Name:
- Unique identifier assigned and used by the database.
- Repeat Type:
- A description of the type of repeat (e.g. tandem, inverted, direct).
- DNA Molecule Name:
- Name of the molecule (i.e. chromosome or plasmid).
- Repeat Unit Coordinates:
- Start:
- Start coordinate for each unit of the repeat.
- Stop:
- Stop coordinate for each unit of the repeat.
- Comment:
- A text field where any specific details about the record
can be placed.
Analytical
Tools
Local BLAST
Local BLAST Search is a standard BLAST search run against our local databases
at the Los Alamos National Laboratory
(LANL), rather than at the National Center for Biotechnology Information (NCBI). In addition to the nr
and nt databases, which are
downloaded from NCBI monthly, our local databases include many bacterial and viral databases located
at LANL. Local BLAST Search allows
BLAST searches against the same genome for paralogs as well as against any specific bacterial or viral database of
interest.
Click Here for help on
general BLAST searching.
PSI-BLAST
BLAST (Basic Local Alignment Search Tool) is a set of
similarity search programs designed to explore all of the
available sequence databases regardless of whether the query
is protein or DNA. For literature references
please see Goodman L 1997 "More blast for the buck." Genome
Research 7:858-859 and Altschul SF et al., 1997 "Gapped BLAST and
PSI-BLAST: a new generation of protein database search
programs." Nucleic Acids Res. 25:3389-3402.
PSI-BLAST searches are iterated, with a position-specific scoring
matrix. It seeks to identify single gapped alignments, rather than a
collection of ungapped alignments. The matrix used in iteration i+1 is
computed from the significant alignments found in iteration i.
The success of iterative BLAST searching depends on the quality
of the matrix produced in the previous iteration. This in turn depends
on the homologous nature of the set of sequences which match the query
with an E-value better than some threshold. Weighting is performed
on the set of sequences used to generate the matrix according to
Henikoff S and Henikoff JG 1994 J. Mol. Biol. 216:813-818, so that
sequences in the set that have high similarities are not weighted as
much as those from a smaller set of more divergent sequences.
The blastpgp arguments and argument values most commonly used:
| Argument | Description | Value |
| -v | Number of one-line descriptions to display (default 250). | 10 |
| -b | Number of alignments to display (default 250). | 10 |
| -m | Alignment view (default 0). | 3 |
| -I | Show GIs in the defline (default F). | T |
| -a | Number of processors to use (default 1). | 2 |
| -F | Filter query sequence with SEG (default F). | T |
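To show how the argument values in the table fit together, the Python sketch below assembles a blastpgp command line. The six table arguments come from this page; the -i (query file), -d (database) and -j (number of iterations) options are standard options of the legacy NCBI blastpgp program but are not listed here, and the file name, database name, iteration count and function name are hypothetical. The command is only built and printed, not run.

```python
# Argument values taken from the table on this page.
TABLE_ARGUMENTS = {
    "-v": "10",  # one-line descriptions to display
    "-b": "10",  # alignments to display
    "-m": "3",   # alignment view
    "-I": "T",   # show GIs in the defline
    "-a": "2",   # processors to use
    "-F": "T",   # filter query sequence with SEG
}


def blastpgp_command(query_path, database, iterations):
    """Build an argument list suitable for subprocess.run()."""
    command = ["blastpgp", "-i", query_path, "-d", database, "-j", str(iterations)]
    for flag, value in TABLE_ARGUMENTS.items():
        command += [flag, value]
    return command


# Hypothetical query file and database name.
print(" ".join(blastpgp_command("query.fasta", "nr", 5)))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to subprocess.run().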
COGs
COGs stands for Clusters of Orthologous Groups of proteins. The
proteins that comprise each COG are assumed to have evolved
from an ancestral protein, and are therefore either orthologs
or paralogs. COGs were delineated by comparing protein
sequences encoded in 21 complete genomes, representing 17
major phylogenetic lineages. Each COG consists of individual
proteins or groups of paralogs from at least 3 lineages and
thus corresponds to an ancient conserved domain.
There are two basic issues in understanding COG analysis: 1) how
the COG database has been built; 2) how one uses this database for
the purpose of annotation. The database was built by doing pairwise
comparisons of the 43,897 proteins in the 21 complete genomes listed
in the following table. For each protein, the best hit (BeT) in each
of the other genomes was detected. A COG is then defined by a
relationship of BeTs. The database is used by BLASTing an unknown
sequence against the set of all genomes in the COGs database, and
looking for the case in which the unknown sequence has BeTs to more
than one member of the COG.
A phylogenetic pattern is a series of lowercase letters, uppercase letters,
and/or dashes that is a shorthand representation of the presence or
absence of proteins from a particular organism in the COG of
interest. Each letter in a pattern represents a particular
organism, given in the table below, along with the pattern
position assigned to that organism. Uppercase letters indicate that at least two orthologs belong to that COG.
Organism Name and Abbreviation
| Organism Name | Code |
| Archaeoglobus fulgidus | a |
| Methanococcus jannaschii | m |
| Methanobacterium thermoautotrophicum | t |
| Pyrococcus horikoshii | k |
| Saccharomyces cerevisiae | y |
| Aquifex aeolicus | q |
| Thermotoga maritima | v |
| Synechocystis sp. PCC6803 | c |
| Escherichia coli | e |
| Bacillus subtilis | b |
| Mycobacterium tuberculosis | r |
| Haemophilus influenzae | h |
| Helicobacter pylori 26695 | u |
| Helicobacter pylori J99 | j |
| Mycoplasma genitalium | g |
| Mycoplasma pneumoniae | p |
| Borrelia burgdorferi | o |
| Treponema pallidum | l |
| Chlamydia trachomatis | i |
| Chlamydia pneumoniae | n |
| Rickettsia prowazekii | x |
The phylogenetic pattern -----qvcE-------o---x, for example, would
indicate that
Aquifex aeolicus, Thermotoga maritima, Synechocystis sp.
PCC6803, Borrelia burgdorferi and Rickettsia prowazekii
each have one ortholog which belongs to the COG and Escherichia
coli has at least two that belong to the COG.
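Reading a phylogenetic pattern can be sketched in Python as follows. The organism order and one-letter codes are taken from the table above, on the assumption that the table order is the pattern position (the example pattern is consistent with this); the function name and the wording of the returned values are illustrative only.

```python
# (code, organism) pairs in pattern-position order, from the table above.
ORGANISMS = [
    ("a", "Archaeoglobus fulgidus"), ("m", "Methanococcus jannaschii"),
    ("t", "Methanobacterium thermoautotrophicum"), ("k", "Pyrococcus horikoshii"),
    ("y", "Saccharomyces cerevisiae"), ("q", "Aquifex aeolicus"),
    ("v", "Thermotoga maritima"), ("c", "Synechocystis sp. PCC6803"),
    ("e", "Escherichia coli"), ("b", "Bacillus subtilis"),
    ("r", "Mycobacterium tuberculosis"), ("h", "Haemophilus influenzae"),
    ("u", "Helicobacter pylori 26695"), ("j", "Helicobacter pylori J99"),
    ("g", "Mycoplasma genitalium"), ("p", "Mycoplasma pneumoniae"),
    ("o", "Borrelia burgdorferi"), ("l", "Treponema pallidum"),
    ("i", "Chlamydia trachomatis"), ("n", "Chlamydia pneumoniae"),
    ("x", "Rickettsia prowazekii"),
]


def decode_pattern(pattern: str) -> dict:
    """Map each organism present in the COG to 'one' or 'at least two'."""
    if len(pattern) != len(ORGANISMS):
        raise ValueError("pattern must have one symbol per organism")
    present = {}
    for symbol, (code, name) in zip(pattern, ORGANISMS):
        if symbol == "-":
            continue  # organism has no protein in this COG
        if symbol.lower() != code:
            raise ValueError(f"unexpected symbol {symbol!r} for {name}")
        present[name] = "at least two" if symbol.isupper() else "one"
    return present


for organism, count in decode_pattern("-----qvcE-------o---x").items():
    print(f"{organism}: {count}")
```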
COG Functional Class Abbreviations
| Information storage and processing |
| J | Translation, ribosomal structure and biogenesis |
| K | Transcription |
| L | DNA replication, recombination, and repair |
| Cellular processes |
| D | Cell division and chromosome partitioning |
| M | Cell envelope biogenesis, outer membrane |
| N | Cell motility and secretion |
| O | Posttranslational modification, protein turnover, chaperones |
| P | Inorganic ion transport and metabolism |
| T | Signal transduction mechanisms |
| Metabolism |
| C | Energy production and conversion |
| E | Amino acid transport and metabolism |
| F | Nucleotide transport and metabolism |
| G | Carbohydrate transport and metabolism |
| H | Coenzyme transport and metabolism |
| I | Lipid metabolism |
| Poorly characterized proteins |
| R | General function prediction only |
| S | Function unknown |
For further information
see Tatusov RL, Koonin EV, and Lipman DJ. 1997 "A genomic
perspective on protein families." Science 278:631-637.
ProDom
ProDom (protein domain database) has been designed as a
tool to help analyze domain
arrangements of proteins and protein families.
It consists of an automatic compilation of homologous
domains.
Current versions of ProDom are built using a novel procedure
based on recursive PSI-BLAST
searches (Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang
Z, Miller W & Lipman DJ, 1997,
Nucleic Acids Res., 25:3389-3402; Gouzy J., Corpet F. & Kahn
D., 1999, Computers and
Chemistry 23:333-340.) Large families are much better
processed with this new procedure than
with the former DOMAINER program (Sonnhammer, E.L.L. & Kahn,
D., 1994, Protein Sci.,
3:482-492).
Blocks
Blocks are
multiply aligned ungapped segments corresponding to the most
highly conserved regions of proteins. A database of these blocks
has been built, and a query sequence is searched for local
similarities to blocks in the database. Local and global
alignments are scored independently so that they can be used
in concert to infer homology.
For more information, see Henikoff S and Henikoff JG 1994
"Protein family classification based on searching a database
of blocks." Genomics 19:97-107 and Henikoff S, Henikoff
JG, Alford WJ, and Pietrokovski S 1995 "Automated
construction and graphical presentation of protein blocks
from unaligned sequences." Gene-COMBIS. Gene 163 (1995)
GC 17-26.
Pfam
Pfam is a database of
multiple alignments of protein domains or
conserved protein regions. The alignments represent some evolutionarily
conserved structure which has implications for the protein's function.
Profile hidden Markov models (profile HMMs) built from the Pfam
alignments can be very useful for automatically recognizing that a
new protein belongs to an existing protein family, even if the homology
is weak. Unlike standard pairwise alignment methods (e.g. BLAST, FASTA),
Pfam HMMs deal sensibly with multidomain proteins.
Pfam-Pro is a new procaryotic protein family database from
TimeLogic. It consists of
Hidden Markov Models of protein domains or conserved protein regions. The
models in Pfam-Pro have been built from Pfam alignments to a number of
completed procaryotic genomes. Unlike Pfam, the models in Pfam-Pro are
trained exclusively on procaryotes, and may therefore show an increased
selectivity on other procaryotes.
For more information see
The Pfam protein families database.
A. Bateman, E. Birney, R. Durbin, S.R. Eddy, K.L. Howe, and E.L.L. Sonnhammer
Nucleic Acids Research, 28:263-266, 2000.
PDB
PDB server
is used to predict protein 3-D structure based on homologous sequence searching.
It uses a version of
NRDB that includes all the PDB entries (excluding the BRK_MOD sequences and sequences only containing
'X's). Sequences are compared to this database with PSI-BLAST (Altschul et al., Nucl. Acids Res., 1997),
using an e-value cutoff of 0.001, and a maximum of five iterations.
For more information see H.M.Berman, J.Westbrook, Z.Feng, G.Gilliland, T.N.Bhat, H.Weissig,
I.N.Shindyalov, and P.E.Bourne. 2000. The Protein Data Bank.
Nucleic Acids Research, 28, 235-242 and
M.Huynen, T.Doerks, F.Eisenhaber, C.Orengo, S.Sunyaev, Y.P.Yuan, and P.Bork. 1998.
Homology-based fold prediction for Mycoplasma genitalium proteins.
J. Mol. Biol. 280, 323-326.
Entrez
Entrez is NCBI's search and retrieval system. With Entrez, one can search DNA and protein sequence databases, complete genomes, 3-D protein structures, population sequences and literature.
SEALS
The SEALS
package is designed specifically for large-scale research
projects in bioinformatics. It is based on a friendly
command line interface in the UNIX environment. It is scalable and
provides dozens of commands which allow the user to quickly answer complex
questions. While the data presented in the STD database is based on
specific analysis tools described below, the SEALS package has been invaluable
in the linking of various tools, the parsing of the resulting data, and the
retrieval of data from standard databases.
For more information on the SEALS package, see Walker, DR, and Koonin, EV
(1997) SEALS: A System for Easy Analysis of Lots of Sequences. Intelligent
Systems for Molecular Biology 5:333-339.
SIGNALP
See Nielsen H, Engelbrecht J, Brunak S, and von Heijne G (1997)
"Identification of prokaryotic and eukaryotic signal peptides and
prediction of their cleavage sites." Protein Engineering 10:1-6.
For a review of signal prediction methods, see Claros MG, Brunak S,
and von Heijne G (1997) "Prediction of N-terminal protein sorting
signals." Current Opinion in Structural Biology 7:394-398.
SignalP is an application of neural networks to the problem of
identifying protein sorting signals and predicting their cleavage
sites. This is possible because these functional units are encoded
by linear sequences of amino acids rather than by a 3D structure.
Reported performance values are presented below in a table
reproduced from the Nielsen et al. reference given in the preceding
paragraph.
| Source | Total number of Proteins | Cleavage Site Location (% correct) | Signal Peptide Discrimination (correlation)* |
| Eukaryote | 1831 | 70.2 | 0.97 |
| Gram - | 452 | 79.3 | 0.88 |
| Gram + | 205 | 67.9 | 0.96 |
* The ability of the method to distinguish between signal peptides
and the N-terminals of nonsecretory proteins is measured by the
correlation coefficient (Matthews 1975 Biochim Biophys Acta 405:442-451).
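The correlation coefficient cited in the footnote is commonly computed from the four cells of a two-class confusion matrix. The Python sketch below shows that calculation; the function name is hypothetical, the counts are invented for illustration, and returning 0.0 for an empty row or column is a common convention rather than something stated in the reference.

```python
import math


def matthews_correlation(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a two-class prediction.

    tp/fn: signal peptides predicted correctly / missed.
    tn/fp: non-secretory proteins predicted correctly / wrongly.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Invented counts, not values from the SignalP paper.
print(round(matthews_correlation(tp=40, tn=45, fp=5, fn=10), 3))
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect discrimination).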
Psort
PSORT is a computer program for the prediction of protein
localization sites in cells. It receives an amino acid sequence and its source origin,
e.g., Gram-negative bacteria, as inputs. Then, it analyzes
the input sequence by applying the stored rules for various sequence features of known protein sorting signals.
Finally, it reports the possibility for the input protein to be localized at each candidate site with additional
information. For more help on PSORT, read Psort Users' Manual.
PHD
See Rost, B (1996) "PHD: predicting one-dimensional
protein structure by profile-based neural networks."
Methods Enzymol 266:525-39.
COILS
See Lupas, A (1996) "Prediction and analysis of coiled
coil
structures."
Methods Enzymol 266: 513-525.
SEG
See Wootton, JC, and Federhen, S (1996) "Analysis of
compositionally
biased regions in sequence databases." Methods Enzymol
266:554-571.
tRNAscan-SE
See Lowe, T.M. & Eddy, S.R. (1997) "tRNAscan-SE: A program
for improved detection of transfer RNA genes in genomic sequence."
Nucl Acids Res 25: 955-964.
tRNAscan-SE identifies tRNA genes in genomic DNA sequences (as well
as in RNA sequences). The program uses a modified, optimized version
of tRNAscan v1.3 (Fichant & Burks, J. Mol. Biol. 1991, 220: 659-671),
a new implementation of a multistep weight matrix algorithm for
identification of eukaryotic tRNA promoter regions (Pavesi et al.,
Nucl. Acids Res. 1994, 22: 1247-1256), as well as the RNA covariance
analysis package Cove v.2.4.2 (Eddy & Durbin, Nucl. Acids Res. 1994,
22: 2079-2088).
Functional Class Assignments - Bacterial
- amino acid biosynthesis:
- aspartate family
- methionine, selenomethionine
- biosynthesis of cofactors, prosthetic groups, and carriers:
- folate
- heme and porphyrin
- riboflavin
- thiamin
- cellular processes:
- cell division
- cell killing
- chaperones
- detoxification
- protein and peptide secretion
- central intermediary metabolism:
- one carbon metabolism
- other
- phosphorus compounds
- polysaccharides
- energy metabolism:
- ATP-proton motive force interconversion
- glycolysis and gluconeogenesis
- pentose phosphate pathway
- pyruvate metabolism
- starch and sucrose metabolism
- sugars
- fatty acid and phospholipid metabolism:
- other categories:
- drug and analog sensitivity
- transposases
- Ureaplasma-specific antigen
- purines, pyrimidines, nucleosides, and nucleotides:
- deoxyribonucleotide metabolism
- general
- nucleotide and nucleoside interconversions
- purines
- pyrimidines
- salvage of nucleosides and nucleotides
- replication:
- DNA replication, restriction, modification, recombination, and repair
- transcription:
- DNA-dependent RNA polymerase
- RNA degradation
- RNA modification
- RNA processing
- transcription factors
- translation:
- aminoacyl-tRNA synthetases
- degradation of proteins, peptides, and glycopeptides
- protein modifications
- ribosomal proteins
- translation factors
- tRNA modification
- transport and binding proteins:
- amino acids, peptides and amines
- ammonium
- anions
- carbohydrates, organic alcohols, and acids
- cations
- ferrichrome
- general
- iron
- other
- unknown:
- hypothetical
- conserved hypothetical
Database Schema
This database was created using MySQL,
a freely distributed SQL (Structured Query Language) database server.
Choose the organism from the selection list and the "List Fields" button
will retrieve a list that contains all the fields for all of the tables.
Contact Information
For comments or questions, please contact
the Help Desk.
Bioscience Division, B-N1
Los Alamos National Laboratory
TA-43, HRL-1, MS M888
Los Alamos, NM 87545
Los Alamos National Laboratory • Est 1943