Research Abstracts
DOE Microbial Genome Program Report
Section 2: Functional and Computational
Analysis
Pangenomic Microbial
Comparisons by Subtractive Hybridization
Peter Agron, Lyndsay Radnedge, Evan Skowronski,
Madison Macht, Jessica Wollard, Sylvia Chin, Aubree
Hubbell, Marilyn Seymour, Christina Nocerino, and Gary
Andersen
Biology and Biotechnology Research Program; Lawrence
Livermore National Laboratory; 7000 East Ave.;
Livermore, CA 94550
Andersen: 925/423-2525, Fax: /422-2282,
andersen2@llnl.gov
Sequencing of whole genomes is reshaping microbiology.
However, as more sequence information is generated,
there will be increased sequence redundancy between
closely related species or strains. Over time,
obtaining new sequence information by whole-genome
sequencing with current technology will become
increasingly less cost-efficient. We are
exploring the use of suppression subtractive
hybridization (SSH) of total DNA as a means of focusing
sequencing efforts on unique regions when a reference
strain of known sequence is compared to a different
isolate of the same species or genus. To rigorously
examine this approach, two sequenced strains of
Helicobacter pylori (J99 and
26695) were used as a model system to allow
rapid determination and mapping of difference products
based on sequencing alone.
Using high-throughput SSH methods, difference products
can be rapidly cloned, sequenced, and then mapped by
comparing the data to the H. pylori genome
database. To increase the likelihood of amplifying
difference products from any given region, several
restriction enzymes were used in separate SSH
experiments. We have obtained data from 2123 clones
that reveal 427 (20%) unique sequences. Control
subtractions with an Escherichia coli strain
containing the transposon Tn5 against its isogenic
parent showed a 270-fold enrichment for Tn5 sequences,
demonstrating that SSH is highly effective. Current
efforts are focused on (1) mapping difference products
onto the relevant genome using the cross-match
algorithm and Percent Identity Plots, (2) assessing
coverage of difference regions by subtracted clones,
(3) assessing the redundancy of this coverage, and (4)
determining the reproducibility of SSH.
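As an illustration of items (2) and (3) above, the following is a minimal sketch of how coverage and redundancy could be computed once subtracted clones have been mapped to genome coordinates. The function names, the half-open coordinate convention, and the definition of redundancy (mean clone depth over covered bases) are assumptions made for this example, not the authors' procedure.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genome coordinates (assumed convention)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_and_redundancy(region: Interval,
                            clones: List[Interval]) -> Tuple[float, float]:
    """Return (fraction of the region covered, mean clone depth over covered bases)."""
    r_start, r_end = region
    # Clip each clone to the region and drop clones that miss it entirely.
    clipped = [(max(s, r_start), min(e, r_end)) for s, e in clones]
    clipped = [(s, e) for s, e in clipped if s < e]
    covered = sum(e - s for s, e in merge_intervals(clipped))
    total_clone_bases = sum(e - s for s, e in clipped)
    fraction = covered / (r_end - r_start)
    redundancy = total_clone_bases / covered if covered else 0.0
    return fraction, redundancy


# Hypothetical difference region and clone positions, for illustration only.
fraction, redundancy = coverage_and_redundancy(
    (1000, 2000), [(900, 1300), (1200, 1500), (1800, 2100)]
)
```

In this made-up case the three clones cover 700 of the 1000 bases in the region, and the covered bases are hit by slightly more than one clone on average.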
The Genome of the Extremely
Radioresistant Bacterium Deinococcus
radiodurans: Comparative Genomics
Kira S. Makarova,1,2 Eugene V.
Koonin,3 L. Aravind,2 Kenneth W.
Minton,1 Roman L. Tatusov,2 Y. I.
Wolf,2 Owen White,3 and Michael
J. Daly1
1Uniformed Services University of the Health
Sciences; 4301 James Bridge Rd.; Bethesda, MD
20814-4799
301/295-3750, Fax: -1640,
mdaly@mxb.usuhs.mil
2National Center for Biotechnology
Information; National Institutes of Health; Bethesda,
MD 20814
3The Institute for Genomic Research;
Rockville, MD 20850
Extremophiles are nearly always defined by a single
characteristic that allows existence within a single
extreme environment. The bacterium Deinococcus
radiodurans qualifies as a polyextremophile,
showing remarkable resistance to a range of damage
caused by ionizing radiation, desiccation, ultraviolet
radiation, oxidizing agents, and electrophilic
mutagens. D. radiodurans is most famous for its
extreme resistance to ionizing radiation; it not only
can grow continuously in the presence of chronic
radiation (6000 rad/hour), but it can survive acute
exposures to gamma radiation that exceed 1.5 Mrad
without lethality or induced mutation. These
characteristics were the impetus for sequencing its
genome and the ongoing development of its use for
bioremediation of radioactive wastes.
Although it is known that these myriad resistance
phenotypes stem from its efficient DNA repair
processes, the mechanisms underlying this repair remain
poorly understood. In this work we present an extensive
comparative sequence analysis of the Deinococcus
genome. Deinococcus is the first representative
with a completely sequenced genome from a bacterial
branch of extremophiles, the Thermus-Deinococcus
group. Phylogenetic tree analysis, combined with the
identification of several synapomorphies between
Thermus and Deinococcus, supports the view that
this group is a very ancient branch located near the
root of the bacterial tree. Distinctive features of the
Deinococcus genome, as well as features shared
with other free-living bacteria, were revealed by
comparing its proteome to a collection of clusters of
orthologous groups of proteins (called COGs). Analysis
of paralogs in Deinococcus has revealed some
unique protein families. In addition, specific
expansions of several protein families including
phosphatases, proteases, acyl transferases, and MutT
pyrophosphohydrolases were detected. Genes that
potentially affect DNA repair and recombination were
investigated in detail.
Some proteins appear to have been transferred
horizontally from eukaryotes and are not present in
other bacteria. For example, three proteins homologous
to plant desiccation-resistance proteins were
identified; these are particularly interesting because
of the positive correlation between resistance to
desiccation and resistance to radiation. Further, the D.
radiodurans genome is very rich in repetitive
sequences, namely IS-like transposons and small
intergenic repeats. In combination, these observations
suggest that several different biological mechanisms
contribute to the multiple DNA repair-dependent
phenotypes of this organism. The genetic mechanisms
underlying the extreme radiation resistance of the
organism are now being characterized experimentally
using a newly developed system for analyzing gene
expression patterns in D. radiodurans.
Protein Expression in
Methanococcus jannaschii
and Pyrococcus furiosus
Carol S. Giometti, S. L. Tollaksen, H.
Lim,1 J. Yates,1 J.
Holden,2 A. Lal Menon,2
G. Schut,2 M. W. W. Adams,2 C.
Reich,3 and G. Olsen3
Center for Mechanistic Biology and Biotechnology;
Argonne National Laboratory; 9700 S. Cass Ave.;
Argonne, IL 60439
630/252-3839, Fax: -5517,
csgiometti@anl.gov
1University of Washington; Seattle, WA
98195
2University of Georgia; Athens, GA
30602
3University of Illinois; Urbana, IL 61801
Complete genome sequences are now available for both
Methanococcus jannaschii and Pyrococcus
furiosus. The open reading frame (ORF) sequences
from these completed genomes can be used to predict the
proteins synthesized, but laboratory methods are needed
to verify those predictions. Two-dimensional gel
electrophoresis (2DE), coupled with mass spectrometry
of peptides isolated from the gels, is being used to
determine the constitutive expression of proteins from
these two archaea and to explore the regulation of
expression of nonconstitutive proteins. The most
abundant proteins (i.e., those easily detectable by
staining with Coomassie Blue R250) have been isolated
and analyzed from cells grown in minimal nutrient
media. Using a combination of matrix-assisted laser
desorption ionization (MALDI) and tandem mass
spectrometry, 100 proteins expressed by M.
jannaschii and 50 proteins expressed by P.
furiosus have been related to specific ORFs in the
respective genome sequences. The molecular weights and
isoelectric points determined by protein positions in
the 2DE patterns are compared with the ORF-predicted
molecular weights and isoelectric points for each
microbe. Numerous instances have been observed of
multiple proteins with different molecular weights or
isoelectric points being associated with the same ORF.
Possible reasons for such multiplicity include the
incomplete unfolding of these highly stable proteins
prior to electrophoresis, the nondissociation of
subunits, posttranslational modifications such as
phosphorylation (multiple proteins with the same
identity but different isoelectric points), or peptide
cleavage (multiple proteins with the same identity but
different molecular weights). Preliminary experiments
to change the protein expression of these organisms by
altering growth conditions have revealed significant
quantitative changes in a small number of proteins
visible in 2DE patterns. Correlation of proteins
expressed with specific ORFs is now focused on proteins
showing quantitative changes in expression and on less
abundant proteins. The observed protein abundances and
changes in abundance from these proteomic studies could
be useful for validation of protein-expression
predictions based on ORFs.
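The comparison between gel-derived and ORF-predicted molecular weights described above can be sketched as follows. This is an illustrative calculation only: it assumes an unmodified, full-length polypeptide, uses standard average residue masses, and does not attempt the isoelectric-point prediction. The function names are hypothetical.

```python
# Average residue masses in daltons for the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added for the free N- and C-termini


def predicted_molecular_weight(protein: str) -> float:
    """Average molecular weight (Da) of an unmodified polypeptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in protein.upper()) + WATER


def relative_deviation(observed: float, predicted: float) -> float:
    """Signed fractional difference between gel-observed and predicted weights."""
    return (observed - predicted) / predicted
```

A spot whose observed weight falls well below the prediction for its ORF would be consistent with peptide cleavage, whereas a spot well above it might reflect undissociated subunits; the threshold for "well" is a judgment call that this sketch does not make.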
Microbial Genome Annotation
and Display
Frank Larimer, Doug Hyatt, Miriam Land,
Richard Mural, Morey Parang,
Manesh Shah, Jay Snoddy, and Edward Uberbacher
Computational Biosciences; Life Sciences Division; Oak
Ridge National Laboratory; 1060 Commerce Park Dr.; Oak
Ridge, TN 37830
865/574-1253, Fax: /241-1965,
larimerfw@ornl.gov
http://compbio.ornl.gov
and
http://genome.ornl.gov/microbial/
Once the genome of an organism has been sequenced,
portions that define features of biological importance
must be identified and annotated. When the newly
identified gene has a close relative already in DNA or
protein sequence databases, gene finding in
microorganisms is relatively straightforward. The genes
tend to be simple, uninterrupted open reading frames
(ORFs) that can be translated and compared with the
database.
The discovery of new genes without close relatives is
more problematic. Although identifying gene-like ORFs is
easy, it is very difficult to determine which represent
real genes and which are merely statistical artifacts
of the sequence. This is a serious problem in organisms
with a high G+C content, where random ORFs can be
abundant because stop codons are scarce.
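The link between G+C content and spurious ORFs can be made concrete with a small calculation. The sketch below assumes a simple model that is not part of the original abstract: bases are independent, A and T are equally frequent, and G and C are equally frequent. Under that model the three standard stop codons (TAA, TAG, TGA) become rarer as G+C rises, so random open stretches become longer.

```python
def stop_codon_probability(gc: float) -> float:
    """Probability that a random codon is TAA, TAG, or TGA at G+C fraction gc."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    p_at = (1.0 - gc) / 2.0  # P(A) = P(T) under the assumed model
    p_gc = gc / 2.0          # P(G) = P(C) under the assumed model
    # TAA contributes p_at^3; TAG and TGA each contribute p_at^2 * p_gc.
    return p_at ** 3 + 2.0 * p_at ** 2 * p_gc


def mean_codons_between_stops(gc: float) -> float:
    """Mean spacing of stop codons in one reading frame (geometric model)."""
    p = stop_codon_probability(gc)
    return float("inf") if p == 0.0 else 1.0 / p
```

Under these assumptions, raising G+C from 50% to 70% lowers the stop-codon probability from 3/64 to about 0.019, which more than doubles the mean distance between stops in a random reading frame.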
A second issue in modeling microbial genes is accurate
prediction of the start codon, which is complicated
further by the use of minor start codons in addition to
the universal AUG. An accurate accounting and
description of genes in microbial genomes is essential
in determining the existence of functional metabolic
pathways and other aspects of whole-organism function.
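The start-codon ambiguity described above shows up even in a naive ORF scan. The following sketch is not one of the gene-finding systems discussed in this abstract; it is a minimal forward-strand scanner, written on the assumption that GTG and TTG are the minor start codons of interest, that records every in-frame start candidate for each ORF.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
# ATG plus two minor bacterial start codons; which minor starts to allow
# is an assumption of this sketch.
START_CODONS = {"ATG", "GTG", "TTG"}


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[int, int, List[int]]]:
    """Scan the forward strand and return (start, end, candidate_starts).

    Coordinates are 0-based and half-open; end includes the stop codon.
    start is the most upstream start codon, and candidate_starts lists every
    in-frame start codon before the stop. min_codons counts codons from the
    most upstream start up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts: List[int] = []
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if starts and (i - starts[0]) // 3 >= min_codons:
                    orfs.append((starts[0], i + 3, list(starts)))
                starts = []
            elif codon in START_CODONS:
                starts.append(i)
    return orfs
```

An ORF returned with several candidate starts is exactly the case where sequence alone does not say which codon initiates translation; choosing among them requires additional evidence such as similarity to known proteins or a ribosome binding site.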
Compared to simpler gene-prediction methods using ORFs
or single-coding measures, recently developed
gene-finding systems show excellent performance in
predicting coding genes and start sites, even for the
shortest microbial genes. Such highly accurate systems
are effective across the phylogenetic spectrum of
organisms as an essential baseline of analysis from
which much biological insight can be obtained.
Microbial genome sequencing is progressing rapidly.
Apart from the twenty-odd published genomes, more than
100 are being sequenced, with plans to sequence
hundreds or thousands more. Since every new genome
informs those that preceded it, updating genome
annotation is necessary to keep these resources
relevant; and consistent procedures, tools, and
methodology must be applied. The unique functions of
each individual organism need to be documented as
functions are placed in a recognized, consistent
scheme.
We are now representing all completed microbial genomes
in the Genome Channel and the Genome Catalog, providing
comprehensive sequence-based views of genomes from a
full genome display to the nucleotide sequence level.
We have developed tools for comparative multiple-genome
analysis that provide automated, regularly updated,
comprehensive annotation of microbial genomes using
consistent methodology for gene calling and feature
recognition. The visual genome browser represents
around 51,000 microbial GRAIL and 45,000 GenBank gene
models. Precomputed BEAUTY searches are provided for
all gene models, with links to original source material
and additional search engines. Comprehensive
representation of microbial genomes will require deeper
annotation of structural features, including operon and
regulon organization, promoter and ribosome
binding-site recognition, repressor and activator
binding-site calling, transcription terminators, and
other functional elements. Sensor development is in
progress to provide access to these features. Linkage
and integration of the gene-protein-function catalog to
phylogenetic, structural, and metabolic relationships
also will be developed.
A draft analysis pipeline has been constructed to
provide annotation for the Microbial Genome Program of
the Joint Genome Institute. The first two draft
sequences in the pipeline, with many more to come, are
the Nitrosomonas europaea and Prochlorococcus
marinus genomes. Multiple gene callers (Generation,
Glimmer, and Critica) are used to generate a candidate
gene model set. The conceptual translations of these
gene models generate similarity-search results and
protein family relationships; from these results, a
metabolic framework is constructed and functional roles
are assigned. Simple and complex repeats, tRNA genes,
and other structural RNA genes also are identified.
Annotation summaries are available through the JGI
microbial genomics Web site; in addition, draft results
are being integrated into the interactive display
schemes of the Genome Channel and Catalog.
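The step in which several gene callers contribute to one candidate gene model set can be illustrated with a small sketch. The abstract does not say how the calls are combined, so the grouping rule below (calls that share a strand and a stop coordinate are treated as the same gene, possibly with different proposed starts) is an assumption, as are the record layout and the example coordinates. No output format of Generation, Glimmer, or Critica is implied.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class GeneCall(NamedTuple):
    caller: str   # name of the gene caller that produced the call
    start: int    # proposed start coordinate
    stop: int     # proposed stop coordinate
    strand: str   # "+" or "-"


def candidate_gene_models(calls: List[GeneCall]) -> List[Dict]:
    """Group calls by (strand, stop) and list supporting callers and starts."""
    groups: Dict[Tuple[str, int], List[GeneCall]] = defaultdict(list)
    for call in calls:
        groups[(call.strand, call.stop)].append(call)
    models = []
    for (strand, stop), members in sorted(groups.items(), key=lambda kv: kv[0][1]):
        models.append({
            "strand": strand,
            "stop": stop,
            "callers": sorted({m.caller for m in members}),
            "starts": sorted({m.start for m in members}),
        })
    return models


# Made-up coordinates, for illustration only.
example_calls = [
    GeneCall("Glimmer", 100, 400, "+"),
    GeneCall("Critica", 130, 400, "+"),
    GeneCall("Generation", 900, 600, "-"),
]
example_models = candidate_gene_models(example_calls)
```

Grouping on the stop coordinate reflects the common situation in which callers agree on where a gene ends but disagree on its start; a model supported by more than one caller, or one with a single agreed start, could then be given higher priority in later similarity searches.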
WIT2: An Integrated System
for Genetic Sequence Analysis and Metabolic
Reconstruction
Ross Overbeek,1,2 Gordon
Pusch,1,2 Mark D'Souza,1 Evgeni
Selkov Jr.,1,2 Evgeni Selkov,1,2
and Natalia Maltsev1
1Mathematics and Computer Science Division;
Argonne National Laboratory, MCS-221; 9700 S. Cass
Ave.; Argonne, IL 60439
2Integrated Genomics Inc.
Maltsev: 630/252-5195, Fax: -5986,
maltsev@mcs.anl.gov
http://wit.mcs.anl.gov/WIT2
The WIT2 system was designed and implemented to support
genetic sequence and comparative analysis of sequenced
genomes as well as metabolic reconstructions from the
sequence data. It now contains data from 38 distinct
genomes. WIT2 provides access to thoroughly annotated
genomes within a framework of metabolic reconstructions
connected to the sequence data; protein alignments and
phylogenetic trees; and data on gene clusters,
potential operons, and functional domains. We believe
that the parallel analysis of a large number of
phylogenetically diverse genomes can add a great deal
to our understanding of the higher-level functional
subsystems and physiology of the organisms. The unique
features of WIT2 include the following: (1) WIT2 is
based on the unique EMP-MPW collection of enzymes and
metabolic pathways developed by E. Selkov and
colleagues; this collection contains extensive
information on enzymology and metabolism of different
organisms. (2) WIT2 allows researchers to perform
interactive genetic sequence analysis within a
framework of metabolic reconstructions and to maintain
user models of the organism's functionality. (3) WIT2
provides access to a set of Web-based and original batch
tools that offer extensible query access against the
data. (4) WIT2 supports both shared and nonshared
annotation of features and the maintenance of multiple
models of the metabolism for each organism. (5) WIT2
supports metabolic reconstructions from expressed
sequence tag data.
Microbial Proteomics at
Pacific Northwest National Laboratory
Richard D. Smith, Ljiljana Pasa-Tolic, Mary
S. Lipton, Pamela K. Jensen,
Gordon A. Anderson, and Timothy D. Veenstra
Environmental Molecular Sciences Laboratory, MS K8-98;
Pacific Northwest National Laboratory; P.O. Box 999;
Richland, WA 99352
509/376-0723, Fax: -7722,
dick.smith@pnl.gov
Bacterial strains such as Shewanella putrefaciens
MR-1 are key organisms in the bioremediation of
metals due to their ability to enzymatically reduce and
precipitate a diverse range of heavy metals and
radionuclides. Additionally, Deinococcus
radiodurans is an attractive candidate for
bioremediation because of its unique ability to survive
exceedingly high doses of ionizing radiation. An
improved understanding of the enzymatic pathways of
these organisms is important for refining their unique
capabilities for bioremediation. As
a first step, an organism's proteome must be
characterized completely. The proteome is the name
given to the dynamic array of proteins expressed by a
genome. A single genome can exhibit many different
proteomes depending on the stage in the cell cycle;
cell differentiation; response to such environmental
conditions as nutrients, temperature, and stress; and
the manifestation of disease states. Although the
availability of full genomic reference sequences
provides a set of road maps of possibilities and the
measurement of expressed RNAs tells us what might
happen, the proteome is the key that tells us what
really happens. Therefore, the study of proteomes under
well-defined conditions can provide a better
understanding of complex biological processes, but it
requires faster and more sensitive capabilities for
the characterization of microbial protein constituents.
We currently are developing technologies that integrate
and refine protein separation and digestion processes
with advanced Fourier transform ion cyclotron resonance
(FTICR) mass spectrometric methods. In some of these
studies, the cell's protein complement will be digested
with a protease and the resulting peptides will be
analyzed by capillary liquid chromatography-mass
spectrometry (LC-MS). The use of tandem mass
spectrometry (MS/MS) provides additional sequence
information that, when combined with the mass of the
parent peptide, can be used to search existing
databases. This results in peptide identification,
which in turn is used to identify the parent protein.
Additionally, we are extending this mass spectrometric
technology to allow precise quantitation of changes in
the protein complement upon perturbation of the
microbial environment. This technology, based on the
use of stable-isotope labeling, allows the creation of
"comparative displays" for the expression of many
proteins simultaneously. Two versions of each protein
are generated and simultaneously analyzed to study
changes in expression (i.e., repression or induction)
for hundreds to thousands of proteins. We plan to
develop and demonstrate these combined technologies
in a D. radiodurans pilot project
that also would follow changes in the proteome after
exposure to ionizing radiation.
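The peptide-identification step described above (digest the proteins, measure peptide masses, search a sequence database) can be sketched in a few lines. The abstract names only "a protease"; trypsin, with the simple rule of cleaving after K or R unless the next residue is P, is assumed here for illustration. The sketch matches on parent-peptide mass alone and omits the tandem-MS sequence information, and all function names and the tolerance are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MONO = 18.010565


def tryptic_peptides(protein: str) -> List[str]:
    """In-silico digest: cleave after K or R, except when followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein.upper()) if p]


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass (Da) of a neutral, unmodified peptide."""
    return sum(MONOISOTOPIC[aa] for aa in peptide) + WATER_MONO


def match_peptides(observed_mass: float, proteins: Dict[str, str],
                   tol_ppm: float = 5.0) -> List[Tuple[str, str]]:
    """Return (protein name, peptide) pairs within tol_ppm of observed_mass."""
    hits = []
    for name, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            error_ppm = abs(peptide_mass(peptide) - observed_mass) / observed_mass * 1e6
            if error_ppm <= tol_ppm:
                hits.append((name, peptide))
    return hits
```

With a real proteome, many peptides share nearly identical masses, which is why the additional sequence information from tandem mass spectrometry, or very high mass accuracy, is needed to narrow a match to a single parent protein.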
Protein Domain Dissection and
Functional Identification
Temple F. Smith, Sophia Zarakhovich, and
Hongxian He
BioMolecular Engineering Research Center; College of
Engineering; Boston University; 36 Cummington Street;
Boston, MA 02215
617/353-7123,
tsmith@darwin.bu.edu
Using various multialignment and conserved pattern
tools (e.g., PSI-BLAST, BLOCKS, Pfam, and PIMA II),
protein domains as "evolutionary modules" generally can
be identified. Using a set of 20 completely sequenced
microbial genomes (including yeast), we have generated
over 1300 profiles representing diagnostic sequence
domains. The majority either cover the entire length of
the proteins matching the profile or locate a sequence
region clearly identifiable in multiple distinct domain
contexts. We are addressing the relationship between
such sequence domains and structural domains as well as
problems involved in associating these domains to a
given biochemical function and the cellular role played
by that function.
In collaboration with Julio Collado-Vides (CIFN,
Mexico), we are investigating the potential for
coordinate regulation among neighboring genes in
various biochemical pathways. We began with sets of
genes that are organized in operons in Escherichia coli
or in other bacteria or archaea. Next, each operon set
is being examined in yeast and Caenorhabditis
elegans for shared regulatory sequences. Initial
work led to the identification of two different types
of eukaryotic operon-equivalent organizations in yeast
and to our 1998 publication in Microbial and
Comparative Genomics.
Genome Sequencing
Carl R. Woese and Gary J. Olsen
Department of Microbiology; University of Illinois;
B103 Chemical and Life Sciences Laboratory; 601 S.
Goodwin Ave.; Urbana, IL 61801
carl@ninja.life.uiuc.edu,
gary@phylo.life.uiuc.edu
We prepared a sequencing-quality genomic DNA library
for Methanococcus maripaludis, an organism that
was being considered for sequencing as part of DOE's
Microbial Genome Program (MGP). We have done some
partial sequencing of clones from this library as part
of a project to use comparative analysis to elucidate
the differences between related high- and
low-temperature proteins (this sequencing was partially
supported by funding from the National Aeronautics and
Space Administration).
We also prepared a sequencing-quality genomic DNA
library for Giardia lamblia, a eukaryotic
microorganism. This permitted Mitchell Sogin (Marine
Biology Laboratory) to generate preliminary genome
sequencing data for a successful grant application to
the National Institutes of Health.
The sequence data resulting from our participation in
MGP have stimulated additional research by our group
and others. More specifically:
1. We continue to make new gene identifications through
comparative analyses of sequenced genomes.
2. We have experimentally verified the function of some
novel RNA methylase genes.
3. We have collaborated in the experimental
identification of a novel, archaeal S-adenosyl
methionine synthetase.
4. We have cloned and expressed RNA polymerase genes
and transcription-initiation factors from archaea and
have experimentally identified new protein-protein
interactions in the transcription apparatus.
5. We have supplied 27 research groups with genomic DNA
and cell mass from organisms sequenced as part of the
MGP.
6. We are contributing ideas formulated as part of an
MGP proposal to a successful collaboration with Carol
Giometti (Argonne National Laboratory) to study the
proteomes of Methanococcus jannaschii and
Pyrococcus furiosus.
7. We have worked with the research group of Ross
Overbeek (Argonne National Laboratory) on the
development of his WIT and WIT2 environments for genome
analysis and comparison and have used the WIT2 system
to help with our analyses.
A Pilot Study to Develop and
Demonstrate a High-Throughput New Approach to
Characterizing Total Cellular Proteins Expressed by
Deinococcus radiodurans R1
Kwong-Kwok Wong, Richard D. Smith, Ljiljana
Pasa-Tolic, and Owen White1
Pacific Northwest National Laboratory; P.O. Box 999;
Richland, WA 99352
509/376-5097, Fax: -6767,
kk.wong@pnl.gov
1The Institute for Genomic Research;
Rockville, MD 20850
www.tigr.org
Deinococcus radiodurans, with its exceptional
radiation resistance, was once thought to grow within
nuclear reactors, but further studies now suggest that
the deinococci are soil microorganisms. Besides its
resistance to radiation, D. radiodurans also has
extreme resistance to cellular and genetic damage that
occurs in other organisms after exposure to many
genotoxic chemicals, oxidative damage, high levels of
UV radiation, and desiccation. Thus, D.
radiodurans is a potential candidate to be
engineered for degradation of hazardous chemicals at
mixed-waste sites, and it is important to understand at
the molecular level how the bacterium can adapt to such
stressful environments. The Institute for Genomic
Research has completely sequenced the D.
radiodurans genome, enabling further functional
analysis of putative genes encoded by the bacterium.
In a pilot study, we have established a "2-D virtual
gel" method and demonstrated that this new methodology
is applicable to characterizing proteins expressed by
D. radiodurans. Although numerous facets of the
technology need significant refinement, we have
generated preliminary results that are a major step
beyond any "proteome" measurements made to date in
terms of speed and sensitivity. In a single capillary
isoelectric focusing (CIEF) separation with online
FTICR mass spectrometry, we have detected at least 800
different proteins (based on the number of discrete
molecular weight species above 5 kDa). This single
experiment (requiring less than 30 min) uses about 250
ng of total protein, about 20 to 30 times less than
that of a typical 2-D polyacrylamide gel
electrophoresis experiment. This corresponds to low
femtomole quantities for the average detected protein
(with some proteins being detected at levels well into
the attomole range). The potential exists to greatly
improve the methodology's sensitivity, thereby opening
up the detection of very low copy number regulatory
proteins.
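The sensitivity figures quoted above can be checked with a short back-of-the-envelope calculation. The abstract gives the total protein load (about 250 ng) and the number of proteins detected (at least 800); the mean protein molecular weight of 30 kDa used below is an assumption introduced only for this estimate, as is the simplification that the load is shared equally among detected proteins.

```python
def mean_femtomoles_per_protein(total_ng: float, n_proteins: int,
                                mean_mw_da: float) -> float:
    """Average amount per detected protein, in femtomoles.

    Assumes the total load is divided equally among the detected proteins
    and that every protein has the same (mean) molecular weight.
    """
    grams_each = total_ng * 1e-9 / n_proteins
    moles_each = grams_each / mean_mw_da
    return moles_each * 1e15


# 250 ng and 800 proteins come from the text; 30 kDa is an assumed mean weight.
average_fmol = mean_femtomoles_per_protein(250.0, 800, 30_000.0)

# A load "20 to 30 times" larger, as stated for a typical 2-D gel, in micrograms.
gel_load_ug = (250.0 * 20 / 1000.0, 250.0 * 30 / 1000.0)
```

Under these assumptions the average works out to roughly 10 fmol per protein, consistent with the "low femtomole" description, and the implied load for a typical 2-D gel is about 5 to 7.5 micrograms.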
Related to these efforts, we also have developed a
general targeted mutagenesis method based on D.
radiodurans genomic information to define gene
function. Using the targeted mutagenesis method, we
have shown that both catalase (katA) and
superoxide dismutase (sodA) genes are required
for extreme radiation resistance. We are applying the
2-D virtual gel method to analyze proteins expressed by
different mutants.
Characterization of expressed proteins by 2-D virtual
gel and further targeted mutagenesis analysis will
provide a link to the function of the genomic data's
predicted open reading frames (ORFs) and is expected to
identify new small genes in the size range at which
identifying ORFs is problematic. The resulting
information can identify genes of interest and
facilitate detailed biochemical and genetic experiments
to gain a global understanding of the organism for
energy and environmental and industrial applications.
The developed 2-D virtual gel method will be applicable
to any sequenced organism. This project was funded
initially as a pilot study for 2 years, but we expect
research to continue well beyond that period.