Figure 1.
Sequence discovery rates across various taxonomic groups.
(a) Discovery of 'distinct' sequences as a function of sampled bacterial genomes. Distinct
sequences are defined as those that do not share significant sequence similarity with
a sequence in a previously sampled genome. Each point represents the addition of a
new genome, ordered either by the number of sequences (largest first) or at random.
Two datasets are shown: one that considers all sequences, and one that considers only
sequences longer than 100 residues.
(b) Discovery of distinct sequences in fully sequenced eukaryotic genomes. Genome addition
was ordered by the number of sequences (largest first). Selected points are labeled
with the species added, to show how adding closely related species affects the local
gradient of the curve.
(c) Rate of distinct sequence discovery within various taxonomic groupings of eukaryotic
partial genomes. As before, each point represents the addition of a new partial genome
(largest first), and color indicates the taxonomic group sampled. Note that the
grouping Protista is historical: it has recently been shown to consist of several
paraphyletic taxa, many of which (including the species examined here) are considered
basal to the root of Eukarya [29]. The inset graph provides an expanded view.
(d) Rate of sequence discovery as a function of genomes sampled for both bacterial genomes
and eukaryotic partial genomes. Each point represents the mean and standard deviation
of the rate of distinct sequence discovery over a sliding window representing the
cumulative addition of 30 complete or partial genomes, obtained from 400 random orderings
of genome addition (see Materials and methods for more details). The six data series
include sequences from all bacterial and all partial genomes, bacterial sequences
> 100 residues in length, partial genome sequences > 300 bp in length and two 'restricted'
groups of bacterial sequences: those from a collection of genomes with only a single
(largest) representative from each species ('strains filtered'); and those from a
collection of genomes with only a single (again largest) representative from each
genus ('species filtered').
(e) Rate of gene family discovery for partial and bacterial genomes. Gene families include
singletons (families with only a single sequence representative) and were obtained
with reference to the COGENT database for bacteria, or determined through an equivalent
clustering procedure for partial genomes (see Materials and methods). As in (d),
each point represents the mean and standard deviation of the rate of gene family
discovery over a sliding window representing the cumulative addition of 30 complete
or partial genomes, obtained from 400 random orderings of genome addition (see Materials
and methods for more details). Also shown are the gene family discovery rates for
the two 'restricted' groups of bacterial sequences mentioned above.
Peregrín-Álvarez and Parkinson. Genome Biology 2007, 8:R238. doi:10.1186/gb-2007-8-11-r238
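The counting behind panels (a) and (d) can be illustrated with a minimal Python sketch. It is not the authors' pipeline, and it rests on several assumptions: each sequence is represented by a hypothetical similarity label standing in for the outcome of a similarity search (two sequences with the same label are treated as significantly similar), the function names and toy data are invented for illustration, and the "rate" in a window is taken to be the number of distinct sequences per genome added. The exact definition used in the paper is given in its Materials and methods.

```python
import random
from statistics import mean, stdev


def discovery_counts(genomes, order):
    """Count distinct sequences contributed by each genome, in the given order.

    A sequence is counted as distinct when its similarity label has not
    appeared in any previously sampled genome (assumption: labels stand in
    for significant sequence similarity).
    """
    seen = set()
    counts = []
    for name in order:
        labels = genomes[name]
        counts.append(sum(1 for label in labels if label not in seen))
        # Update only after the whole genome is counted, so sequences are
        # compared against earlier genomes, not against their own genome.
        seen.update(labels)
    return counts


def largest_first(genomes):
    """Order genome names by number of sequences, largest first."""
    return sorted(genomes, key=lambda name: len(genomes[name]), reverse=True)


def windowed_rate(genomes, window=30, orderings=400, seed=0):
    """Mean and standard deviation of the discovery rate per sliding window.

    For each random ordering, the rate in a window is the number of distinct
    sequences found per genome added (an assumed definition). Statistics are
    taken across orderings for each window position; orderings must be >= 2.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    per_ordering = []
    for _ in range(orderings):
        order = names[:]
        rng.shuffle(order)
        counts = discovery_counts(genomes, order)
        rates = [
            sum(counts[i:i + window]) / window
            for i in range(len(counts) - window + 1)
        ]
        per_ordering.append(rates)
    return [(mean(col), stdev(col)) for col in zip(*per_ordering)]


# Hypothetical toy data: genome name -> similarity labels of its sequences.
toy_genomes = {
    "A": ["f1", "f2", "f3", "f3"],
    "B": ["f1", "f2", "f4"],
    "C": ["f5"],
}
```

With the toy data, `discovery_counts(toy_genomes, largest_first(toy_genomes))` gives the per-genome counts for a largest-first ordering as in panel (a), and `windowed_rate(toy_genomes, window=2, orderings=10)` gives one (mean, standard deviation) pair per window position as in panel (d). The defaults of 30 genomes per window and 400 orderings follow the caption.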