Table 1

Data imported by PRESTA


                                 Human    Mouse

EPD
     Total entries                 276      200
     Imported by PRESTA*           214      167
     tcg total†‡                   139      109
     Present in GenBank/EMBL§        0        0
     Weak promoters                  4        3
     Confirmed by one EST#          99       64
     Confirmed by two ESTs#         82       56

GenBank
     Total entries¥              5,870    3,289
     After pre-filter              570      307
     tag                           484      313
     tcg total†‡                   291      208
     tcg non-redundant             241      192
     Not found in EMBL**           128       96

EMBL
     Total entries¥              6,314    2,251
     After pre-filter            1,051      274
     tag created                   820      222
     tcg total†‡                   571      150
     tcg non-redundant             425      145
     Not found in GenBank**        312       49

GenBank + EMBL
     tcg non-redundant             553      241
     Present in EPD§                 0        0
     Possibly misannotated          30       16
     Confirmed by one EST#         326      153
     Confirmed by two ESTs#        281      124

*EPD promoters are shown for comparison. Some EPD entries did not meet the PRESTA limit on downstream sequence length.
†Fraction of promoters successfully associated with ESTs.
‡Both 'tag' and 'tcg' are internal PRESTA formats: 'tag' stores the promoter sequences, and 'tcg' adds information about matching ESTs.
§No overlap between the GenBank/EMBL non-redundant set and PRESTA-imported EPD entries was found using pairwise SEQALN alignment of immediately downstream transcribed sequences. This is not an error: EMBL sequences linked from EPD were correctly dissected, as some of them are homologous to dozens of 5' EST ends. Even more surprisingly, there is no apparent overlap between PRESTA and the full human subdivisions of EPD. An EPD entry directly stores a 49-base-pair stretch of the immediately upstream region. The full set of these stretches was downloaded by a simple web agent and compared to an analogous set of PRESTA sequences using SEQALN.
There are no ESTs confirming the transcription start site and at least two 5' EST ends are longer than expected.
#The 5' end of at least one (or two) matching ESTs maps to the -5 to +30 region relative to the transcription start site. In addition, the ratio of positively mapping to overshooting 5' ends is larger than 1:3. The current PRESTA version neglects the possibility that the library was amplified and that two or more ESTs actually originate from the same cDNA clone.
¥A sample query: (([genbank-Division:rod] & (([genbank-Organism:Mus*] & [genbank-Organism:musculus*]) | [genbank-Organism:Mus musculus*])) & ((((([genbank-FtKey:5\' utr] | [genbank-FtKey:precursor_rna]) | [genbank-FtKey:prim_transcript]) | [genbank-FtKey:promoter]) | [genbank-FtKey:tata_signal]) > parent)).
**Not recovered by an equivalent query. This reflects different feature annotation rather than incomplete synchronization between the two major sequence databases.
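
The EST confirmation criteria given in the footnotes can be illustrated with a short Python sketch. This is not PRESTA code. The function names are hypothetical, and several details are assumptions made only for illustration: 5' end positions are given as integers relative to the transcription start site, the -5 to +30 window is treated as inclusive, and an "overshooting" 5' end is taken to be one lying upstream of -5.

```python
from typing import Iterable, Tuple

# Window from the table footnote; treating it as inclusive is an assumption.
WINDOW_START = -5
WINDOW_END = 30


def count_five_prime_ends(ends: Iterable[int]) -> Tuple[int, int]:
    """Return (positive, overshooting) counts for EST 5' end positions.

    Positions are relative to the annotated transcription start site.
    Assumption: "overshooting" means the 5' end lies upstream of the window.
    """
    ends = list(ends)
    positive = sum(1 for e in ends if WINDOW_START <= e <= WINDOW_END)
    overshooting = sum(1 for e in ends if e < WINDOW_START)
    return positive, overshooting


def is_confirmed(ends: Iterable[int], min_ests: int = 1) -> bool:
    """Hypothetical check for 'confirmed by one (or two) ESTs'."""
    positive, overshooting = count_five_prime_ends(ends)
    if positive < min_ests:
        return False
    # positive : overshooting > 1 : 3  is equivalent to  3 * positive > overshooting
    return 3 * positive > overshooting


def lacks_confirmation_with_overshoot(ends: Iterable[int]) -> bool:
    """Hypothetical check: no confirming ESTs and at least two overshooting ends."""
    positive, overshooting = count_five_prime_ends(ends)
    return positive == 0 and overshooting >= 2
```

For example, under these assumptions a promoter with 5' ends at 0, +12 and -40 would count as confirmed by two ESTs, whereas one with a single end at +10 and three ends upstream of -5 would fail the 1:3 ratio test. As the footnote notes for PRESTA itself, this sketch does not account for several ESTs originating from the same cDNA clone.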

Mach Genome Biology 2002 3:research0050.1   doi:10.1186/gb-2002-3-9-research0050
