Table 1

Identification of exons on the genome

| Category | Database | Total records | Percent placed (%) | Total unique exons | Exons in complete ORFs | Exons in partial ORFs | Exon length (bp) | ORF length (bp) | Putative genes (non-splicing singletons) | Protein homology (Pfam hits) | CpG islands |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Known genes | UTR-DB | 40,258 | 80 | 19,195 | 5,075 | 1,895 | 6,925,762 | 1,990,818 | 10,007 (426) | 5,701 (3,813) | 3,866 |
| | HTDB | 15,305 | 89 | 48,477 | 12,077 | 7,706 | 11,893,081 | 4,043,544 | 4,816 (148) | 2,938 (1,943) | 1,960 |
| Consensus transcripts | HINT | 87,125 | 77 | 103,817 | 47,055 | 15,061 | 23,381,024 | 10,144,988 | 20,357 (959) | 9,121 (6,453) | 7,557 |
| | EG | 62,064 | 80 | 13,085 | 5,389 | 1,904 | 4,562,954 | 1,873,723 | 4,800 (154) | 2,177 (1,679) | 2,462 |
| | THC | 84,837 | 81 | 38,806 | 15,463 | 6,671 | 12,406,081 | 5,078,661 | 8,604 (322) | 2,907 (2,026) | 3,983 |
| Transcripts | GenBank CDS | 110,222 | 81 | 41,917 | 31,626 | 1,452 | 5,303,064 | 4,299,272 | 2,634 (227) | 1,858 (1,607) | 1,178 |
| | dbEST Human | 2,154,995 | 73 | 273,881 | 147,819 | 17,694 | 32,288,385 | 14,975,758 | 20,073 (7,136) | 5,377 (3,745) | 11,807 |
| Rodent transcripts | MINT | 92,531 | 30 | 8,284 | 5,433 | 120 | 866,046 | 780,566 | 777 | 123 (56) | 486 |
| | RINT | 37,367 | 46 | 5,600 | 3,588 | 75 | 592,788 | 546,932 | 458 | 65 (32) | 255 |
| | EMBL | 43,488 | 28 | 5,819 | 4,108 | 59 | 724,630 | 655,993 | 202 | 68 (72) | 135 |
| Protein homology | SWISS-PROT | 86,593 | 38 | 27,526 | 12,072 | 1,163 | 9,858,797 | 7,784,205 | 1,648 | 1,648 (1,244) | 158 |
| | TrEMBL | 351,834 | 13 | 22,670 | 8,134 | 1,677 | 4,385,497 | 2,886,034 | 1,185 | 1,185 (654) | 92 |
| | PIR | 182,106 | 16 | 4,106 | 1,175 | 383 | 1,355,644 | 764,339 | 321 | 321 (132) | 20 |
| Total | | | | 613,183 | 299,014 | 55,860 | 114,543,753 | 55,824,833 | 75,982 (9,372) | 33,489 (23,008) | 33,959 |

Exons were identified after vector screening using transcript, rodent, and protein databases. The definition of a record varies according to the database, while 'exons' refers to high-scoring segment pairs in BlastN comparisons to the genome (E < 10^-15 and sequence identity >90%). Unique exons and all subsequent columns refer to placements that were possible after considering the preceding databases. Placement of rodent transcripts required evidence of splicing and sequence identity >80%. ORFs were identified with getorf [84], with a minimum reported size of 30 bp. Protein homology required BlastX E < 10^-15. Pfam hits required a score >20 using hmmpfam [92]. Gene prediction programs are described in Table 2. CpG islands were identified with cpgreport [84] using standard criteria [45].
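As an illustration of the exon criterion in the footnote, the following is a minimal sketch of filtering high-scoring segment pairs by E-value and sequence identity. It is not the authors' pipeline. It assumes hits are available as 12-column tab-separated BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The function name `filter_hsps`, the dictionary keys, and the sample rows are hypothetical; only the two thresholds come from the footnote.

```python
import csv
import io

# Thresholds quoted in the table footnote; all other names are illustrative.
MAX_EVALUE = 1e-15
MIN_IDENTITY = 90.0  # percent; the footnote gives >80% for rodent transcripts


def filter_hsps(tabular_text, max_evalue=MAX_EVALUE, min_identity=MIN_IDENTITY):
    """Return HSPs from 12-column tabular BLAST text that pass both cut-offs."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        pident = float(row[2])   # column 3: percent identity
        evalue = float(row[10])  # column 11: E-value
        if evalue < max_evalue and pident > min_identity:
            kept.append({
                "query": row[0],
                "subject": row[1],
                "identity": pident,
                "subject_start": int(row[8]),
                "subject_end": int(row[9]),
                "evalue": evalue,
            })
    return kept


# Made-up example rows, not data from the paper.
SAMPLE = (
    "est1\tchr1\t98.50\t200\t3\t0\t1\t200\t1000\t1199\t1e-100\t350\n"
    "est2\tchr1\t85.00\t150\t22\t1\t1\t150\t5000\t5149\t1e-30\t120\n"
    "est3\tchr1\t99.00\t40\t0\t0\t1\t40\t9000\t9039\t1e-10\t60\n"
)

if __name__ == "__main__":
    for hsp in filter_hsps(SAMPLE):
        print(hsp["query"], hsp["subject_start"], hsp["subject_end"])
```

With the default thresholds only the first sample row passes; calling `filter_hsps(SAMPLE, min_identity=80.0)` applies the looser identity cut-off stated for rodent transcripts and also keeps the second row. The additional requirement of splicing evidence for rodent transcripts is not modelled here.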

Wright et al. Genome Biology 2001 2:research0025.1   doi:10.1186/gb-2001-2-7-research0025