Table 3

cDNA analysis


Clone category                                              DGCr1    DGCr2    Total
Clones that encode complete ORFs
     ORFs identical to the Release 3 predicted proteins*    3,429    1,946    5,375
     ORFs with 1-2% differences to Release 3 proteins†        235      306      541
Total                                                       3,664    2,252    5,916
Clones known to be compromised‡
     Nucleotide discrepancies                                 485      829    1,314
     5' short                                                 618      150      768
     3' truncated                                              57       26       83
     Co-ligated inserts                                        23       54       77
     ORFs with less than 50 amino acids                        49       21       70
     Antisense transcripts                                     53       58      111
     Transposable elements                                     12        9       21
     Bacterial contaminants                                     2        4        6
Total                                                       1,299    1,151    2,450
Clones that may represent alternative transcripts§
     5' short with upstream in-frame stop codon                32        4       36
     3' truncated with downstream in-frame stop codon          55       17       72
     Putative missed micro-exon in Release 3 annotation        23        7       30
Total                                                         110       28      138
Unclassified clones¶                                          257      160      417

Summary of analysis of the 8,770 clones in GenBank plus 151 clones for which we do not yet have accession numbers.

*The ORF predicted from the cDNA sequence is identical to the corresponding Release 3 predicted protein; 4,620 of these clones are from the LD, GH, HL, LP, RE or RH cDNA libraries, which were made from the same strain that was sequenced. Thus, we required their ORFs to be identical to those of the predicted Release 3 proteins. An additional 755 clones with ORFs identical to Release 3 proteins are from the AT, GM or SD libraries.

†The ORF predicted from the cDNA sequence is the same length as the Release 3 predicted protein with less than 2% amino-acid difference. These clones are derived from the AT, GM or SD cDNA libraries, which were made from strains or cell lines that are not isogenic with the strain that was sequenced.

‡See text for explanation of the individual subclasses of compromised clones.

§These clones have structures that are inconsistent with the corresponding Release 3 predicted gene. The 5'-short and 3'-truncated clones may reflect alternative splice products or promoters or, perhaps more likely, incompletely processed primary transcripts with retained introns. Additional experimental work will be required to distinguish these possibilities. Those clones referred to as putative missed micro-exons in Release 3 annotations are cases in which the cDNA clone contains additional nucleotides that are a multiple of 3, relative to the Release 3 predicted mRNA, and maintains the ORF. We expect that most of these discrepancies result from a failure of Sim4 to align micro-exons and that these cases will be resolved by modifying the Release 3 gene model; see [15] for more discussion.

¶The predicted ORF from the cDNA clone does not match a Release 3 predicted protein, but the underlying cause could not be classified into one of the above categories. We expect that very few of these clones accurately reflect actual gene transcripts.
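The counts in Table 3 can be cross-checked arithmetically: each row's DGCr1 and DGCr2 values should sum to its Total, each category's rows should sum to the category total, and the category totals plus the unclassified clones should equal the 8,770 + 151 clones described in the table note. The following Python sketch performs that check on values transcribed from the table. The dictionary layout and the function name `check_table` are illustrative assumptions and are not part of the original analysis.

```python
# Counts transcribed from Table 3 as (DGCr1, DGCr2, Total).
# The data layout and names below are hypothetical, chosen for illustration.
TABLE3 = {
    "complete ORFs": {
        "rows": {
            "identical to Release 3": (3429, 1946, 5375),
            "1-2% differences": (235, 306, 541),
        },
        "total": (3664, 2252, 5916),
    },
    "compromised": {
        "rows": {
            "nucleotide discrepancies": (485, 829, 1314),
            "5' short": (618, 150, 768),
            "3' truncated": (57, 26, 83),
            "co-ligated inserts": (23, 54, 77),
            "ORFs < 50 amino acids": (49, 21, 70),
            "antisense transcripts": (53, 58, 111),
            "transposable elements": (12, 9, 21),
            "bacterial contaminants": (2, 4, 6),
        },
        "total": (1299, 1151, 2450),
    },
    "possible alternative transcripts": {
        "rows": {
            "5' short, upstream stop": (32, 4, 36),
            "3' truncated, downstream stop": (55, 17, 72),
            "putative missed micro-exon": (23, 7, 30),
        },
        "total": (110, 28, 138),
    },
}
UNCLASSIFIED = (257, 160, 417)


def check_table(groups, unclassified, expected_grand_total):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for name, group in groups.items():
        # Each row: DGCr1 + DGCr2 must equal the row total.
        for label, (r1, r2, total) in group["rows"].items():
            if r1 + r2 != total:
                problems.append(f"{name} / {label}: {r1} + {r2} != {total}")
        # Each column of the category must sum to the category total.
        column_sums = tuple(sum(col) for col in zip(*group["rows"].values()))
        if column_sums != group["total"]:
            problems.append(f"{name}: columns sum to {column_sums}, "
                            f"table says {group['total']}")
    if unclassified[0] + unclassified[1] != unclassified[2]:
        problems.append("unclassified: row does not sum to its total")
    # Category totals plus unclassified clones must give the clone count.
    grand = sum(g["total"][2] for g in groups.values()) + unclassified[2]
    if grand != expected_grand_total:
        problems.append(f"grand total {grand} != {expected_grand_total}")
    return problems


problems = check_table(TABLE3, UNCLASSIFIED, 8770 + 151)
print(problems or "all totals consistent")
```

With the values as transcribed, the check reports no inconsistencies; the same pattern would flag a mistyped cell if the table were re-entered by hand.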

Stapleton et al. Genome Biology 2002 3:research0080.1   doi:10.1186/gb-2002-3-12-research0080