Table 3

cDNA analysis

| Category | DGCr1 | DGCr2 | Total |
|---|---:|---:|---:|
| Clones that encode complete ORFs | | | |
| ORFs identical to the Release 3 predicted proteins* | 3,429 | 1,946 | 5,375 |
| ORFs with 1-2% differences to Release 3 proteins | 235 | 306 | 541 |
| Total | 3,664 | 2,252 | 5,916 |
| Clones known to be compromised | | | |
| Nucleotide discrepancies | 485 | 829 | 1,314 |
| 5' short | 618 | 150 | 768 |
| 3' truncated | 57 | 26 | 83 |
| Co-ligated inserts | 23 | 54 | 77 |
| ORFs with less than 50 amino acids | 49 | 21 | 70 |
| Antisense transcripts | 53 | 58 | 111 |
| Transposable elements | 12 | 9 | 21 |
| Bacterial contaminants | 2 | 4 | 6 |
| Total | 1,299 | 1,151 | 2,450 |
| Clones that may represent alternative transcripts§ | | | |
| 5' short with upstream in-frame stop codon | 32 | 4 | 36 |
| 3' truncated with downstream in-frame stop codon | 55 | 17 | 72 |
| Putative missed micro-exon in Release 3 annotation | 23 | 7 | 30 |
| Total | 110 | 28 | 138 |
| Unclassified clones | 257 | 160 | 417 |


Summary of the analysis of the 8,770 clones in GenBank plus 151 clones for which we do not yet have accession numbers.

*The ORF predicted from the cDNA sequence is identical to the corresponding Release 3 predicted protein. Of these clones, 4,620 are from the LD, GH, HL, LP, RE or RH cDNA libraries, which were made from the same strain that was sequenced; we therefore required their ORFs to be identical to those of the predicted Release 3 proteins. An additional 755 clones with ORFs identical to Release 3 proteins are from the AT, GM or SD libraries.

ORFs with 1-2% differences to Release 3 proteins: the ORF predicted from the cDNA sequence is the same length as the Release 3 predicted protein, with less than 2% amino-acid difference. These clones are derived from the AT, GM or SD cDNA libraries, which were made from strains or cell lines that are not isogenic with the strain that was sequenced.

Clones known to be compromised: see text for an explanation of the individual subclasses.

§These clones have structures that are inconsistent with the corresponding Release 3 predicted gene. The 5'-short and 3'-truncated clones may reflect alternative splice products or promoters or, perhaps more likely, incompletely processed primary transcripts with retained introns. Additional experimental work will be required to distinguish between these possibilities. The clones referred to as putative missed micro-exons in the Release 3 annotation are cases in which the cDNA clone contains additional nucleotides, in a multiple of 3, relative to the Release 3 predicted mRNA and maintains the ORF. We expect that most of these discrepancies result from a failure of Sim4 to align micro-exons and that these cases will be resolved by modifying the Release 3 gene model; see [15] for more discussion.

Unclassified clones: the predicted ORF from the cDNA clone does not match a Release 3 predicted protein, but the underlying cause could not be classified into one of the above categories. We expect that very few of these clones accurately reflect actual gene transcripts.
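The criteria stated in the notes above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, the `isogenic` flag, the category labels and the position-by-position comparison are assumptions introduced here for illustration only. The sketch treats an ORF as identical when the two protein sequences match exactly, accepts a same-length ORF with less than 2% amino-acid difference only for clones from non-isogenic libraries, and flags an insertion as a candidate missed micro-exon when the extra nucleotides are a multiple of 3.

```python
def classify_orf(cdna_protein: str, predicted_protein: str, isogenic: bool) -> str:
    """Hypothetical helper: bin a cDNA-derived ORF against a predicted protein.

    isogenic is True when the library was made from the sequenced strain,
    in which case only an exact match is accepted.
    """
    if cdna_protein == predicted_protein:
        return "identical"
    # Non-isogenic libraries: same length and less than 2% amino-acid difference.
    if not isogenic and len(cdna_protein) == len(predicted_protein):
        mismatches = sum(a != b for a, b in zip(cdna_protein, predicted_protein))
        if mismatches / len(predicted_protein) < 0.02:
            return "less_than_2pct_difference"
    return "other"


def is_in_frame_insertion(cdna_mrna_len: int, predicted_mrna_len: int) -> bool:
    """Hypothetical check: extra nucleotides in the cDNA are a multiple of 3.

    This tests only the length condition; whether the ORF is actually
    maintained would need to be confirmed separately.
    """
    extra = cdna_mrna_len - predicted_mrna_len
    return extra > 0 and extra % 3 == 0
```

For example, under these assumptions a 100-residue ORF with a single substitution falls in the second category only when the clone comes from a non-isogenic library, while the same ORF from an isogenic library is left for further classification.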

Stapleton et al. Genome Biology 2002 3:research0080.1-0080.8   doi:10.1186/gb-2002-3-12-research0080
