A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages.
We have generated high-quality full-insert sequence for 8,921 clones in the DGC. We compared the sequence of these clones to the annotated Release 3 genomic sequence, and identified more than 5,300 cDNAs that contain a complete and accurate protein-coding sequence. This corresponds to at least one splice form for 40% of the predicted D. melanogaster genes. We also identified potential new cases of RNA editing.
We show that comparison of cDNA sequences to a high-quality annotated genomic sequence is an effective approach to identifying and eliminating defective clones from a cDNA collection and ensure its utility for experimentation. Clones were eliminated either because they carry single nucleotide discrepancies, which most probably result from reverse transcriptase errors, or because they are truncated and contain only part of the protein-coding sequence.
One of the goals of the Berkeley Drosophila Genome Project is to define experimentally the transcribed portions of the genome by producing a collection of fully sequenced cDNAs. We have previously reported the construction of cDNA libraries from a variety of tissues and developmental stages; these libraries were used to generate over 250,000 expressed sequence tags (ESTs), corresponding to approximately 70% of the predicted protein-coding genes in the Drosophila melanogaster genome [1,2]. We used computational analysis of these ESTs to establish a collection of putative full-length cDNA clones, the Drosophila Gene Collection (DGC) [1,2]. Here, we describe the process by which we sequenced the full inserts of 8,921 cDNA clones from the DGC, describe the methods by which we assess each clone's likelihood of containing a complete and accurate protein-coding region, and illustrate how these data can be used to uncover additional cases of RNA editing. We have confirmed the identification of 5,375 cDNA clones that can be used with confidence for protein expression or genetic complementation.
Results and discussion
Current approaches to full-insert sequencing of cDNA clones include concatenated cDNA sequencing , primer walking , and strategies using transposon insertion to create priming sites [5,6,7,8,9]. We adopted a cDNA sequencing strategy that relies on an in vitro transposon insertion system based on the MuA transposase, combined with primer walking (see Materials and methods for details).
The production of full-insert sequences from DGC cDNAs is summarized in Tables 1 and 2. For DGCr1, clones were sized before sequencing. Small clones (< 1.4 kilobases (kb)) were sequenced with custom primers and larger clones were sequenced using either mapped or unmapped transposon insertions. For DGCr2, clones were not sized and a set of unmapped transposon insertions was sequenced to generate an average of 5× sequence coverage. For both DGCr1 and r2, custom oligonucleotide primers designed using Autofinish  were used to bring the sequences to high quality. To date, we have completed sequencing 93% of the DGCr1 clone set and 80% of the DGCr2 clone set. The strategy used for sequencing DGCr1 clones appears to be more efficient, because on average they required fewer sequencing reads than DGCr2 clones. However, we were able to reduce cycle time and increase throughput using the shotgun strategy adopted for sequencing the DGCr2 clones. The average insert size of the 8,770 high-quality cDNA sequences that have been submitted to GenBank is 2 kb and they total 17.5 megabases (Mb) of sequence. The largest clone (SD01389) is 8.7 kb and is derived from a gene (CG10011) that encodes a 2,119-amino-acid ankyrin repeat-containing protein.
Evaluating the coding potential of each cDNA on the basis of its full-insert sequence
For many potential uses in proteomics and functional genomics [11,12,13], it is important to establish cDNA collections comprised only of cDNAs with complete and uncorrupted open reading frames (ORFs). To determine which of our sequenced clones meet this standard, we compared them to the annotated Release 3 genome sequence [14,15] using a combination of BLAST  and Sim4  alignments (see Materials and methods for details).
We grouped the cDNAs into four categories (Table 3). The first category contains a total of 5,916 cDNA clones, or 68% of the sequenced clones. We are confident that 5,375 of these clones contain a complete and accurate ORF, as they precisely match the Release 3 predicted protein for the corresponding gene. An additional 541 clones are from the SD, GM and AT libraries, which were generated from fly strains that are not isogenic with the strain used to produce the genome sequence. The predicted ORFs from clones from these libraries were required to be identical in length to the Release 3 predicted protein with less than 2% amino-acid difference to be placed in this category. We cannot at present distinguish whether these differences result from strain polymorphisms or reverse transcriptase (RT) errors. However, our own internal estimates of RT errors (see below), based on the observed nucleotide substitution rate in cDNAs derived from the same strain as the genomic sequence, and published estimates of strain polymorphisms  lead us to believe that the majority of these changes are the result of strain polymorphism.
Table 3. cDNA analysis
The second category represents 2,450 clones that are known to be compromised in one of a number of ways. The sequences of the largest class of compromised clones (1,314) align to the Release 3 predicted transcripts, but have nucleotide discrepancies that are most likely the result of errors generated by RT during library construction. These include missense and frameshift (+/-1 or +/-2 nucleotide difference) changes in the predicted ORF relative to the Release 3 predicted protein. Clones placed in this class can show up to 2% amino acid differences from the Release 3 peptide for isogenic libraries, and up to 4% difference for non-isogenic libraries. We estimated the error rate of an RNAseH-deficient RT (SuperScriptII, Invitrogen, Carlsbad, CA) by comparing the nucleotide sequence of cDNAs from isogenic libraries to the genomic sequence. For the GH, HL, LD, and LP libraries , we observed an error rate of 1 in 4,000; for the RE and RH libraries , we observed an error rate of 1 in 1,000. This difference is likely due to the different RT reaction conditions used in these two library construction protocols [1,2]. Although these numbers are higher than the 1 in 15,000 figure reported for SuperScriptII (Taurai Nenguke, personal communication), the in vitro assay used to obtain this error rate is based on assaying a single site for mutations that revert an amber codon.
The next largest class of compromised clones (768) consists of clones apparently truncated at their 5' ends, as judged by comparison to the Release 3 predicted ORFs of the corresponding genes. The 768 5'-short clones represent 757 distinct Release 3 annotated transcripts. For 151 of the 5'-short clones, 143 from DGCr1 and eight from DGCr2, we were able to identify clones with longer ORFs by additional EST sequencing. The remaining 606 clones are assumed to be 5' short because they do not possess a 5' in-frame stop codon and the corresponding annotated ORF in Release 3 extends further 5'. This class of clones represents approximately 9% of all finished clones, consistent with our original estimates that 80-94% of the DGC clones would contain the full ORF [1,2].
The remaining six classes of compromised clones consist of a total of 368 cDNAs (4% of all finished clones, see Table 3). Eighty-three clones encode ORFs that are truncated at their carboxy-termini and are most likely the result of priming from internal poly(A) tracts. Seventy-seven clones contain two unrelated ORFs and are almost certainly the result of two cDNAs being cloned into the same plasmid vector during library construction. Seventy clones contain ORFs of less than 50 amino acids. One hundred and eleven clones overlap a Release 3 predicted gene but are transcribed from the opposite strand from that of the mRNA encoding the Release 3 predicted protein and are considered anti-sense transcripts; a number of such cases were documented in the reannotation of the genome  and have been reported in many organisms . Twenty-one clones correspond to transcripts of transposable elements on the basis of their sequence similarity to identified Drosophila transposons . Finally, six clones contain a bacterial transposable element (Tn10, IS1 or IS2) that most likely inserted into the clone during propagation in Escherichia coli (bacterial contaminants).
The third and fourth categories consist of clones that may represent alternative transcripts (138) and clones that are currently computationally unclassified (417), respectively. The summary of the analysis of these clones is described in Table 3.
Improving the Drosophila cDNA resource
We have identified and sequenced cDNA clones that contain a complete and accurate ORF for 40% of all predicted Drosophila genes. We plan on extending this project in two ways. First, we intend to increase the number of genes represented in this set of fully vetted cDNA clones using a combination of experimental approaches. We can use site-directed mutagenesis to correct clones that carry single nucleotide changes or other small, localized defects. For the majority of the compromised clones, we have candidate replacement clones available that were identified as part of our EST sequencing and analysis efforts . Generation of the Release 3 annotation of the genome made extensive use of our full-insert sequence data . In the course of that effort, human curators identified a total of 2,013 clones that have become the DGCr3. The DGCr3 currently includes 309 clones chosen to replace clones with truncated ORFs, 543 clones for genes that are not currently represented in the DGC, and 833 clones that represent alternative splicing forms. To identify cDNAs for the remaining genes, we plan on using a combination of additional EST sequencing, reverse transcriptase PCR (RT-PCR) and cDNA library screening. Second, we plan on transferring ORFs to a universal cloning system (see [21,22] for examples) in order to generate a standard reagent for proteomics and other functional genomic experiments. In collaboration with Orbigen , we have already generated 72 baculovirus expression clones from a set of Gateway (Invitrogen, Carlsbad, CA) clones encoding transcription factors.
RNA editing is a well-documented mechanism of generating nucleotide diversity beyond that directly encoded by the genome. Adenosine deaminase (ADAR) targets double-stranded regions of nuclear-encoded RNAs, catalyzing the deamination of adenosine (A) to inosine (I) . Inosine mimics guanosine (G) in its base-pairing properties, and the translational machinery of the cell interprets I as G. In this way, an A-to-I conversion in the mRNA can alter the genetic information and, consequently, protein function. Null mutations in the single ADAR gene in Drosophila (dADAR) suggest that the function of pre-mRNA editing is to modify adult behavior by altering signaling components in the nervous system [25,26]. Among the mRNAs known to be edited in Drosophila are those encoded by cacophony (a calcium channel gene) , paralytic (a sodium channel gene)  and GluCla (a chloride channel gene) , all of which have multiple editing sites in their coding sequences.
In the course of evaluating the quality of the DGCr1 and DGCr2 cDNAs, described above, we compared their translation products to those of the recently completed Release 3 genomic sequence. Such comparisons should reveal cases of RNA editing. In cases in which the predicted protein sequences disagreed, we examined the corresponding nucleotide sequences in search of site-specific A-to-G variation between cDNA and genomic sequences. We identified over 30 candidates consistent with RNA editing; however, additional cDNA or EST data will be required to distinguish RNA editing from RT errors or strain polymorphisms. In a few cases we had enough cDNA and EST data to indicate that RNA editing is the most likely explanation for the observed variation. One such example is shown in Figure 1. The gene CG18314 encodes a G-protein-coupled receptor of the rhodopsin family, containing a seven-transmembrane protein domain with similarity to β2-adrenergic receptors of mouse and human [30,31]. Ten potential sites of RNA editing were revealed by comparison of the genomic sequence with those of two cDNAs and three ESTs. We validated these 10 sites by gene-specific RT-PCR using RNA isolated from heads of isogenic animals and identified 15 new sites (see legend to Figure 1). We are now in the process of a more rigorous and thorough analysis of potential RNA-editing targets.
Figure 1. A putative example of RNA editing as revealed by comparison of cDNA and genomic DNA sequences. (a) Gene models for CG18314 based on sequence of two DGCr1 full-length cDNA clones (GH15292.c, GH08370.c) that differ at their 5' and 3' termini. Although the cDNAs have alternative 5' and 3' UTRs and are alternatively spliced, they share the same protein-coding potential (shown in blue). CG18314 encodes a G-protein-coupled receptor of the rhodopsin family, containing a seven-transmembrane protein domain (7tm_1; the red bar shows the extent of the domain) with similarity to β2-adrenergic receptors of mouse (X15643, E value = 9e-23) and human (M15169, E = 8e-22). Shown hatched is a 310-bp portion of cDNA sequences with A-to-G nucleotide variation. (b) Sequence alignments of this 310-bp portion of genomic sequence, two cDNA and three EST sequences (GH14918, GH14553, HL02270). Shown in yellow are codons with A-to-G nucleotide variation. Above the genomic nucleotide sequence is its translated amino-acid sequence starting at amino acid 224 of the protein. Comparing the cDNA nucleotide sequence to the genomic sequence identifies 10 A-to-G nucleotide variations. Two are silent, seven result in amino-acid changes, and one alters the stop codon, allowing two additional amino acids to be encoded. The amino acids that are affected are shown below the nucleotide sequence (red letters in a gray circle). Two of the amino-acid changes (N224S and S229G) map to the conserved seven-transmembrane protein domain. The Anopheles gambiae genomic draft contains sequence encoding this protein (gi|21299606|gb|EAA11751.1| (AAAB01008960) agCP5433) which is highly conserved at the amino-acid sequence level (E = e-168) and also encodes N and S at these sites. To sample additional transcripts of this gene, we performed gene-specific RT-PCR to amplify the region shown in (b). From a total of 64 independent transcripts we confirmed the 10 cases of editing diagrammed above, and identified 15 new sites of A-to-G nucleotide variations. A list of these putative editing sites showing the resulting amino-acid change and the number of times this change was observed, given in parentheses, is as follows: N224D (2), N224S (12), L225L (9), N227S (1), S229G (9), H230R (1), M231V (1), L236L (16), A239A (1), P246P (2) E254G (1), I272I (1), I275M (1), I281V (1), S286G (1), K306R (16), K308R (5), K308G (8), Q312Q (1), A313A (1), L315L (31), I316V (52), *323W (44) and S324G (4).
Materials and methods
The Drosophila Gene Collection (DGC) consists of two releases, DGCr1 and DGCr2. A process flow diagram of our sequencing strategies is available online  and is summarized below. The clones in DGCr1 were arrayed by insert size  and sequenced accordingly; clones in DGCr2 were not arrayed by size. DGCr1 clones less than 1.4 kb were assembled using phrap  and analyzed with custom scripts to determine whether they were complete. Autofinish (part of the consed computer software package) was used to automatically design custom primers  for clones that needed quality improvement. Clones that did not finish in the first two rounds of Autofinish were sent to a manual finishing queue for more sophisticated finishing. cDNA clones larger than 1.4 kb were divided into three groups: 1.4 to 3 kb, 3 to 4.5 kb, and greater than 4.5 kb. All clones were sequenced using the in vitro Template Generation System (TGStm Finnzyme). Clones 3 to 4.5 kb in size, were sequenced using a minimal path of transposon-bearing clones. Clones, 1.4 to 3 kb and those greater than 4.5 kb, were sequenced with 24 and 48 unmapped transposon-bearing clones, respectively. After the initial cycle of transposon sequencing, the clones were analyzed using in-house scripts and Autofinish to determine their state of completeness and quality. DGCr2 clones were sequenced using 24 unmapped transposon-bearing clones. After an initial cycle of transposon sequencing, the clones were analyzed for completeness and quality as described above for DGCr1 clones, using in-house scripts and Autofinish. DGCr2 clone sequences were screened for transposable element sequences, cases of co-ligation, and presence of a poly(A) tail before any finishing work was ordered.
In vitro transposition and mapping insertion sites
Transposon insertion reactions were carried out in 96-well format using the Template Generation System (TGStm) according to the manufacturer's recommendations (Finnzyme). Transposon reactions consisted of 1 μl (50-150 ng) plasmid DNA isolated from Qiagen or Revprep DNA isolation robots, 1.6 μl 5× reaction buffer, 8 ng Entranceposon (KanR), 0.4 μl MuA transposase, and deionized water to bring the final volume to 8 μl. Reactions were carried out in PCR plates and incubated in an ABI thermocycler according to the manufacturer's instructions. After heat inactivation of the MuA transposase, 2 μl of the reaction were used to transform 17 μl of DH5α chemically competent cells (Invitrogen) in 96-well format. Following incubation at 37°C for 1 h in 183 μl SOC medium, cells were plated onto appropriate medium selecting for vector and Entranceposon antibiotic resistance. Plates were incubated at 37°C overnight. Colonies were picked into 1.2-ml polypropylene titer tubes (E&K Scientific) containing 0.5 ml LB medium supplemented with 7.5% glycerol and the appropriate antibiotics and incubated at 37°C overnight. These stocks were then used to inoculate 1.2 ml 2XYT medium in 96-well square deep-well plates (E&K Scientific) for culture and DNA plasmid preps. Transposon insertion sites were mapped relative to the vector ends by PCR essentially as described . Forty-eight transposon-bearing clones were picked for PCR mapping using the Mu-End primer (present at both ends of the tranposon) in combination with vector-specific primers, resulting in 96 PCR products. Agarose gels were imaged using custom software developed in-house (Earl Cornell, LBNL) and analyzed using an algorithm, Supertramp [35,36], to identify a minimal path of transposon-bearing clones to be re-arrayed and sequenced.
Purified plasmid DNA from transposon-bearing clones was sequenced using 2 μl ABI BigDye II Dye terminator mix (Applied Biosystems) in a 10-μl reaction. Sequencing reactions were processed through 96-well Sephadex G-50 SF plates (Multiscreen filter plates; Millipore) and loaded onto ABI Prism 3700 DNA Analyzer. Sequencing primers specific for each end of the Entranceposon were used in the reactions (5'-ATCAGCGGCCGCGATCC-3' and 5'-TTATTCGGTCGAAAAGGATCC-3'). Sequencing of 5' and 3' cDNA ends was carried out as previously described . The sequencing reported here was carried out over a 2-year period during which we made several major modifications to the strategy; for example, switching from sequencing mapped transposon insertions to random transposons. These changes improved throughput and cycle time, but made the process less efficient in terms of the required number of sequencing reads. Because of these changes, it is not possible to give a meaningful single efficiency estimate; however, our overall efficiency is comparable to other efforts using a similar strategy [8,9].
Data processing and assembly
cDNA clone data management relied on custom scripts and an Informix database. Sequences were processed using phred [37,38] and assembled using phrap . 5' and 3' EST end-reads were combined with the transposon-based reads to generate cDNA clone assemblies. We adopted the sequence quality-control standards defined for the Mammalian Gene Collection project . Custom scripts evaluated assemblies for: 5' and 3' EST reads in a single contig in the proper orientation; at least 10 bases of 3' poly(A) tail; phrap estimated error rate of less than one in 50,000 bases; and individual base quality of at least q25. Double-stranded coverage was not a criterion for a clone to be considered finished; however, we have determined that 96.2% of all submitted bases are double-stranded and 48% of clones had complete double-stranded coverage. Autofinish  was used to design primers to improve quality or extend sequence from multiple sequence contigs. cDNA clones with an estimated error rate greater than one in 50,000 bp were automatically identified and processed with additional rounds of Autofinish designed finishing work. If Autofinish could not design primers, custom primers were designed manually using consed. Custom scripts were used to manually order primers to generate a further round of sequencing.
The sequence data described in this paper have been submitted to the GenBank data library under accession numbers:
AF132562-AF132563, AF160906, AF160909,
AF145623-AF145684, AF160921, AF160923,
AF160879, AF160882, AF160933-AF160934,
AY070604-AY070608, AY071447-AY0 71450,
Analysis of finished cDNA sequences
cDNA sequence was submitted to GenBank with a preliminary annotation of the longest ORF and a gene assignment based on a high BLASTN similarity score to the Release 2 genome annotations. Subsequent processing was used to determine a more detailed analysis of the clone quality. Using BLASTN, sequence from each cDNA clone was compared to genomic sequence, predicted genes, predicted coding sequences (CDSs), known Drosophila transposable elements, and Escherichia coli transposable elements. Using BLASTP, the translation of the longest ORF was compared to the predicted Release 3 translations . Custom scripts were used to parse the BLAST output and record similarity results. We also compared the nucleotide sequence of each clone to the Release 3 genome sequence  using Sim4 and to the Release 3 predicted CDS with the highest BLAST score.
We confirmed the sequence quality of the genomic region encompassing CG018314 (12,731 bp) by independently assembling an 18,284 bp contig consisting solely of whole-genome shotgun (WGS) traces. The assembled sequence contig has an average of 8.6× sequence coverage. The phrap estimated error rate for each genomic base corresponding to a mRNA edited base is q90. Similarly, we determined the phrap estimated error rate for each mRNA edited base to be q90. We manually inspected chromatograms for high-quality discrepancies in the genomic sequence and found none, indicating that the edited bases are not due to population heterozygosity. To validate the editing sites, total RNA was isolated from heads from a mixed population of male and female adult flies from the isogenic strain y1; cn1 bw1 sp1 using the Concert™ Cytoplasmic RNA isolation reagent according to the manufacturer's guidelines (Invitrogen). Nine independent gene-specific RT-PCR reactions were performed using the Superscript™ one-step RT-PCR kit according to the manufacturer (Invitrogen) and PCR products were cloned into the PCR2.1 vector. Twenty-four independent subclones from each of four independent RT-PCR products were sequenced and twelve independent subclones from an additional five independent RT-PCR products were sequenced; we considered amplicons to represent independent transcripts if they arose from different RT-PCR reactions or if they differed in sequence. The gene-specific primers used in the RT-PCR experiments were 5'-GTGCAGACGAAAACGAGATGCCAATG-3' and 5'-TGTAGTTCTTCTCAAAGGGATTACG-3'.
We thank Catherine Nelson for critically reading and improving the manuscript, Sandeep Patel for help during the manual finishing phase of cDNA sequencing, Michelle Chew for excellent technical assistance in the early stages of this project, and the entire staff of the BDGP sequencing center. We also thank the FlyBase curators for their help in identifying the DGCr3, and Erwin Frise and Eric Smith for system administration. This work was supported by NIH Grant P50-HG00750 (GMR), Department of Energy Grant nos. DE-FG03-98ER62625 and DE-FG03-99ER62739 (GMR), and performed under Department of Energy Contract DE-AC0376SF00098, University of California.
Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, et al.: The Drosophila Gene Collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes.
Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher M, Blocker H, Bauersachs S, Blum H, et al.: Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs.
Butterfield YS, Marra MA, Asano JK, Chan SY, Guin R, Krzywinski MI, Lee SS, MacDonald KW, Mathewson CA, Olson TE, et al.: An efficient strategy for large-scale high-throughput transposon-mediated sequencing of cDNA clones.
Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, et al.: Finishing a whole shotgun sequence assembly: Release 3 of the Drosophila euchromatic sequence.
Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell K, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, et al.: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review.
Kaminker JS, Bergman C, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DL, Lewis SE, Rubin GM, et al.: The transposable elements of the Drosophila melanogaster euchromatin - a genomics perspective.
BD Creator™ Gene Expression Systems [http://www.clontech.com/products/families/creator/index.shtml] webcite
Kobilka BK, Dixon RA, Frielle T, Dohlman HG, Bolanowski MA, Sigal IS, Yang-Feng TL, Francke U, Caron MG, Lefkowitz RJ: cDNA for the human-adrenergic receptor: a protein with multiple β2 membrane-spanning domains and encoded by a gene whose chromosomal location is shared with that of the receptor for platelet-derived growth factor.
Kimmel B, Palazzolo MJ, Martin CH, Boeke JD, Devine SE: Transposon-mediated DNA sequencing. In In Genome Analysis: a laboratory manual. Analyzing DNA.. Edited by Birren B, Green ED, Klapholz S, Myers RM, Roskams J. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 1997::455-532.
Comput Appl Biosci 1995, 11:173-179. PubMed Abstract