<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
<ui>gb-2012-13-12-r119</ui>
<ji>1465-6906</ji>
<fm>
<dochead>Research</dochead>
<bibl>
<title><p>Mutation spectrum of <it>Drosophila </it>CNVs revealed by breakpoint sequencing</p></title>
<aug>
<au id="A1" ca="yes"><snm>Cardoso-Moreira</snm><fnm>Margarida</fnm><insr iid="I1"/><email>mmc256@cornell.edu</email></au>
<au id="A2"><snm>Arguello</snm><mnm>Roman</mnm><fnm>J</fnm><insr iid="I1"/><email>jra89@cornell.edu</email></au>
<au id="A3"><snm>Clark</snm><mi>G</mi><fnm>Andrew</fnm><insr iid="I1"/><email>ac347@cornell.edu</email></au>
</aug>
<insg>
<ins id="I1"><p>Department of Molecular Biology and Genetics, Cornell University, 526 Campus Road, Ithaca, NY 14853-2703, USA</p></ins>
</insg>
<source>Genome Biology</source>
<issn>1465-6906</issn>
<pubdate>2012</pubdate>
<volume>13</volume>
<issue>12</issue>
<fpage>R119</fpage>
<url>http://genomebiology.com/2013/13/12/R119</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2012-13-12-r119</pubid><pubid idtype="pmpid">23259534</pubid></pubidlist></xrefbib></bibl>
<history><rec><date><day>31</day><month>5</month><year>2012</year></date></rec><revrec><date><day>25</day><month>10</month><year>2012</year></date></revrec><acc><date><day>22</day><month>12</month><year>2012</year></date></acc><pub><date><day>22</day><month>12</month><year>2012</year></date></pub></history>
<cpyrt><year>2013</year><collab>Cardoso-Moreira et al.; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<kwdg>
<kwd>Copy number variants</kwd><kwd>CNVs</kwd><kwd>Non-allelic homologous-recombination</kwd><kwd>NAHR</kwd><kwd>Single-strand annealing</kwd><kwd>SSA</kwd><kwd>Non-homologous end-joining</kwd><kwd>NHEJ</kwd><kwd>Replication-associated repair</kwd><kwd>Alternative end-joining</kwd><kwd>Microhomology-mediated end-joining</kwd><kwd>MMEJ</kwd><kwd>Filler DNA</kwd>
</kwdg>
<abs>
<sec><st><p>Abstract</p></st>
<sec><st><p>Background</p></st>
<p>The detailed study of breakpoints associated with copy number variants (CNVs) can elucidate the mutational mechanisms that generate them, and the comparison of breakpoints across species can highlight differences in genomic architecture that may lead to lineage-specific differences in patterns of CNVs. Here, we provide a detailed analysis of <it>Drosophila </it>CNV breakpoints and contrast it with similar analyses recently carried out for the human genome.</p>
</sec>
<sec><st><p>Results</p></st>
<p>By applying split-read methods to a total of 10x coverage of 454 shotgun sequence across nine lines of <it>D. melanogaster </it>and by re-examining a previously published dataset of CNVs detected using tiling arrays, we identified the precise breakpoints of more than 600 insertions, deletions, and duplications. Contrasting these CNVs with those found in humans showed that in both taxa CNV breakpoints fall into three classes: blunt breakpoints; simple breakpoints associated with microhomology; and breakpoints with additional nucleotides inserted/deleted and no microhomology. In both taxa CNV breakpoints are enriched with non-B DNA sequence structures, which may impair DNA replication and/or repair. However, in contrast to human genomes, non-allelic homologous-recombination (NAHR) plays a negligible role in CNV formation in <it>Drosophila</it>. In flies, non-homologous repair mechanisms are responsible for simple, recurrent, and complex CNVs, including insertions of <it>de novo </it>sequence as large as 60 bp.</p>
</sec>
<sec><st><p>Conclusions</p></st>
<p>Humans and <it>Drosophila </it>differ considerably in the importance of homology-based mechanisms for the formation of CNVs, likely as a consequence of the differences in the abundance and distribution of both segmental duplications and transposable elements between the two genomes.</p>
</sec>
</sec>
</abs>
</fm>
<meta>
<classifications>
<classification subtype="pubmedcentral-release-delay-information" type="BMC">
<?release-delay 12|0 ?>
</classification>
</classifications>
</meta>
<bdy>
<sec><st><p>Background</p></st>
<p>One of the most surprising discoveries about genome sequence variation was the finding that copy number variants (CNVs; that is, duplications, deletions, and insertions) are widespread in eukaryotic genomes. CNVs have the potential to create novel genes, to alter gene structures, and/or to change gene regulation. As a result, CNVs can cause large phenotypic effects, ranging from highly deleterious <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>, to CNVs underlying adaptation to novel environments <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. The phenotypic effects of CNVs shape their genomic distribution: in natural populations, CNVs are strongly depleted among protein-coding genes and other functional elements of the genome <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. However, in addition to selection, mutational processes also impact the genomic distribution of CNVs <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. The distribution of these variants is not uniform across the genome; instead, CNVs accumulate in discrete regions as a consequence of local increases in the mutation rate. Consequently, current efforts aimed at the identification of the causal CNVs of both deleterious and adaptive phenotypes could be greatly enhanced by a better understanding of the mutational processes underlying the formation of CNVs and the genomic features associated with elevated mutation rates.</p>
<p>CNVs are formed when the repair of DNA breaks (mostly DNA double-strand breaks) is not perfect, leading to the creation of copy-number mutations. DNA double-strand breaks arise as part of the normal metabolism of the cell or as a consequence of ionizing radiation or reactive oxygen species <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. There are three molecular pathways available to repair the breaks, two that require sequence homology to perform the repair - homologous recombination (HR) and single-strand annealing (SSA) - and one that is homology-independent - non-homologous end-joining (NHEJ). Although both HR and SSA require sequence homology to repair DNA double-strand breaks, they differ in the extent of homology that is required: 100 to 200 bp for HR <it>versus </it>as little as 50 bp for SSA <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. Another difference is that while SSA always creates a deletion as a consequence of the repair (it is a mutagenic repair pathway), most of the time HR repairs the DNA break without generating any mutation. However, the existence of segmental duplications (also called low copy repeats (LCRs)) or transposable elements near the DNA break can lead to misalignments in the region. In this case, the repair occurs between misaligned repeats leading to the formation of duplications and deletions in a process known as non-allelic homologous recombination (NAHR). In the absence of sequence homology the cell can use non-homologous pathways to repair DNA double-strand breaks. NHEJ, like SSA, is mutagenic, usually resulting in nucleotide substitutions or small indels, but it can also create larger insertions and deletions. While NHEJ does not require sequence homology, a related alternative end-joining pathway, microhomology-mediated end-joining (MMEJ), uses microhomology to mediate the repair <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. 
The different molecular pathways are therefore associated with different types of breakpoints and classes of CNVs: NAHR is associated with large stretches of sequence identity and generates both duplications and deletions; SSA is associated with smaller stretches of sequence identity and only generates deletions; NHEJ and its associated pathways (for example, MMEJ) are associated with either the presence (2 to 10 bp) or the absence of microhomology, and mostly generate deletions and insertions (although they can also generate duplications) <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>.</p>
<p>In recent years, additional molecular mechanisms have been proposed to operate in association with replication-based repair and cause CNVs. These mechanisms were proposed following the observation that a subset of human CNVs are highly complex <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Such complex CNVs are hard to explain given the canonical HR (and the associated NAHR) and NHEJ pathways because they would require multiple DNA double-strand breaks. Furthermore, the analysis of the breakpoints of these CNVs suggested multiple rounds of strand invasion and the copying of nearby sequences <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>, signatures that could more easily be explained by replication forks stalling (or collapsing), and subsequently disengaging from the template and re-annealing. Three of the proposed models are: fork stalling and template switching (FoSTeS) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, microhomology-mediated break-induced replication (MMBIR) <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, and serial replication slippage (SRS) <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Although these models differ in specific details <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr></abbrgrp>, they are essentially indistinguishable in terms of breakpoint analysis. They all share the requirement that the re-annealing is mediated by microhomology, and they also suggest that templated DNA from nearby sequences can be introduced at the breakpoints <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr></abbrgrp>. Although these models have also been proposed to mediate the formation of simple CNVs, it is challenging to distinguish the signatures of these microhomology-mediated replication models from those of NHEJ (and associated MMEJ). 
In principle, one could distinguish between the two when there are additional nucleotides present at CNV breakpoints: replication-based models would predict that the additional nucleotides correspond to templated DNA (that is, the extra nucleotides were copied from a nearby location) while NHEJ/MMEJ would predict that the additional nucleotides correspond to filler DNA (that is, the extra nucleotides were randomly incorporated).</p>
<p>Most of the work in CNV breakpoint identification has been restricted to mammalian genomes, and in particular to the human genome <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. In humans (as in other mammals) CNVs are significantly enriched close to segmental duplications <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B12">12</abbr></abbrgrp>. These regions were initially proposed, and subsequently shown to be, CNV hotspots predominantly through facilitating NAHR <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B12">12</abbr></abbrgrp>. However, not all human CNV hotspots are associated with segmental duplications; in fact, a sizeable fraction is not <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B21">21</abbr></abbrgrp>. Here, we aim to further our understanding of the mutational mechanisms underlying the formation of CNVs by extending breakpoint analysis to the <it>D. melanogaster </it>genome. CNVs are as widespread in the fly as in mammalian genomes <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>, and CNV hotspots have been identified in both <it>D. melanogaster </it><abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and its sister species, <it>D. simulans </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Although patterns of copy number variation share many similarities between humans and flies, the two genomes have very different genomic architectures. For example, while segmental duplications comprise approximately 5% of the human and mouse genomes, they comprise only 1% of the fly genome <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Similarly, while transposable elements comprise approximately 50% of the human genome, they only correspond to 20% of the fly genome, where they are mostly restricted to pericentromeric regions and the fourth chromosome <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. 
(The same holds true for segmental duplications <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>.) Our goal was to take advantage of the differences in genome architecture between flies and humans in order to dissect the contribution of different genomic features to the formation of CNVs. We have done this by examining two distinct sets of CNVs: one generated using long Roche/454 sequencing reads <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> and the other using high-resolution tiling microarrays <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The use of these two datasets has enabled us to overcome many of the potential biases associated with either method when used alone. Our results indicate that fly CNVs share several of the striking characteristics observed for human CNVs: (1) a paucity of breakpoints associated with both microhomology and additional nucleotides inserted/deleted at the breakpoints; (2) an enrichment of non-B DNA sequences at the CNV breakpoints; and (3) a significant fraction of both recurrent and complex CNVs. Importantly, however, the different architectural organization of the fly genome does appear to shape patterns of copy number variation: homology-based pathways (notably NAHR) play a minor role in the formation of fly CNVs, including recurrent CNVs. Our data indicate that in flies non-homologous pathways underlie most CNV formation for both simple and complex events. One important consequence is that in flies most insertions do not correspond to duplications of previously existing sequence but are instead created <it>de novo </it>by the random insertion of nucleotides and/or small repeats from nearby sequences.</p>
</sec>
<sec><st><p>Results</p></st>
<sec><st><p>Precise breakpoint detection of CNVs from a 454 sequencing dataset</p></st>
<p>Sackton and colleagues sequenced at low coverage (approximately 1x) the genomes of nine <it>D. melanogaster </it>strains using Roche/454 technology <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. These genome sequences were used to evaluate the extent to which population genomic inferences could be made from low/sparse genomic coverage. Sackton and colleagues identified not only SNPs, but also transposable elements and CNVs. However, the latter were identified using a paired-end framework that did not provide the exact breakpoints of the CNVs. Here, we employ a different approach to detect CNVs based on split-read mapping that is capable of detecting CNVs with precise breakpoint resolution (that is, single nucleotide resolution). The minimum size at which a variant is considered a CNV rather than an indel is largely arbitrary and often reflects the resolution of the platform used to identify those variants. While initial CNV studies defined these variants as being at least 1 kb in length, more recent studies (for example, 1000 Genomes Project <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>) use 50 bp as the lower limit for calling a variant a CNV. In agreement with the previous literature on <it>Drosophila </it>CNVs <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B10">10</abbr></abbrgrp>, here we use 25 bp as the lower limit to classify insertions, deletions, and duplications as CNVs.</p>
<p>We downloaded the raw data for the nine genomes sequenced by Sackton and colleagues <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> and aligned the reads against the <it>D. melanogaster </it>reference genome using the aligner Mosaik <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. We discarded all reads that mapped to the reference genome and focused only on the subset of the reads that failed to map. We re-aligned these reads to the reference genome using BLAT <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> (see Methods). Because BLAT was designed to align mRNA onto genomic DNA, it does not penalize the existence of large gaps between the reads and the reference genome and provides the exact location of those gaps. By parsing the BLAT results we identified all reads that: (1) had a deletion larger than 25 bp in relation to the reference; (2) had an insertion larger than 25 bp in relation to the reference; and (3) mapped to two different locations with the 3' end of the read mapping 5' of the 5' end of the read (the pattern created by a tandem duplication). Because the nine genomes were sequenced at low coverage, our goal was not to identify all existing CNVs but instead to create a high-quality dataset of CNV breakpoints. To that effect, we applied a series of filters to minimize false-positive calls. Briefly, we required that each breakpoint was seen in at least two independent reads (from the same genome or from different genomes), that those two reads were not PCR duplicates, that the breakpoint was not located within the last 10 bp of the ends of the reads and that the breakpoint mapped to the euchromatic region of the genome. We also excluded from the dataset all deletions/insertions that corresponded to transposable element polymorphisms (that is, the deleted/inserted sequence mapped exclusively to annotated transposable elements). 
Finally, we identified the exact breakpoint configuration by re-aligning the reads supporting each of the breakpoints to the reference genome sequence using Clustal <abbrgrp><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>.</p>
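<p>As an illustration only, the gap-parsing step described above can be sketched in Python. This is not the pipeline used in this study: the function name, its arguments, and the assumption of plus-strand alignments are hypothetical, although the block sizes and start coordinates mirror the fields that BLAT reports in its PSL output. The 25 bp limit is treated as inclusive here, and the tandem duplication pattern, the two-read requirement, and the PCR-duplicate filter are not covered.</p>

```python
def find_gap_events(block_sizes, q_starts, t_starts, read_len,
                    min_size=25, end_margin=10):
    """Scan consecutive alignment blocks of one read for large gaps.

    block_sizes, q_starts, t_starts: per-block lengths and start
    coordinates in the read (q) and in the reference (t), as in PSL
    output. Assumes a plus-strand alignment (hypothetical simplification).
    Returns a list of (kind, size, ref_start, ref_end) tuples.
    """
    events = []
    for i in range(len(block_sizes) - 1):
        q_end = q_starts[i] + block_sizes[i]
        t_end = t_starts[i] + block_sizes[i]
        q_gap = q_starts[i + 1] - q_end   # read bases absent from reference
        t_gap = t_starts[i + 1] - t_end   # reference bases absent from read
        # Discard breakpoints within end_margin bp of either end of the read
        if end_margin > q_end or end_margin > read_len - q_starts[i + 1]:
            continue
        if t_gap >= min_size and q_gap == 0:
            events.append(("deletion", t_gap, t_end, t_starts[i + 1]))
        elif q_gap >= min_size and t_gap == 0:
            events.append(("insertion", q_gap, t_end, t_end))
    return events


# A 220 bp read whose two blocks are 40 bp apart on the reference
print(find_gap_events([100, 120], [0, 100], [1000, 1140], 220))
```

<p>Gaps with both read and reference bases missing are left unclassified in this sketch; in the study, the exact breakpoint configuration was resolved by re-aligning the supporting reads.</p>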
<p>Using this pipeline, we identified 447 deletions and 197 insertions larger than 25 bp segregating in the nine genomes. Because we required that at least two independent reads supported each breakpoint, we biased our sample toward CNVs segregating in multiple genomes (as opposed to being private to one of the genomes). A total of 72% of CNV calls are supported by reads from at least two of the nine genomes, with only 28% of the CNVs supported by multiple reads from the same genome. This result is expected given the sparseness of the genomic data.</p>
<p>We evaluated the quality of our calls by confirming a subset of these variants by PCR and Sanger sequencing. Out of 32 CNVs tested, all were confirmed by PCR and sequencing. Sanger sequencing supported not only the existence of the CNVs but also the precise breakpoint configuration. We tested an additional set of eight CNVs that were filtered out from the final dataset because the reads supporting them were potential PCR duplicates. Again, all eight CNVs were confirmed, suggesting this was a fairly conservative filter. However, because our pipeline was able to identify a large number of CNV breakpoints (<it>n </it>= 644), and because our focus is on inference of mechanisms of CNV formation from sequence patterns using high-confidence CNV calls, we favored the more conservative dataset that minimized the number of false-positives.</p>
<p>To investigate the existence of potential differences between the mutational mechanisms underlying the formation of insertions and deletions, we used the <it>D. simulans </it>reference genome, and a parsimony approach, to polarize the calls (see Methods). Out of 447 deletions, 338 were confirmed to be deletions segregating in the sequenced <it>D. melanogaster </it>strains, 13 were re-classified as insertions in the reference genome, and 96 could not be polarized. Out of 197 insertions, 37 were confirmed to be insertions segregating in the sequenced <it>D. melanogaster </it>strains, with 123 being re-classified as deletions in the reference genome, and 37 could not be polarized.</p>
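<p>The parsimony logic behind this polarization can be sketched as follows. The function and the state labels ('reference_like', 'strain_like') are hypothetical names introduced for illustration; how the <it>D. simulans </it>state was actually determined is described in the Methods. The counts are those reported in the paragraph above.</p>

```python
from collections import Counter


def polarize(call_type, outgroup_state):
    """Polarize a call made relative to the reference genome.

    call_type: 'deletion' or 'insertion' (strain versus reference).
    outgroup_state: 'reference_like' if the outgroup matches the
    reference allele, 'strain_like' if it matches the sequenced
    strains, or None if the outgroup is uninformative.
    """
    if outgroup_state is None:
        return "unpolarized"
    if call_type == "deletion":
        # The reference carries sequence that the strains lack
        if outgroup_state == "reference_like":
            return "deletion in strains"
        return "insertion in reference"
    if call_type == "insertion":
        # The strains carry sequence that the reference lacks
        if outgroup_state == "reference_like":
            return "insertion in strains"
        return "deletion in reference"
    raise ValueError("unknown call type: " + call_type)


# Counts reported in the text: (call type, outgroup state, number of calls)
observed = [
    ("deletion", "reference_like", 338),
    ("deletion", "strain_like", 13),
    ("deletion", None, 96),
    ("insertion", "reference_like", 37),
    ("insertion", "strain_like", 123),
    ("insertion", None, 37),
]

tally = Counter()
for call_type, state, n in observed:
    tally[polarize(call_type, state)] += n

for label, n in sorted(tally.items()):
    print(label, n)
```

<p>Under this tally, 50 calls are insertions after polarization (37 in the strains and 13 in the reference), 461 are deletions, and 133 remain unpolarized.</p>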
<p>Sizes of the identified insertions and deletions ranged from 25 bp to 7.5 kb, with a median size of 34 bp and a mean size of 76 bp. The split-read method imposes no limit to the size of the deletions detected, but insertions are only detected if they are completely encompassed within a read. For this reason, the largest insertion detected in comparison to the reference genome sequence (that is, before polarization) was only 64 bp. Figure <figr fid="F1">1</figr> shows the distribution of deletions, insertions, and unpolarized calls overlapping different functional contexts. Only nine of the 644 CNVs (1%) overlap coding exons: five are completely contained within the exon and four overlap both exonic and intronic sequence. All five CNVs located within coding exons have sizes that are multiples of three, suggesting they do not lead to frameshift mutations.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Genomic context of the CNVs detected in this study</p></caption><text>
   <p><b>Genomic context of the CNVs detected in this study</b>.</p>
</text><graphic file="gb-2012-13-12-r119-1"/></fig>
</sec>
<sec><st><p>Most insertions are not tandem duplications and correspond to <it>de novo </it>DNA</p></st>
<p>After polarization, our dataset included 50 insertions: 13 present in the reference genome sequence and 37 segregating in the strains sequenced. Of the 50 insertions, only two (4%) are tandem duplications, whereby the inserted sequence is a copy of a stretch of DNA already present in the genome (at a nearby location). Of the remaining 48 insertions, seven (14%) correspond to simple expansions of dinucleotides or small repeats flanking the insertions, and 41 (82%) have no match to the reference genome sequence and were thus classified as 'filler DNA' <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Filler DNA is a common outcome of the repair of DNA double-strand breaks by NHEJ in flies <abbrgrp><abbr bid="B33">33</abbr><abbr bid="B34">34</abbr></abbrgrp> and other organisms <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. Filler DNA has been observed in several studies of DNA repair that use artificial DNA constructs where DNA double-strand breaks are induced and the products of the DNA repair can be recovered and sequenced. In most cases, only a few nucleotides (or none) are added to the repaired junctions, but in some instances large insertions are created <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B33">33</abbr><abbr bid="B35">35</abbr></abbrgrp>.</p>
<p>Filler DNA has been proposed to also include rearrangements of direct and inverted repeats located in nearby sequences <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. We therefore investigated how much of each insertion classified as filler DNA could be attributed to both direct and inverted repeats present in its neighboring sequences. We considered four different window sizes to define neighboring sequences: 30 bp, 60 bp, 90 bp, and 120 bp directly upstream and downstream from the insertion breakpoints. We then quantified the number of nucleotides in the insertions that matched neighboring sequences (see Methods). We also applied this procedure to a set of 41,000 simulated insertions that we created by shuffling the genomic coordinates of the actual insertions within each chromosome (retaining the insertions sizes). The goal was to determine how much overlap between a given stretch of DNA and its neighboring sequences is expected by chance. The boxplots in Figure <figr fid="F2">2A</figr> show the distribution of the proportion of nucleotides in insertions (and nucleotides in the simulated insertions) that match neighboring sequences. For the two smallest window sizes (30 bp and 60 bp upstream and downstream from the insertions), the proportion of nucleotides in insertions that could be attributed to the copying of small stretches of DNA from neighboring sequences was significantly higher than what is expected by chance (Wilcoxon rank sum test, <it>P </it>= 0.002 and <it>P </it>= 0.03, respectively). Accordingly, there is an excess of insertions with nucleotides matching neighboring repeats over the random expectation for the smallest window size (30 bp): 46% of insertions have nucleotides that match repeats in neighboring sequences <it>vs</it>. 27% of random sequences (Fisher's exact test, <it>P </it>= 0.008; Figure <figr fid="F2">2B</figr>). 
When larger window sizes are considered, a much larger fraction of insertions (and of nucleotides within those insertions) matches repeats in neighboring sequences. However, this is not different from what is observed for the set of simulated insertions (Figures <figr fid="F2">2A</figr> and <figr fid="F2">2B</figr>). Importantly, the matching repeats are typically small stretches of DNA (approximately 7 to 13 bp) and so even when present, they represent only a small fraction of the total number of inserted bases.</p>
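<p>One possible way to quantify the proportion of inserted nucleotides that match neighboring sequences is sketched here. The matching rule is an assumption made for illustration and is not taken from the Methods: an inserted base counts as matched if it lies within a stretch of at least <it>k </it>bases that also occurs in the upstream or downstream window, on either strand (covering direct and inverted repeats). The default of <it>k </it>= 7 is likewise an assumption, chosen to reflect the lower end of the 7 to 13 bp matches mentioned above.</p>

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def matched_fraction(insertion, upstream, downstream, k=7):
    """Fraction of inserted bases covered by k-mers found in the flanks.

    upstream / downstream: the flanking windows (for example 30 bp each).
    Both strands of each window are searched; k is a hypothetical
    minimum match length, not a parameter from the study.
    """
    flanks = [upstream, downstream]
    flanks += [reverse_complement(f) for f in flanks]
    covered = [False] * len(insertion)
    for i in range(len(insertion) - k + 1):
        kmer = insertion[i:i + k]
        if any(kmer in flank for flank in flanks):
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(insertion)


# Toy example: the first 7 of 16 inserted bases copy a motif found upstream
print(matched_fraction("GATTACATTTTTTTTT", "CCCCCGATTACACCCC", "CCCCCCCCCC"))
```

<p>The same function could be applied to control sequences drawn from shuffled coordinates to obtain a chance expectation for each window size.</p>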
<fig id="F2"><title><p>Figure 2</p></title><caption><p>The contribution of nearby sequences to the formation of <it>de novo </it>insertions</p></caption><text>
   <p><b>The contribution of nearby sequences to the formation of <it>de novo </it>insertions</b>. (<b>A</b>) Proportion of nucleotides in insertions and matching controls that match small stretches of DNA present in nearby sequences for different window sizes (30 bp, 60 bp, 90 bp, and 120 bp windows). <it>P </it>values refer to Wilcoxon rank sum tests. (<b>B</b>) Percentage of insertions and matching controls that have at least one small stretch of DNA sequence also found in flanking regions for different window sizes (30 bp, 60 bp, 90 bp, and 120 bp windows). <it>P </it>values refer to Fisher's exact tests.</p>
</text><graphic file="gb-2012-13-12-r119-2"/></fig>
<p>These data suggest that most insertions in <it>D. melanogaster </it>do not correspond to tandem duplications or to expansions of di- or tri-nucleotides or repeats, but instead that they are the product of the random incorporation of nucleotides and of the copying of small stretches of DNA from nearby sequences as part of the process of DNA repair. Although anecdotal, the fact that the two tandem duplications identified are also two of the largest insertions might be interpreted as suggesting that larger insertions may indeed correspond mostly to tandem duplications while smaller insertions (that is, smaller than 60 bp) will mostly correspond to novel stretches of DNA sequence. The observation that most insertions in <it>Drosophila </it>correspond to novel DNA sequence contrasts with a previous observation made for the human genome, where most recent insertions (1 to 100 bp; appeared after the human-chimpanzee split) were determined to correspond to tandem duplications <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>.</p>
</sec>
<sec><st><p>Distinct classes of CNV breakpoints</p></st>
<p>The CNVs in our dataset fall into four breakpoint classes: (1) 41% have simple ends associated with small stretches of microhomology (minimum of 2 bp); (2) 35% have blunt ends; (3) 22% have complex ends with additional nucleotides added or deleted to the breakpoint; and (4) 2% have complex ends (nucleotides added or deleted) and are also associated with stretches of microhomology (Figure <figr fid="F3">3</figr>). Microhomology is almost exclusively associated with simple ends, with only 5% of the breakpoints with microhomology also having additional inserted/deleted nucleotides at the breakpoints. This result mirrors the observations made for human CNVs where only a minority of breakpoints with microhomology also had inserted/deleted nucleotides at the breakpoint (Table <tblr tid="T1">1</tblr> <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>). There are no differences between insertions and deletions in the relative proportions of the four different types of breakpoints (Chi-square test, <it>P </it>= 0.54).</p>
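<p>The four-way classification can be expressed as a simple decision rule. The sketch is illustrative: the function name and arguments are hypothetical, and it assumes that the length of microhomology and the number of added or deleted nucleotides have already been measured for each breakpoint. The counts used in the check are the <it>Drosophila </it>values from Table 1.</p>

```python
def classify_breakpoint(microhomology_bp, extra_bp, min_microhomology=2):
    """Assign a breakpoint to one of the four classes.

    microhomology_bp: identical bases shared across the junction.
    extra_bp: nucleotides added to or deleted from the breakpoint.
    min_microhomology: minimum length counted as microhomology (2 bp).
    """
    has_microhomology = microhomology_bp >= min_microhomology
    has_extra = extra_bp > 0
    if has_microhomology and has_extra:
        return "complex with microhomology"
    if has_microhomology:
        return "microhomology"
    if has_extra:
        return "complex"
    return "blunt"


# Drosophila counts from Table 1
n_microhomology_simple = 262
n_microhomology_complex = 14
share = 100 * n_microhomology_complex / (
    n_microhomology_simple + n_microhomology_complex)
print(round(share))  # percentage of microhomology breakpoints with extra bases
```

<p>With these counts, 14 of 276 breakpoints with microhomology also carry inserted or deleted nucleotides, which corresponds to the 5% quoted in the text.</p>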
<fig id="F3"><title><p>Figure 3</p></title><caption><p>Distribution of CNVs among the different classes of breakpoints</p></caption><text>
   <p><b>Distribution of CNVs among the different classes of breakpoints</b>.</p>
</text><graphic file="gb-2012-13-12-r119-3"/></fig>
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Comparison of the types of CNV breakpoints identified in <it>Drosophila </it>and humans.</p></caption><tblbdy cols="8">
      <r>
         <c ca="left">
            <p>
               <b>Type of breakpoint</b>
            </p>
         </c>
         <c cspan="2" ca="left">
            <p>
               <b>
                  <it>Drosophila</it>
               </b>
            </p>
         </c>
         <c cspan="2" ca="left">
            <p>
               <b>Human (Conrad <it>et al</it>.)</b>
               <sup>a</sup>
            </p>
         </c>
         <c cspan="2" ca="left">
            <p>
               <b>Human (Kidd <it>et al</it>.)</b>
               <sup>b</sup>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Molecular mechanism(s)</b>
            </p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="6">
            <hr/>
         </c>
         <c>
            <p/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="left">
            <p>
               <b>
                  <it>n</it>
               </b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>%</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>
                  <it>n</it>
               </b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>%</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>
                  <it>n</it>
               </b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>%</b>
            </p>
         </c>
         <c>
            <p/>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Blunt</p>
         </c>
         <c ca="left">
            <p>223</p>
         </c>
         <c ca="left">
            <p>35</p>
         </c>
         <c ca="left">
            <p>58</p>
         </c>
         <c ca="left">
            <p>19</p>
         </c>
         <c ca="left">
            <p>82</p>
         </c>
         <c ca="left">
            <p>11</p>
         </c>
         <c ca="left">
            <p>NHEJ</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Microhomology</p>
         </c>
         <c ca="left">
            <p>262</p>
         </c>
         <c ca="left">
            <p>41</p>
         </c>
         <c ca="left">
            <p>151</p>
         </c>
         <c ca="left">
            <p>50</p>
         </c>
         <c ca="left">
            <p>289</p>
         </c>
         <c ca="left">
            <p>39</p>
         </c>
         <c ca="left">
            <p>MMEJ, replication-associated repair</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Blunt and large stretches of sequence identity (&#8805;20 bp)</p>
         </c>
         <c ca="left">
            <p>2</p>
         </c>
         <c ca="left">
            <p>0.3</p>
         </c>
         <c ca="left">
            <p>3</p>
         </c>
         <c ca="left">
            <p>1</p>
         </c>
         <c ca="left">
            <p>219</p>
         </c>
         <c ca="left">
            <p>29</p>
         </c>
         <c ca="left">
            <p>SSA, NAHR, replication-associated repair</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Inserted/deleted bases</p>
         </c>
         <c ca="left">
            <p>143</p>
         </c>
         <c ca="left">
            <p>22</p>
         </c>
         <c ca="left">
            <p>81</p>
         </c>
         <c ca="left">
            <p>27</p>
         </c>
         <c ca="left">
            <p>153</p>
         </c>
         <c ca="left">
            <p>21</p>
         </c>
         <c ca="left">
            <p>NHEJ, replication-associated repair</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Inserted/deleted bases and microhomology</p>
         </c>
         <c ca="left">
            <p>14</p>
         </c>
         <c ca="left">
            <p>2</p>
         </c>
         <c ca="left">
            <p>9</p>
         </c>
         <c ca="left">
            <p>3</p>
         </c>
         <c ca="left">
            <p>3<sup>c</sup></p>
         </c>
         <c ca="left">
            <p>0.4</p>
         </c>
         <c ca="left">
            <p>MMEJ, replication-associated repair</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Total</p>
         </c>
         <c ca="left">
            <p>644</p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="left">
            <p>302</p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="left">
            <p>743</p>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The dataset of Conrad <it>et al</it>. refers only to deletions while the dataset of Kidd <it>et al</it>. includes both deletions and insertions. Excluded from both datasets of human CNVs were those variants classified as VNTRs (variable number of tandem repeats) and as transposable elements insertions. Further excluded from the dataset of Conrad <it>et al</it>. were 13 deletions that were also associated with inversions. Because Conrad <it>et al</it>. required only 1 bp of identical sequence to call a breakpoint as being associated with microhomology, we re-classified the entire dataset so that only deletions associated with at least 2 bp of identical sequence at the breakpoint were classified as being associated with microhomology.</p>
   </tblfn></tbl>
<p>The definition of what constitutes a breakpoint associated with microhomology differs across studies, with some authors requiring only 1 bp of identical sequence at the breakpoint (for example, <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>), while others require a minimum of 2 bp (for example, <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>). In order to determine the minimum number of identical nucleotides present at a breakpoint that are functionally relevant for CNV formation, we determined the number of nucleotides associated with three distinct types of microhomology for each of the 644 breakpoints in our dataset (Additional file <supplr sid="S1">1</supplr>, Figure S1). Microhomology of type I is the mechanistically relevant form of microhomology associated with CNV formation: the deletion occurs between two sequences with microhomology such that one of the sequences becomes part of the deletion (the converse occurs in the case of an insertion). Microhomologies of types II and III (Additional file <supplr sid="S1">1</supplr>, Figure S1A) are not mechanistically associated with the formation of CNVs but can be used to determine the empirical expectation of finding a similar sequence of <it>n </it>nucleotides close to the breakpoints by chance. As shown in Additional file <supplr sid="S1">1</supplr>, Figure S1B, only for 2 bp or more do we find a significant excess of microhomology of type I <it>versus </it>the other two types (proportions test, <it>P </it>= 2.2 &#215; 10<sup>-12</sup>). As a result, in this study we required a minimum of 2 bp of identical sequence to classify a breakpoint as showing evidence for microhomology. Of the 248 deletions associated with microhomology, only two have a stretch of microhomology &gt;20 bp. Thus, at most only two of the deletions in our dataset could have been created by SSA. 
This is likely to be an overestimate because previous work in <it>Drosophila </it>has suggested that larger stretches of sequence identity are required to mediate SSA <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. All other CNVs associated with microhomology are consequently the product of NHEJ, MMEJ, or replication-associated repair.</p>
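<p>As an illustration only (this is not the authors' pipeline), the 2 bp rule described above can be sketched in Python for a deletion with known reference coordinates. The function names, the 0-based half-open coordinate convention, and the toy sequence are assumptions made for this example. The sketch counts identical bases shared between one end of the deleted segment and the flank adjoining the other end, which is one simple reading of type I microhomology.</p>

```python
def microhomology_length(ref, start, end):
    """Count identical bases at the junction of a deletion of ref[start:end].

    Hypothetical helper: coordinates are 0-based and half-open.
    """
    size = end - start
    # Bases at the start of the deleted segment that match the right flank.
    right = 0
    while (size > right and len(ref) > end + right
           and ref[start + right] == ref[end + right]):
        right += 1
    # Bases at the end of the deleted segment that match the left flank.
    left = 0
    while (size > right + left and start - 1 - left >= 0
           and ref[start - 1 - left] == ref[end - 1 - left]):
        left += 1
    return right + left


def classify_breakpoint(ref, start, end, min_bp=2):
    """Apply the minimum-length rule (2 bp in the text) to one deletion."""
    n = microhomology_length(ref, start, end)
    return "microhomology" if n >= min_bp else "blunt"


# Toy reference: flank GG, deleted ATCCCCC, flank ATGG (AT is shared).
toy = "GGATCCCCCATGG"
print(microhomology_length(toy, 2, 9), classify_breakpoint(toy, 2, 9))
```

<p>Because the same deletion product can be described with the shared bases placed on either side of the junction, the sketch sums matches in both directions; under this convention a single shared base is still reported as blunt.</p>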
<suppl id="S1">
<title><p>Supplementary Figure 1</p></title>
<text><p><b>Evaluation of the minimum number of identical nucleotides present at the breakpoint that is required for microhomology-mediated CNV formation</b>. (<b>A</b>) Schematic representation of the different classes of microhomology (type I refers to the mechanistically relevant form of microhomology associated with CNV formation). (<b>B</b>) Number of breakpoints showing <it>n </it>identical nucleotides for the three classes of microhomology.</p></text>
<file name="gb-2012-13-12-r119-S1.PPT">
   <p>Click here for file</p>
</file>
</suppl>
<p>CNVs whose breakpoints harbor complex ends (that is, additional bases present at the breakpoint) are significantly larger than CNVs associated with blunt ends, irrespective of the presence/absence of microhomology (median size of 43 bp <it>vs</it>. 32 bp, Wilcoxon rank sum test <it>P </it>= 1 &#215; 10<sup>-13</sup>). For 10 of 157 breakpoints with complex ends, the stretches of additional nucleotides inserted are large enough (&gt;20 bp) that they could potentially be mapped to the genome. If replication-based repair mechanisms are involved, the sequences of inserted bases are expected to map to the genome, often close to the deleted sequences. If NHEJ (or a form of alternative end-joining) is involved, the inserted bases should correspond to randomly inserted nucleotides and/or to rearrangements of repeats from nearby sequences (as seen for most insertions). There is no good genomic sequence match for any of the stretches of inserted bases. Furthermore, for seven of the 10 breakpoints, there are small stretches of identity between the inserted bases and nearby sequences that resemble the type of alignments seen between <it>de novo </it>DNA insertions and nearby sequences. These data favor the hypothesis that these CNVs are a consequence of NHEJ or alternative end-joining repair.</p>
<p>Table <tblr tid="T1">1</tblr> compares the types of breakpoints identified in this study with those of two previous surveys of human CNV breakpoints <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. There are two main differences between the types of breakpoints observed in <it>Drosophila </it>and in humans. The first is that in <it>Drosophila </it>there is a higher proportion of blunt breakpoints, a common outcome of NHEJ (35% in <it>Drosophila </it><it>vs</it>. 11% to 19% in humans). The second is that in <it>Drosophila </it>breakpoints are rarely associated with large stretches of high sequence identity, the hallmark of SSA and NAHR, while in humans Kidd and colleagues found that almost one-third of all breakpoints bore the hallmarks of these pathways <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. As is clear from Table <tblr tid="T1">1</tblr>, the two surveys of human CNVs found very different proportions of breakpoints potentially associated with NAHR (1% in Conrad <it>et al</it>. <it>vs</it>. 29% in Kidd <it>et al</it>.). This difference is likely a consequence of the different experimental approaches used in the two studies. Conrad and colleagues used a microarray capture strategy to identify the breakpoints of a subset of CNVs identified in a previous study <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, which may have biased their sample against CNVs associated with NAHR. Kidd and colleagues, on the other hand, identified CNV breakpoints using capillary sequencing of fosmid clone inserts, a powerful approach to sample the full spectrum of CNVs. Further support for a sizeable portion of human CNVs being associated with NAHR comes from two other studies: one estimated that approximately 28% of breakpoints are associated with NAHR <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and the other put it closer to 10% to 15% <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. 
Motivated by the observation that different technological approaches can produce different results regarding the role played by NAHR in the formation of CNVs, we decided to re-analyze a dataset of 3,639 <it>Drosophila </it>CNVs identified using high-resolution tiling arrays <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and determine if the observation that there is a paucity of <it>Drosophila </it>CNVs associated with NAHR is robust to the CNV detection platform used.</p>
</sec>
<sec><st><p>NAHR plays a minor role in the formation of CNVs in <it>Drosophila</it></p></st>
<p>Emerson and colleagues <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> used tiling arrays covering the <it>Drosophila </it>genome at a resolution of 36 bp to identify 3,639 CNVs (2,211 duplications and 1,428 deletions) segregating in the genomes of 15 worldwide strains of <it>D. melanogaster</it>. Microarrays can only probe unique regions of the genome (that is, the probes in the microarray have to map to a unique genomic location), which means that they are biased against detecting additional duplications of regions of the genome that have already been recently duplicated. However, they are unbiased at detecting duplications of unique sequence where copy number changes from one copy (two copies in a diploid genome) to two copies (three or four copies in a diploid genome depending on the duplication being homozygous or heterozygous), irrespective of the presence/absence of flanking duplications. Therefore, we examined the breakpoints of these 3,639 CNVs for the presence of stretches of high-sequence identity in order to determine the contribution of homology-based mechanisms (such as SSA and NAHR) to the formation of CNVs in <it>Drosophila</it>.</p>
<p>Unlike the 454 data, microarray data do not provide the exact breakpoint location. As a consequence, to look for the presence of stretches of high-sequence identity we considered the sequences 500 bp upstream and downstream of the predicted CNV breakpoint and the CNV sequence itself. We looked for two types of sequence homology: (1) stretches at least 30 bp in size with a sequence identity of at least 98% (type I; hallmark of SSA); and (2) stretches at least 200 bp in size with a sequence identity of at least 95% (type II; hallmark of both NAHR and SSA) (see Methods).</p>
<p>We found that only 2% (74/3639) of all CNVs were associated with sequence homology of type I (capable of mediating SSA), and 2.6% (95/3639) with sequence homology of type II (capable of mediating SSA or NAHR). Because deletions in this dataset were associated with a high false-positive rate (47%), we also restricted these analyses to duplications (false-positive rate of 14%). Among the set of duplications, only 2.1% (46/2211) are associated with sequence homology of type I, and 2.3% (51/2211) with sequence homology of type II. Therefore, these results support the observation made using the 454 reads that homology-based mechanisms (SSA and NAHR) play a very limited role in the formation of CNVs in <it>Drosophila</it>.</p>
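<p>The two homology criteria and the reported proportions can be restated as a short Python sketch. The function names are hypothetical and the thresholds are the ones given in the text (at least 30 bp at 98% identity or more for type I; at least 200 bp at 95% identity or more for type II); the counts are those quoted above. This is a minimal illustration under those assumptions, not the code used in the study.</p>

```python
def homology_types(length_bp, identity_pct):
    """Return which homology criteria a matching stretch satisfies."""
    types = []
    if length_bp >= 30 and identity_pct >= 98.0:
        types.append("type I")    # long enough to mediate SSA
    if length_bp >= 200 and identity_pct >= 95.0:
        types.append("type II")   # long enough to mediate SSA or NAHR
    return types


def percent(count, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Counts quoted in the text for all CNVs and for duplications only.
for label, count, total in [
    ("all CNVs, type I", 74, 3639),
    ("all CNVs, type II", 95, 3639),
    ("duplications, type I", 46, 2211),
    ("duplications, type II", 51, 2211),
]:
    print(label, percent(count, total))
```

<p>Note that a single stretch can satisfy both criteria, so the two categories are not mutually exclusive in this sketch.</p>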
<p>Because both next generation sequencing and microarray technologies are biased against the detection of CNVs in non-unique regions of the genome (that is, segmental duplications and transposable elements) inferences about the importance of homology-based mechanisms are necessarily restricted to unique regions of the genome. However, unlike the human genome where segmental duplications and transposable elements can be found throughout the euchromatin, in <it>Drosophila </it>most repetitive elements are confined to the regions surrounding the centromeres (which have very low rates of recombination) with only a minority of these elements present in regions of the euchromatin with normal recombination rates <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>. Hence, our work suggests that outside of pericentromeric and telomeric regions, homology-based mechanisms play a minor role in CNV formation in <it>Drosophila</it>.</p>
</sec>
<sec><st><p>Very high rate of CNV recurrence in <it>Drosophila</it></p></st>
<p>CNVs are classified as recurrent when different individuals carry independent but overlapping CNVs. The proportion of recurrent CNVs in the human genome has been estimated to be between 6% and 29% <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The sparseness of the 454 dataset prevents us from estimating from these data the proportion of recurrent CNVs in <it>Drosophila</it>. Therefore, in order to evaluate whether CNV recurrence is a common phenomenon in this taxon, we selected 26 genomic regions known to harbor at least one deletion in at least two strains based on the high-resolution tiling array dataset, and screened them in 15 worldwide strains by PCR and Sanger sequencing. These deletions are all located in the euchromatin, their mean size is similar to the mean size of the whole set of deletions, and they were predicted to range in frequency from 2 to 11 (median 2). Among the 26 regions, 12 (46%) harbored more than one overlapping CNV, suggesting a high rate of CNV recurrence in <it>Drosophila</it>.</p>
<p>Sanger sequencing of these 26 regions showed that the CNVs identified with the tiling arrays have identical characteristics to those identified with the 454 reads. There is no difference in the distribution of breakpoint types present in the 454 dataset and in the set of 36 CNVs (33 deletions and three insertions) segregating in the 26 regions described above (a total of 42 CNVs were detected but for six (mostly tandem duplications) the breakpoints were not fully sequenced). In addition, similar to what was observed in the 454 dataset, CNVs with breakpoints harboring additional bases were, on average, larger than CNVs with simple breakpoints (that is, blunt ends with or without microhomology) (median 432 bp <it>vs</it>. 211 bp, respectively; Wilcoxon rank sum test, <it>P </it>= 0.005).</p>
<p>There was no difference in the distribution of breakpoint types between recurrent and non-recurrent CNVs. Furthermore, just as seen for the non-recurrent set, the recurrent CNVs were not associated with large stretches of sequence identity that might suggest their generation through NAHR. Instead, these data suggest that recurrent CNVs are mediated by non-homologous repair mechanisms. Among the 12 regions showing recurrent CNVs, three also show evidence for the presence of complex CNVs. These occur when a single mutational event generates several breakpoints, that is, multiple closely located CNVs segregating within the same individual. In these three regions the distance between distinct breakpoints ranged from 82 bp to 325 bp. This association between complex CNVs (multiple CNVs within the same individual) and recurrent CNVs (multiple CNVs segregating in different individuals) suggests that some regions of the <it>Drosophila </it>genome are particularly unstable, and generate both complex events within individuals and independent but overlapping mutations between individuals. Though the sample size is small, these data suggest that complex CNVs may correspond to as much as 12% (3/26) of all <it>Drosophila </it>CNVs, a higher proportion than the 5% estimated for the human genome <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
</sec>
<sec><st><p>Non-B DNA structures are enriched in CNV breakpoints</p></st>
<p>DNA conformations that do not correspond to the right-handed Watson-Crick double-helix are collectively termed non-B DNA <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>. These include sequences with Z-DNA motifs, quadruplex-forming motifs, inverted repeats, mirror repeats, and direct repeats <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. Non-B DNA sequences have been found associated with the causal variants of several human diseases and have been proposed to cause genetic instability by impairing DNA repair and DNA replication <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>. Because errors during DNA repair and DNA replication are the ultimate causes of CNVs, we tested for the presence of non-B DNA sequence at the CNV breakpoints identified using the 454 data.</p>
<p>We focused on those variants that were detected by the presence of gaps in the reads of the sequenced genomes in comparison to the reference genome (<it>n </it>= 447) so that we could extract the CNV region and the flanking regions directly from the reference genome. For a control dataset, we shuffled the coordinates of the CNV breakpoints (25 bp within the CNV and the 200 bp immediately flanking it 5' and 3') randomly within chromosomes, so that there were 10 times more control sequences than CNV breakpoints. For both CNV breakpoints and control sequences, we identified non-B DNA sequences using the non-B DNA Motif Search Tool <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. Figure <figr fid="F4">4</figr> shows the distribution of non-B DNA sequences across the 200 bp flanking the CNVs. In strong contrast to control sequences (in grey), which show a uniform distribution of non-B DNA sequences across their length, CNVs (in red) are enriched with non-B DNA sequences precisely at the breakpoints. Furthermore, there is a significantly higher number of CNV breakpoints (defined as the region spanning 25 bp within the CNV and the 25 bp immediately flanking it) associated with non-B DNA structures when compared to the control sequences: 11% <it>vs</it>. 5% (Fisher's exact test, <it>P </it>= 1.3 &#215; 10<sup>-5</sup>).</p>
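<p>The construction of the control set can be sketched as follows, assuming a simple uniform placement within the same chromosome. The function name, the example windows, and the chromosome lengths are hypothetical values chosen for illustration; the sketch keeps each window's width and draws 10 controls per breakpoint window, as in the text, but it does not reproduce any additional exclusions the original analysis may have applied.</p>

```python
import random


def shuffle_within_chromosome(windows, chrom_sizes, n_per_window=10, seed=0):
    """Draw control windows of the same width on the same chromosome.

    windows: list of (chrom, start, end), 0-based half-open.
    chrom_sizes: dict of chromosome name to length (hypothetical here).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    controls = []
    for chrom, start, end in windows:
        width = end - start
        for _ in range(n_per_window):
            new_start = rng.randint(0, chrom_sizes[chrom] - width)
            controls.append((chrom, new_start, new_start + width))
    return controls


# Two illustrative 225 bp breakpoint windows (25 bp inside plus 200 bp flank).
example_windows = [("2L", 1000, 1225), ("X", 5000, 5225)]
example_sizes = {"2L": 100000, "X": 80000}  # made-up lengths, not real ones
controls = shuffle_within_chromosome(example_windows, example_sizes)
print(len(controls))
```

<p>With 447 variants and 10 controls each, this procedure yields the 4,470 control sequences mentioned in the legend of Figure 4.</p>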
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Distribution of non-B DNA sequences in the regions surrounding CNVs</p></caption><text>
   <p><b>Distribution of non-B DNA sequences in the regions surrounding CNVs</b>. In grey is the background expectation (determined from 4,470 control sequences) for the presence of non-B DNA sequences in a given stretch of DNA and in red the actual distribution of non-B DNA sequences surrounding the set of CNVs identified in the 454 dataset.</p>
</text><graphic file="gb-2012-13-12-r119-4"/></fig>
<p>Some classes of non-B DNA sequences are more common than others (in both CNVs and control sequences), but for most we found a shift in the location of these repeats/motifs towards the CNV breakpoint when compared to the control sequences (Additional file <supplr sid="S2">2</supplr>, Figure S2), suggesting that most classes of non-B DNA sequences are associated with CNV formation. We found the non-B DNA sequences equally associated with the three classes of breakpoints (that is, blunt breakpoints, breakpoints associated with microhomology, and breakpoints containing additional nucleotides inserted or deleted; Fisher's exact test, <it>P </it>= 0.98). However, we found a significantly higher proportion of insertions than deletions associated with non-B DNA sequences (Fisher's exact test, <it>P </it>= 0.002). The presence of non-B DNA sequences at a significant fraction of CNV breakpoints suggests a potential causal role for these sequences in CNV formation in flies.</p>
<suppl id="S2">
<title><p>Supplementary Figure 2</p></title>
<text><p><b>Distribution of non-B DNA motifs in relation to CNV breakpoints</b>. The beanplots in orange refer to distribution of non-B DNA repeats in the sequences flanking the CNV breakpoints (combines upstream and downstream sequences) while the beanplots in grey refer to control sequences. The red line marks the location of the CNV breakpoint (at position 25 bp of 225 bp of total sequence). Small lines refer to individual observations (control sequences have 10x more data) while the longer black line refers to the average of the distribution. Each beanplot refers to a specific type of non-B DNA motif.</p></text>
<file name="gb-2012-13-12-r119-S2.PPT">
   <p>Click here for file</p>
</file>
</suppl>
</sec>
</sec>
<sec><st><p>Discussion</p></st>
<p>The detailed analysis of <it>Drosophila </it>CNV breakpoints suggests that non-homologous repair mechanisms are responsible for the formation of the majority of the variants. This is true not only for simple CNVs, but also for those that are recurrent and complex. We excluded a significant role for homology-based pathways (that is, NAHR and SSA) in the formation of CNVs because only a minority of these variants (approximately 3%) are flanked by stretches of high sequence identity. We also found little support for replication-associated mechanisms; the large stretches of additional nucleotides present at 10 breakpoints consisted of filler DNA, a result more consistent with NHEJ than with replication-associated repair. The presence of microhomology at CNV breakpoints is, however, consistent with NHEJ, MMEJ, and replication-associated repair (for example, <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>). Determining exactly which pathway(s) are responsible for the different types of CNV breakpoints identified in our study will require the analysis of CNV breakpoints from fly mutants lacking the specific genetic requirements for each pathway (for example, <abbrgrp><abbr bid="B42">42</abbr><abbr bid="B43">43</abbr></abbrgrp>).</p>
<p>In the human and mouse genomes, NHEJ/MMEJ also underlie a large fraction of CNVs, though a sizeable fraction of CNVs are also mediated by NAHR (approximately 18% to 35%) <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B44">44</abbr></abbrgrp>. The difference in the preponderance of NHEJ/MMEJ in flies and mammalian genomes does not have to reflect intrinsic differences between these taxa in the relative usage of the different repair pathways (HR <it>vs</it>. NHEJ). In fact, NAHR and SSA are highly efficient in repairing DNA double-strand breaks in flies when these occur in artificial constructs flanked by repeats that can mediate these pathways <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. Instead, the difference we observe between the taxa in the preponderance of homology-based mechanisms to the formation of CNVs likely reflects the different genomic architectures of the genomes: abundant and widespread presence of segmental duplications and transposable elements throughout mammalian genomes and less abundant and more restricted location (to pericentromeric and telomeric regions) of these elements in the <it>Drosophila </it>genome.</p>
<p>The <it>Drosophila </it>CNVs used in this work are significantly smaller than the published human CNVs. As a consequence, there is the possibility that some of the differences found between the two taxa may reflect different mutational mechanisms operating on CNVs of different size. We note, however, that in flies we found no differences between the breakpoints identified using the 454 data and those found using the high-resolution tiling array data despite the fact that the latter are significantly larger. Although the CNVs identified with the high-resolution tiling arrays are still smaller than those identified for the human genome, their size range already shows a significant overlap with that of the human CNVs used in this study. The absence of large CNVs in the fly genome likely reflects the much higher compactness of this genome (a much higher gene density means that large CNVs would overlap multiple genes) and the greater strength of purifying selection.</p>
<p>We have attempted to circumvent technical biases in CNV detection by making sure that our observations were robust to the platforms used to identify CNVs (that is, both next generation sequencing and hybridization-based platforms). Still, our conclusions are necessarily restricted to the euchromatic sequence located outside of both pericentromeric and telomeric regions. The latter are highly enriched with transposable elements and segmental duplications, making CNV detection extremely challenging irrespective of the platform used. The analysis of the CNV data generated by tiling arrays suggests both a high level of CNV recurrence and complexity (and coupling between the two) that can only be fully explored with technologies that are not biased against the detection of these classes of variants. The development of strobe sequencing technology and of ever larger reads <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> will greatly enhance this effort. By collecting a large number of breakpoints from a sparse low-coverage genomic dataset, we have demonstrated that the analysis of CNV breakpoints does not depend on high-coverage datasets; instead, read size is likely to matter the most.</p>
<p>Despite a minor role for NAHR, the fly genome is still punctuated by CNV hotspots <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>, that is, regions of the genome experiencing higher CNV mutation rates. These hotspots may share the properties of the mammalian CNV hotspots that are not associated with NAHR <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B45">45</abbr></abbrgrp>. Unlike NAHR-mediated hotspots, where the reason for genomic instability is relatively well understood (that is, the presence of repeats leads to misalignments between homologous regions during DNA repair leading to further duplications/deletions), it is not known what causes the instability associated with the remaining hotspots. One possibility is that these correspond to regions more prone to DNA breaks and/or that are harder to repair. In support of this hypothesis we do find that non-B DNA sequences, which are capable of impairing both DNA replication and DNA repair, are significantly enriched at CNV breakpoints.</p>
<p>One of the most surprising results stemming from this work is that in <it>Drosophila </it>most insertions correspond to <it>de novo </it>sequences. These novel stretches of DNA sequence are as large as 60 bp in our dataset, potentially translating into the addition of 20 novel amino acids to a protein sequence. While it is likely that most of these insertions are deleterious, the occasional beneficial mutation could dramatically change the protein sequence in one single mutational event. This would allow very fast protein sequence evolution between closely related species. The frequent creation of novel stretches of DNA sequence as a consequence of DNA repair could have implications for the generation of genetic novelties and genome evolution in general.</p>
</sec>
<sec><st><p>Conclusions</p></st>
<p>Our results suggest that the different architectural features of the <it>Drosophila </it>and human genomes shape the mutation processes responsible for generating duplications, deletions, and insertions. Homology-based pathways contribute significantly more to the formation of CNVs in humans than <it>Drosophila </it>because of the abundance and widespread presence of segmental duplications and transposable elements in humans that can mediate HR. Instead, non-homologous repair is responsible for most CNVs in flies, including complex and recurrent CNVs. Non-homologous repair is also responsible for the creation of insertions made of <it>de novo </it>sequence, which have the potential to mediate rapid protein evolution. In addition, we show that non-B DNA sequences are enriched at CNV breakpoints, which makes these sequences good candidates for being associated with regions of higher CNV instability.</p>
</sec>
<sec><st><p>Methods</p></st>
<sec><st><p>Detection of CNV breakpoints in the 454 data</p></st>
<p>Split-read methods were first applied to long Sanger sequencing reads <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>, and CNVs were detected by identifying those reads which, when mapped to the reference genome, exhibit a 'split' signature: either a gap in the reference genome (which suggests an insertion in the read, Additional file <supplr sid="S3">3</supplr>, Figure S3), a gap in the read (which suggests a deletion in the read, Additional file <supplr sid="S3">3</supplr>, Figure S3), or two sections of the read mapping to the genome with their positions flipped (which suggests a tandem duplication). Until recently, split-read methods were not widely used because of the small size of the reads produced by next generation sequencing platforms. Roche/454 technology is capable of generating reads &gt;100 bp in size; however, eukaryotic genome sequencing projects have predominantly relied on Illumina reads, which only recently achieved the 100 bp mark <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. With these longer reads, split-read methods can readily identify with precise resolution the breakpoints of duplications, deletions, insertions, inversions, and translocations, as recently shown by the 1000 Genomes Project <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
<suppl id="S3">
<title><p>Supplementary Figure 3</p></title>
<text><p><b>Description of the split-read approach used to detect deletions and insertions and the rationale for polarizing the CNV calls</b>.</p></text>
<file name="gb-2012-13-12-r119-S3.PPT">
   <p>Click here for file</p>
</file>
</suppl>
<p>Sackton and colleagues used 454 technology to sequence at low coverage (approximately 0.2x) the genomes of nine <it>D. melanogaster </it>strains (three from an African population and six from a North Carolina population) <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. We downloaded the original 454 reads (mean read size of 105 bp) from the Short Read Archive (SRP001156) and aligned them to release 5 of the <it>D. melanogaster </it>genome using Mosaik (version 1.1.0021) <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. We used the following Mosaik parameters to conduct the alignments: -hs 15 -mmp 0.05 -mhp 100 -act 26 -p 8 -bw 51. We discarded all reads that Mosaik mapped to the genome and kept only those that could not be mapped. We then used BLAT (version 3.4) <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> to map the latter reads. We ran BLAT using two sets of parameters: -fastMap and -oneOff=1. Finally, we detected CNV breakpoints with custom Perl scripts that parse the BLAT output and identify the split-read signature detailed in the Results section.</p>
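<p>The split-read signature can be illustrated with a small Python sketch. The original analysis used custom Perl scripts that are not reproduced here; the function name and the tuple layout for an aligned block are assumptions made for this example, and the sketch ignores complications such as strand, overlapping blocks, and alignments with more than two blocks.</p>

```python
def classify_split(first, second):
    """Classify a read aligned to the reference as two blocks.

    Each block is (read_start, read_end, ref_start, ref_end), 0-based and
    half-open, on the same strand; `first` precedes `second` in the read.
    This tuple layout is hypothetical, not an aligner's output format.
    """
    read_gap = second[0] - first[1]  # unaligned read bases between blocks
    ref_gap = second[2] - first[3]   # reference bases skipped between blocks
    if first[2] > second[2]:
        # The later part of the read maps upstream of the earlier part.
        return "tandem duplication"
    if ref_gap > 0 and read_gap == 0:
        return "deletion"
    if read_gap > 0 and ref_gap == 0:
        return "insertion"
    if read_gap > 0 and ref_gap > 0:
        return "deletion with inserted bases"
    return "unclassified"


# A 105 bp read whose two halves map 250 bp apart on the reference.
print(classify_split((0, 50, 1000, 1050), (50, 105, 1300, 1355)))
```

<p>In this toy representation the breakpoint coordinates follow directly from the block ends, which is why split reads give base-pair resolution when the read is long enough to anchor both sides.</p>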
<p>As discussed in the Results section, we applied a series of filters in an attempt to minimize the number of false-positive calls. One of the filters applied was the removal of all CNVs supported exclusively by reads that could be PCR duplicates; these were defined as reads with the same exact start position but that could vary in their end position. Because of this filter all seven putative tandem duplications identified by our pipeline were excluded from the final CNV dataset. Another filter was the exclusion of all CNVs where at least 80% of the mutated sequence mapped to a known TE (TE annotation downloaded from FlyBase <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> release 5.29). After CNVs were identified, we re-aligned all supporting reads once again to the reference <it>D. melanogaster </it>genome using Clustal <abbrgrp><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp> and those were the alignments used to classify the CNVs into the four classes of breakpoints. The CNV calls were polarized using the syntenic alignments between <it>D. melanogaster </it>and <it>D. simulans </it><abbrgrp><abbr bid="B47">47</abbr></abbrgrp>. If a deletion is called in one of the nine sequenced genomes but is also present (with similar breakpoints) in the <it>D. simulans </it>genome, then the most parsimonious explanation is that the variant is actually an insertion in the reference <it>D. melanogaster </it>genome (Additional file <supplr sid="S3">3</supplr>, Figure S3). Similarly, if an insertion is called in one of the nine genomes but a similar insertion is found in <it>D. simulans</it>, then the most parsimonious explanation is that we are detecting a deletion in the reference <it>D. melanogaster </it>genome (Additional file <supplr sid="S3">3</supplr>, Figure S3). CNVs were annotated (as exonic, intronic, and intergenic) using release 5.33 (retrieved from FlyBase <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>).</p>
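<p>Two of the rules in this paragraph, the transposable element filter and the polarization of calls against <it>D. simulans</it>, lend themselves to a compact sketch. The function names and the interval representation are hypothetical, and the example assumes that TE overlap is measured as the fraction of CNV bases covered by any annotated TE, which is one plausible reading of the 80% criterion.</p>

```python
def te_covered_fraction(cnv_start, cnv_end, te_intervals):
    """Fraction of the CNV covered by TE annotations (half-open intervals)."""
    covered = 0
    cursor = cnv_start  # rightmost CNV position already counted
    for te_start, te_end in sorted(te_intervals):
        lo = max(te_start, cursor)
        hi = min(te_end, cnv_end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / (cnv_end - cnv_start)


def passes_te_filter(cnv_start, cnv_end, te_intervals, max_fraction=0.8):
    """Keep the CNV only if less than 80% of it maps to a known TE."""
    return max_fraction > te_covered_fraction(cnv_start, cnv_end, te_intervals)


def polarize(call_type, shared_with_simulans):
    """Return (polarized type, genome carrying the derived state)."""
    if not shared_with_simulans:
        return call_type, "sequenced strain"
    flipped = {"deletion": "insertion", "insertion": "deletion"}
    return flipped[call_type], "reference genome"


print(polarize("deletion", True))
```

<p>The polarization step matters for the mutation spectrum because an apparent deletion that is shared with the outgroup is counted as an insertion in the reference lineage, and vice versa.</p>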
</sec>
<sec><st><p>Evaluating the contribution of nearby sequences to the formation of <it>de novo </it>insertions</p></st>
<p>We used standalone BLAST <abbrgrp><abbr bid="B48">48</abbr></abbrgrp> (ncbi-blast-2.2.25) to identify stretches of high sequence identity between <it>de novo </it>insertions and their neighboring sequences (for the window sizes defined in Results). We generated the random control sequences using BEDTools (shuffleBed; version 2.13.3) <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>.</p>
</sec>
<sec><st><p>Comparison of CNV breakpoints identified in <it>Drosophila </it>and human genomes</p></st>
<p>Table <tblr tid="T1">1</tblr> compares the types of CNV breakpoints identified in <it>Drosophila </it>and in two independent human datasets: one generated by Conrad and colleagues <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> and the other by Kidd and colleagues <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Both surveys of human CNVs included the identification of small tandem repeats (variable number tandem repeats (VNTRs)) and of variants associated with the movement of transposable elements. Because studying these variants was not an aim of this work, Table <tblr tid="T1">1</tblr> only refers to breakpoints of deletions and insertions. We also excluded from the dataset generated by Conrad and colleagues the 13 deletions (out of 315) that were also associated with inversions. Finally, as discussed in the Results section, we only consider microhomology when there are at least 2 bp of identical sequence present at the breakpoint. This required re-classifying the breakpoints identified by Conrad and colleagues, because they required only 1 bp to classify a CNV breakpoint as being associated with microhomology (detailed breakpoint information was made available by the authors as Supplementary Material) <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
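<p>The effect of the 2 bp threshold can be illustrated with a short Python sketch. It is an assumption-laden example, not the procedure used here or by Conrad and colleagues: the function names are hypothetical, coordinates are 0-based and half-open, and microhomology for a deletion is measured in one common way, as the number of identical bases shared between the two breakpoints, counted by extending to the right and to the left of the junction.</p>
<p><![CDATA[
```python
def microhomology_length(ref, start, end):
    """Identical bases shared at the two breakpoints of the deletion ref[start:end].

    Right extension compares the first deleted bases with the bases that
    follow the deletion; left extension compares the bases that precede the
    deletion with the last deleted bases.
    """
    right = 0
    while (start + right < end and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1
    left = 0
    while (start - left - 1 >= 0 and end - left - 1 >= start
           and ref[start - left - 1] == ref[end - left - 1]):
        left += 1
    return left + right


def has_microhomology(ref, start, end, min_bp=2):
    """Classify a breakpoint as microhomology-associated under a length threshold."""
    return microhomology_length(ref, start, end) >= min_bp
```
]]></p>
<p>Under this sketch, a deletion whose breakpoints share a single identical base counts as microhomology-associated with min_bp=1 but not with the default min_bp=2, which is the kind of re-classification described above.</p>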
</sec>
<sec><st><p>Evaluating the roles of NAHR and SSA</p></st>
<p>To test for sequence identity shared between regions within CNV coordinates and flanking DNA, three sequence databases were generated for both the 454 and microarray data, using the reference <it>Drosophila </it>genome (version 5.27): (1) CNV sequence; (2) 5' flanking sequence; (3) 3' flanking sequence. The 454 data provide precise CNV breakpoints and, based on these coordinates, we extracted 200 bp 5' and 3' of the CNV. The microarray data do not provide exact breakpoints, and for these data we defined the distal ends of the flanking sequences to be 500 bp 5' or 3' of the CNV coordinates. The proximal coordinates of the flanking sequences were set to extend 25% of the length of the CNV 3' of the start of the CNV, or 25% of the length of the CNV 5' of the end of the CNV. BLAT <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> (blatSuite.34) was used to search for sequence identity between: (1) 5' flanking sequence and the DNA within the CNV coordinates; (2) 3' flanking sequence and the DNA within the CNV coordinates; and (3) 5' flanking sequence and 3' flanking sequence. The data were filtered to return two datasets for each of these searches. The first filter was set to accept stretches of &#8805;30 bp that possessed &#8805;98% sequence identity; the second was set to accept stretches of &#8805;200 bp that possessed &#8805;95% sequence identity. The microarray results were further filtered to remove all 'self-hits' that resulted from the flanking sequences overlapping the CNV coordinates. Fasta files were generated for all sequences meeting the above criteria and were screened for repetitive sequences using RepeatMasker <abbrgrp><abbr bid="B50">50</abbr></abbrgrp> (settings: abblast search engine, default speed/sensitivity, <it>D. melanogaster </it>annotations).</p>
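<p>As a minimal sketch of the coordinate definitions and identity filters described above, the following Python example may help. It is illustrative only: the function names and the FILTERS dictionary are hypothetical, coordinates are assumed to be 0-based and half-open, clipping at chromosome ends is only handled at position 0, and the removal of 'self-hits' is not shown.</p>
<p><![CDATA[
```python
# The two hit filters from the text: (minimum length in bp, minimum % identity).
FILTERS = {"short": (30, 98.0), "long": (200, 95.0)}


def flanks_454(start, end, flank=200):
    """5' and 3' flanks for a CNV with precise breakpoints (454 data)."""
    return (max(0, start - flank), start), (end, end + flank)


def flanks_microarray(start, end, outer=500, inner_frac=0.25):
    """5' and 3' flanks for a CNV with imprecise breakpoints (microarray data).

    Each flank reaches `outer` bp outside the CNV and extends into the CNV
    by `inner_frac` of the CNV length.
    """
    inner = int((end - start) * inner_frac)
    five_prime = (max(0, start - outer), start + inner)
    three_prime = (end - inner, end + outer)
    return five_prime, three_prime


def passes_filter(hit_len, pct_identity, min_len, min_pct):
    """True if an alignment is long enough and similar enough."""
    return hit_len >= min_len and pct_identity >= min_pct
```
]]></p>
<p>For example, for a CNV called at positions 1,000 to 3,000 in the microarray data, this sketch gives a 5' flank spanning 500 to 1,500 and a 3' flank spanning 2,500 to 3,500, so each flank overlaps the outer quarter of the CNV.</p>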
</sec>
<sec><st><p>Identification of recurrent and complex CNVs in the tiling array data</p></st>
<p>We randomly selected 26 regions of the <it>D. melanogaster </it>genome that were identified by Emerson and colleagues as having deletions and that had been confirmed by PCR <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. We screened these 26 regions in the same 15 natural populations analyzed by Emerson and colleagues. We identified recurrent CNVs by the presence of bands of different sizes (generated using the same pairs of primers) in different populations. We then characterized these bands by Sanger sequencing.</p>
<p>All statistical analyses were done using the statistical package R <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and the application RStudio.</p>
</sec>
<sec><st><p>Data availability</p></st>
<p>The Sanger sequences of the breakpoints of simple and complex CNVs initially identified using the tiling array data have been deposited in GenBank (<ext-link ext-link-id="KC138560" ext-link-type="gen">KC138560</ext-link>-<ext-link ext-link-id="KC138678" ext-link-type="gen">KC138678</ext-link>).</p>
</sec>
</sec>
<sec><st><p>List of abbreviations</p></st>
<p>CNVs: Copy number variants; FoSTeS: Fork stalling and template switching; HR: Homologous recombination; LCRs: Low-copy repeats; MMBIR: Microhomology-mediated break-induced replication; MMEJ: Microhomology-mediated end-joining; NAHR: Non-allelic homologous-recombination; NHEJ: Non-homologous end-joining; SRS: Serial replication slippage; SSA: Single-strand annealing; VNTRs: Variable number of tandem repeats.</p>
</sec>
<sec><st><p>Competing interests</p></st>
<p>The authors confirm that they have no competing interests in the conduct of this research or preparation of this paper.</p>
</sec>
<sec><st><p>Authors' contributions</p></st>
<p>All authors read and approved the final manuscript. MC-M, JRA, and AGC designed the study. MC-M carried out most of the analyses with contributions from JRA. MC-M wrote the paper with contributions from JRA and AGC.</p>
</sec>
</bdy>
<bm>
<ack>
<sec><st><p>Acknowledgements</p></st>
<p>MC-M was supported by a postdoctoral fellowship from the Portuguese Foundation for Science and Technology (co-financed by POPH/FSE) and this work was supported in part by NIH grants R01 HG 003229 and R01 AI 064950 to AGC.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>Copy number variation in human health, disease, and evolution.</p></title><aug><au><snm>Zhang</snm><fnm>F</fnm></au><au><snm>Gu</snm><fnm>W</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au><au><snm>Lupski</snm><fnm>JR</fnm></au></aug><source>Annu Rev Genomics Hum Genet</source><pubdate>2009</pubdate><volume>10</volume><fpage>451</fpage><lpage>481</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1146/annurev.genom.9.081307.164217</pubid><pubid idtype="pmpid" link="fulltext">19715442</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>CNVs: harbingers of a rare variant revolution in psychiatric genetics.</p></title><aug><au><snm>Malhotra</snm><fnm>D</fnm></au><au><snm>Sebat</snm><fnm>J</fnm></au></aug><source>Cell</source><pubdate>2012</pubdate><volume>148</volume><fpage>1223</fpage><lpage>1241</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.cell.2012.02.039</pubid><pubid idtype="pmpid" link="fulltext">22424231</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Diet and the evolution of human amylase gene copy number variation.</p></title><aug><au><snm>Perry</snm><fnm>GH</fnm></au><au><snm>Dominy</snm><fnm>NJ</fnm></au><au><snm>Claw</snm><fnm>KG</fnm></au><au><snm>Lee</snm><fnm>AS</fnm></au><au><snm>Fiegler</snm><fnm>H</fnm></au><au><snm>Redon</snm><fnm>R</fnm></au><au><snm>Werner</snm><fnm>J</fnm></au><au><snm>Villanea</snm><fnm>FA</fnm></au><au><snm>Mountain</snm><fnm>JL</fnm></au><au><snm>Misra</snm><fnm>R</fnm></au><au><snm>Carter</snm><fnm>NP</fnm></au><au><snm>Lee</snm><fnm>C</fnm></au><au><snm>Stone</snm><fnm>AC</fnm></au></aug><source>Nat Genet</source><pubdate>2007</pubdate><volume>39</volume><fpage>1256</fpage><lpage>1260</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng2123</pubid><pubid idtype="pmcid">2377015</pubid><pubid idtype="pmpid" link="fulltext">17828263</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>The origin and evolution of new 
genes.</p></title><aug><au><snm>Cardoso-Moreira</snm><fnm>M</fnm></au><au><snm>Long</snm><fnm>M</fnm></au></aug><source>Methods Mol Biol</source><pubdate>2012</pubdate><volume>856</volume><fpage>161</fpage><lpage>186</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1007/978-1-61779-585-5_7</pubid><pubid idtype="pmpid" link="fulltext">22399459</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster.</p></title><aug><au><snm>Emerson</snm><fnm>JJ</fnm></au><au><snm>Cardoso-Moreira</snm><fnm>M</fnm></au><au><snm>Borevitz</snm><fnm>JO</fnm></au><au><snm>Long</snm><fnm>M</fnm></au></aug><source>Science</source><pubdate>2008</pubdate><volume>320</volume><fpage>1629</fpage><lpage>1631</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.1158078</pubid><pubid idtype="pmpid" link="fulltext">18535209</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Origins and functional impact of copy number variation in the human genome.</p></title><aug><au><snm>Conrad</snm><fnm>DF</fnm></au><au><snm>Pinto</snm><fnm>D</fnm></au><au><snm>Redon</snm><fnm>R</fnm></au><au><snm>Feuk</snm><fnm>L</fnm></au><au><snm>Gokcumen</snm><fnm>O</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Aerts</snm><fnm>J</fnm></au><au><snm>Andrews</snm><fnm>TD</fnm></au><au><snm>Barnes</snm><fnm>C</fnm></au><au><snm>Campbell</snm><fnm>P</fnm></au><au><snm>Fitzgerald</snm><fnm>T</fnm></au><au><snm>Hu</snm><fnm>M</fnm></au><au><snm>Ihm</snm><fnm>CH</fnm></au><au><snm>Kristiansson</snm><fnm>K</fnm></au><au><snm>Macarthur</snm><fnm>DG</fnm></au><au><snm>Macdonald</snm><fnm>JR</fnm></au><au><snm>Onyiah</snm><fnm>I</fnm></au><au><snm>Pang</snm><fnm>AW</fnm></au><au><snm>Robson</snm><fnm>S</fnm></au><au><snm>Stirrups</snm><fnm>K</fnm></au><au><snm>Valsesia</snm><fnm>A</fnm></au><au><snm>Walter</snm><fnm>K</fnm></au><au><snm>Wei</snm><fnm>J</fnm></au><au><cnm>Wellcome Trust Case Control 
Consortium</cnm></au><au><snm>Tyler-Smith</snm><fnm>C</fnm></au><au><snm>Carter</snm><fnm>NP</fnm></au><au><snm>Lee</snm><fnm>C</fnm></au><au><snm>Scherer</snm><fnm>SW</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au></aug><source>Nature</source><pubdate>2010</pubdate><volume>464</volume><fpage>704</fpage><lpage>712</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature08516</pubid><pubid idtype="pmcid">3330748</pubid><pubid idtype="pmpid" link="fulltext">19812545</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Hotspots for copy number variation in chimpanzees and humans.</p></title><aug><au><snm>Perry</snm><fnm>GH</fnm></au><au><snm>Tchinda</snm><fnm>J</fnm></au><au><snm>McGrath</snm><fnm>SD</fnm></au><au><snm>Zhang</snm><fnm>J</fnm></au><au><snm>Picker</snm><fnm>SR</fnm></au><au><snm>C&#225;ceres</snm><fnm>AM</fnm></au><au><snm>Iafrate</snm><fnm>AJ</fnm></au><au><snm>Tyler-Smith</snm><fnm>C</fnm></au><au><snm>Scherer</snm><fnm>SW</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au><au><snm>Stone</snm><fnm>AC</fnm></au><au><snm>Lee</snm><fnm>C</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2006</pubdate><volume>103</volume><fpage>8006</fpage><lpage>8011</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0602318103</pubid><pubid idtype="pmcid">1472420</pubid><pubid idtype="pmpid" link="fulltext">16702545</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Mutational and selective effects on copy-number variants in the human genome.</p></title><aug><au><snm>Cooper</snm><fnm>GM</fnm></au><au><snm>Nickerson</snm><fnm>DA</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au></aug><source>Nat Genet</source><pubdate>2007</pubdate><volume>39</volume><fpage>S22</fpage><lpage>29</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng2054</pubid><pubid idtype="pmpid" link="fulltext">17597777</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Mutational bias shaping fly copy number variation: implications for genome 
evolution.</p></title><aug><au><snm>Cardoso-Moreira</snm><fnm>MM</fnm></au><au><snm>Long</snm><fnm>M</fnm></au></aug><source>Trends Genet</source><pubdate>2010</pubdate><volume>26</volume><fpage>243</fpage><lpage>247</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.tig.2010.03.002</pubid><pubid idtype="pmcid">2878862</pubid><pubid idtype="pmpid" link="fulltext">20416969</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>Drosophila duplication hotspots are associated with late-replicating regions of the genome.</p></title><aug><au><snm>Cardoso-Moreira</snm><fnm>M</fnm></au><au><snm>Emerson</snm><fnm>JJ</fnm></au><au><snm>Clark</snm><fnm>AG</fnm></au><au><snm>Long</snm><fnm>M</fnm></au></aug><source>PLoS Genet</source><pubdate>2011</pubdate><volume>7</volume><fpage>e1002340</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1002340</pubid><pubid idtype="pmcid">3207856</pubid><pubid idtype="pmpid" link="fulltext">22072977</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Mechanisms for human genomic rearrangements.</p></title><aug><au><snm>Gu</snm><fnm>W</fnm></au><au><snm>Zhang</snm><fnm>F</fnm></au><au><snm>Lupski</snm><fnm>JR</fnm></au></aug><source>Pathogenetics</source><pubdate>2008</pubdate><volume>1</volume><fpage>4</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1755-8417-1-4</pubid><pubid idtype="pmcid">2583991</pubid><pubid idtype="pmpid" link="fulltext">19014668</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Mechanisms of change in gene copy number.</p></title><aug><au><snm>Hastings</snm><fnm>PJ</fnm></au><au><snm>Lupski</snm><fnm>JR</fnm></au><au><snm>Rosenberg</snm><fnm>SM</fnm></au><au><snm>Ira</snm><fnm>G</fnm></au></aug><source>Nat Rev Genet</source><pubdate>2009</pubdate><volume>10</volume><fpage>551</fpage><lpage>564</lpage><xrefbib><pubidlist><pubid idtype="pmcid">2864001</pubid><pubid idtype="pmpid" link="fulltext">19597530</pubid></pubidlist></xrefbib></bibl><bibl 
id="B13"><title><p>MMEJ repair of double-strand breaks (director's cut): deleted sequences and alternative endings.</p></title><aug><au><snm>McVey</snm><fnm>M</fnm></au><au><snm>Lee</snm><fnm>SE</fnm></au></aug><source>Trends Genet</source><pubdate>2008</pubdate><volume>24</volume><fpage>529</fpage><lpage>538</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.tig.2008.08.007</pubid><pubid idtype="pmpid" link="fulltext">18809224</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Complex human chromosomal and genomic rearrangements.</p></title><aug><au><snm>Zhang</snm><fnm>F</fnm></au><au><snm>Carvalho</snm><fnm>CM</fnm></au><au><snm>Lupski</snm><fnm>JR</fnm></au></aug><source>Trends Genet</source><pubdate>2009</pubdate><volume>25</volume><fpage>298</fpage><lpage>307</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.tig.2009.05.005</pubid><pubid idtype="pmpid" link="fulltext">19560228</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Characterizing complex structural variation in germline and somatic genomes.</p></title><aug><au><snm>Quinlan</snm><fnm>AR</fnm></au><au><snm>Hall</snm><fnm>IM</fnm></au></aug><source>Trends Genet</source><pubdate>2012</pubdate><volume>28</volume><fpage>43</fpage><lpage>53</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.tig.2011.10.002</pubid><pubid idtype="pmcid">3249479</pubid><pubid idtype="pmpid" link="fulltext">22094265</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders.</p></title><aug><au><snm>Lee</snm><fnm>JA</fnm></au><au><snm>Carvalho</snm><fnm>CM</fnm></au><au><snm>Lupski</snm><fnm>JR</fnm></au></aug><source>Cell</source><pubdate>2007</pubdate><volume>131</volume><fpage>1235</fpage><lpage>1247</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.cell.2007.11.037</pubid><pubid idtype="pmpid" link="fulltext">18160035</pubid></pubidlist></xrefbib></bibl><bibl 
id="B17"><title><p>A microhomology-mediated break-induced replication model for the origin of human copy number variation.</p></title><aug><au><snm>Hastings</snm><fnm>PJ</fnm></au><au><snm>Ira</snm><fnm>G</fnm></au><au><snm>Lupski</snm><fnm>JR</fnm></au></aug><source>PLoS Genet</source><pubdate>2009</pubdate><volume>5</volume><fpage>e1000327</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1000327</pubid><pubid idtype="pmcid">2621351</pubid><pubid idtype="pmpid" link="fulltext">19180184</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity.</p></title><aug><au><snm>Chuzhanova</snm><fnm>NA</fnm></au><au><snm>Anassis</snm><fnm>EJ</fnm></au><au><snm>Ball</snm><fnm>EV</fnm></au><au><snm>Krawczak</snm><fnm>M</fnm></au><au><snm>Cooper</snm><fnm>DN</fnm></au></aug><source>Hum Mutat</source><pubdate>2003</pubdate><volume>21</volume><fpage>28</fpage><lpage>44</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/humu.10146</pubid><pubid idtype="pmpid" link="fulltext">12497629</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Mutation spectrum revealed by breakpoint sequencing of human germline CNVs.</p></title><aug><au><snm>Conrad</snm><fnm>DF</fnm></au><au><snm>Bird</snm><fnm>C</fnm></au><au><snm>Blackburne</snm><fnm>B</fnm></au><au><snm>Lindsay</snm><fnm>S</fnm></au><au><snm>Mamanova</snm><fnm>L</fnm></au><au><snm>Lee</snm><fnm>C</fnm></au><au><snm>Turner</snm><fnm>DJ</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au></aug><source>Nat Genet</source><pubdate>2010</pubdate><volume>42</volume><fpage>385</fpage><lpage>391</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng.564</pubid><pubid idtype="pmcid">3428939</pubid><pubid idtype="pmpid" link="fulltext">20364136</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>A human genome structural variation sequencing resource reveals insights into 
mutational mechanisms.</p></title><aug><au><snm>Kidd</snm><fnm>JM</fnm></au><au><snm>Graves</snm><fnm>T</fnm></au><au><snm>Newman</snm><fnm>TL</fnm></au><au><snm>Fulton</snm><fnm>R</fnm></au><au><snm>Hayden</snm><fnm>HS</fnm></au><au><snm>Malig</snm><fnm>M</fnm></au><au><snm>Kallicki</snm><fnm>J</fnm></au><au><snm>Kaul</snm><fnm>R</fnm></au><au><snm>Wilson</snm><fnm>RK</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au></aug><source>Cell</source><pubdate>2010</pubdate><volume>143</volume><fpage>837</fpage><lpage>847</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.cell.2010.10.027</pubid><pubid idtype="pmcid">3026629</pubid><pubid idtype="pmpid" link="fulltext">21111241</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Mapping copy number variation by population-scale genome sequencing.</p></title><aug><au><snm>Mills</snm><fnm>RE</fnm></au><au><snm>Walter</snm><fnm>K</fnm></au><au><snm>Stewart</snm><fnm>C</fnm></au><au><snm>Handsaker</snm><fnm>RE</fnm></au><au><snm>Chen</snm><fnm>K</fnm></au><au><snm>Alkan</snm><fnm>C</fnm></au><au><snm>Abyzov</snm><fnm>A</fnm></au><au><snm>Yoon</snm><fnm>SC</fnm></au><au><snm>Ye</snm><fnm>K</fnm></au><au><snm>Cheetham</snm><fnm>RK</fnm></au><au><snm>Chinwalla</snm><fnm>A</fnm></au><au><snm>Conrad</snm><fnm>DF</fnm></au><au><snm>Fu</snm><fnm>Y</fnm></au><au><snm>Grubert</snm><fnm>F</fnm></au><au><snm>Hajirasouliha</snm><fnm>I</fnm></au><au><snm>Hormozdiari</snm><fnm>F</fnm></au><au><snm>Iakoucheva</snm><fnm>LM</fnm></au><au><snm>Iqbal</snm><fnm>Z</fnm></au><au><snm>Kang</snm><fnm>S</fnm></au><au><snm>Kidd</snm><fnm>JM</fnm></au><au><snm>Konkel</snm><fnm>MK</fnm></au><au><snm>Korn</snm><fnm>J</fnm></au><au><snm>Khurana</snm><fnm>E</fnm></au><au><snm>Kural</snm><fnm>D</fnm></au><au><snm>Lam</snm><fnm>HY</fnm></au><au><snm>Leng</snm><fnm>J</fnm></au><au><snm>Li</snm><fnm>R</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Lin</snm><fnm>CY</fnm></au><au><snm>Luo</snm><fnm>R</fnm></au><au><cnm>1000 Genomes 
Project</cnm></au><etal/></aug><source>Nature</source><pubdate>2011</pubdate><volume>470</volume><fpage>59</fpage><lpage>65</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature09708</pubid><pubid idtype="pmcid">3077050</pubid><pubid idtype="pmpid" link="fulltext">21293372</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library.</p></title><aug><au><snm>Lam</snm><fnm>HY</fnm></au><au><snm>Mu</snm><fnm>XJ</fnm></au><au><snm>St&#252;tz</snm><fnm>AM</fnm></au><au><snm>Tanzer</snm><fnm>A</fnm></au><au><snm>Cayting</snm><fnm>PD</fnm></au><au><snm>Snyder</snm><fnm>M</fnm></au><au><snm>Kim</snm><fnm>PM</fnm></au><au><snm>Korbel</snm><fnm>JO</fnm></au><au><snm>Gerstein</snm><fnm>MB</fnm></au></aug><source>Nat Biotechnol</source><pubdate>2010</pubdate><volume>28</volume><fpage>47</fpage><lpage>55</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nbt.1600</pubid><pubid idtype="pmcid">2951730</pubid><pubid idtype="pmpid" link="fulltext">20037582</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome.</p></title><aug><au><snm>Quinlan</snm><fnm>AR</fnm></au><au><snm>Clark</snm><fnm>RA</fnm></au><au><snm>Sokolova</snm><fnm>S</fnm></au><au><snm>Leibowitz</snm><fnm>ML</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Hurles</snm><fnm>ME</fnm></au><au><snm>Mell</snm><fnm>JC</fnm></au><au><snm>Hall</snm><fnm>IM</fnm></au></aug><source>Genome Res</source><pubdate>2010</pubdate><volume>20</volume><fpage>623</fpage><lpage>635</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.102970.109</pubid><pubid idtype="pmcid">2860164</pubid><pubid idtype="pmpid" link="fulltext">20308636</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>A portrait of copy-number polymorphism in Drosophila 
melanogaster.</p></title><aug><au><snm>Dopman</snm><fnm>EB</fnm></au><au><snm>Hartl</snm><fnm>DL</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2007</pubdate><volume>104</volume><fpage>19920</fpage><lpage>19925</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0709888104</pubid><pubid idtype="pmcid">2148398</pubid><pubid idtype="pmpid" link="fulltext">18056801</pubid></pubidlist></xrefbib></bibl><bibl id="B25"><title><p>Validation of rearrangement break points identified by paired-end sequencing in natural populations of Drosophila melanogaster.</p></title><aug><au><snm>Cridland</snm><fnm>JM</fnm></au><au><snm>Thornton</snm><fnm>KR</fnm></au></aug><source>Genome Biol Evol</source><pubdate>2010</pubdate><volume>2</volume><fpage>83</fpage><lpage>101</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/gbe/evq001</pubid><pubid idtype="pmcid">2839345</pubid><pubid idtype="pmpid" link="fulltext">20333226</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>A model of segmental duplication formation in Drosophila melanogaster.</p></title><aug><au><snm>Fiston-Lavier</snm><fnm>AS</fnm></au><au><snm>Anxolab&#233;h&#232;re</snm><fnm>D</fnm></au><au><snm>Quesneville</snm><fnm>H</fnm></au></aug><source>Genome Res</source><pubdate>2007</pubdate><volume>17</volume><fpage>1458</fpage><lpage>1470</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.6208307</pubid><pubid idtype="pmcid">1987339</pubid><pubid idtype="pmpid" link="fulltext">17726166</pubid></pubidlist></xrefbib></bibl><bibl id="B27"><title><p>Recurrent insertion and duplication generate networks of transposable element sequences in the Drosophila melanogaster genome.</p></title><aug><au><snm>Bergman</snm><fnm>CM</fnm></au><au><snm>Quesneville</snm><fnm>H</fnm></au><au><snm>Anxolab&#233;h&#232;re</snm><fnm>D</fnm></au><au><snm>Ashburner</snm><fnm>M</fnm></au></aug><source>Genome Biol</source><pubdate>2006</pubdate><volume>7</volume><fpage>R112</fpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1186/gb-2006-7-11-r112</pubid><pubid idtype="pmcid">1794594</pubid><pubid idtype="pmpid" link="fulltext">17134480</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>Population genomic inferences from sparse high-throughput sequencing of two populations of Drosophila melanogaster.</p></title><aug><au><snm>Sackton</snm><fnm>TB</fnm></au><au><snm>Kulathinal</snm><fnm>RJ</fnm></au><au><snm>Bergman</snm><fnm>CM</fnm></au><au><snm>Quinlan</snm><fnm>AR</fnm></au><au><snm>Dopman</snm><fnm>EB</fnm></au><au><snm>Carneiro</snm><fnm>M</fnm></au><au><snm>Marth</snm><fnm>GT</fnm></au><au><snm>Hartl</snm><fnm>DL</fnm></au><au><snm>Clark</snm><fnm>AG</fnm></au></aug><source>Genome Biol Evol</source><pubdate>2009</pubdate><volume>1</volume><fpage>449</fpage><lpage>465</lpage><xrefbib><pubidlist><pubid idtype="pmcid">2839279</pubid><pubid idtype="pmpid" link="fulltext">20333214</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Mosaik Aligner</p></title><url>http://code.google.com/p/mosaik-aligner/</url></bibl><bibl id="B30"><title><p>BLAT - The BLAST-like alignment tool.</p></title><aug><au><snm>Kent</snm><fnm>WJ</fnm></au></aug><source>Genome Res</source><pubdate>2002</pubdate><volume>12</volume><fpage>656</fpage><lpage>664</lpage></bibl><bibl id="B31"><title><p>ClustalW and ClustalX version 
2.</p></title><aug><au><snm>Larkin</snm><fnm>MA</fnm></au><au><snm>Blackshields</snm><fnm>G</fnm></au><au><snm>Brown</snm><fnm>NP</fnm></au><au><snm>Chenna</snm><fnm>R</fnm></au><au><snm>McGettigan</snm><fnm>PA</fnm></au><au><snm>McWilliam</snm><fnm>H</fnm></au><au><snm>Valentin</snm><fnm>F</fnm></au><au><snm>Wallace</snm><fnm>IM</fnm></au><au><snm>Wilm</snm><fnm>A</fnm></au><au><snm>Lopez</snm><fnm>R</fnm></au><au><snm>Thompson</snm><fnm>JD</fnm></au><au><snm>Gibson</snm><fnm>TJ</fnm></au><au><snm>Higgins</snm><fnm>DG</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><fpage>2947</fpage><lpage>2948</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btm404</pubid><pubid idtype="pmpid" link="fulltext">17846036</pubid></pubidlist></xrefbib></bibl><bibl id="B32"><title><p>A new bioinformatics analysis tools framework at EMBL-EBI.</p></title><aug><au><snm>Goujon</snm><fnm>M</fnm></au><au><snm>McWilliam</snm><fnm>H</fnm></au><au><snm>Li</snm><fnm>W</fnm></au><au><snm>Valentin</snm><fnm>F</fnm></au><au><snm>Squizzato</snm><fnm>S</fnm></au><au><snm>Paern</snm><fnm>J</fnm></au><au><snm>Lopez</snm><fnm>R</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2010</pubdate><volume>38</volume><fpage>W695</fpage><lpage>699</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkq313</pubid><pubid idtype="pmcid">2896090</pubid><pubid idtype="pmpid" link="fulltext">20439314</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>Gamma-irradiation stimulates homology-directed DNA double-strand break repair in Drosophila embryo.</p></title><aug><au><snm>Ducau</snm><fnm>J</fnm></au><au><snm>Bregliano</snm><fnm>JC</fnm></au><au><snm>de La Roche Saint-Andr&#233;</snm><fnm>C</fnm></au></aug><source>Mutat Res</source><pubdate>2000</pubdate><volume>460</volume><fpage>69</fpage><lpage>80</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0921-8777(00)00017-3</pubid><pubid idtype="pmpid" 
link="fulltext">10856836</pubid></pubidlist></xrefbib></bibl><bibl id="B34"><title><p>The homologous chromosome is an effective template for the repair of mitotic DNA double-strand breaks in Drosophila.</p></title><aug><au><snm>Rong</snm><fnm>YS</fnm></au><au><snm>Golic</snm><fnm>KG</fnm></au></aug><source>Genetics</source><pubdate>2003</pubdate><volume>165</volume><fpage>1831</fpage><lpage>1842</lpage><xrefbib><pubidlist><pubid idtype="pmcid">1462885</pubid><pubid idtype="pmpid" link="fulltext">14704169</pubid></pubidlist></xrefbib></bibl><bibl id="B35"><title><p>Non-homologous DNA end joining in plant cells is associated with deletions and filler DNA insertions.</p></title><aug><au><snm>Gorbunova</snm><fnm>V</fnm></au><au><snm>Levy</snm><fnm>AA</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>1997</pubdate><volume>25</volume><fpage>4650</fpage><lpage>4657</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/25.22.4650</pubid><pubid idtype="pmcid">147090</pubid><pubid idtype="pmpid" link="fulltext">9358178</pubid></pubidlist></xrefbib></bibl><bibl id="B36"><title><p>The majority of recent short DNA insertions in the human genome are tandem duplications.</p></title><aug><au><snm>Messer</snm><fnm>PW</fnm></au><au><snm>Arndt</snm><fnm>PF</fnm></au></aug><source>Mol Biol Evol</source><pubdate>2007</pubdate><volume>24</volume><fpage>1190</fpage><lpage>1197</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/molbev/msm035</pubid><pubid idtype="pmpid" link="fulltext">17322553</pubid></pubidlist></xrefbib></bibl><bibl id="B37"><title><p>Efficient repair of DNA breaks in Drosophila: evidence for single-strand annealing and competition with other repair pathways.</p></title><aug><au><snm>Preston</snm><fnm>CR</fnm></au><au><snm>Engels</snm><fnm>W</fnm></au><au><snm>Flores</snm><fnm>C</fnm></au></aug><source>Genetics</source><pubdate>2002</pubdate><volume>161</volume><fpage>711</fpage><lpage>720</lpage><xrefbib><pubidlist><pubid 
idtype="pmcid">1462149</pubid><pubid idtype="pmpid" link="fulltext">12072467</pubid></pubidlist></xrefbib></bibl><bibl id="B38"><title><p>On the sequence-directed nature of human gene mutation: the role of genomic architecture and the local DNA sequence environment in mediating gene mutations underlying human inherited disease.</p></title><aug><au><snm>Cooper</snm><fnm>DN</fnm></au><au><snm>Bacolla</snm><fnm>A</fnm></au><au><snm>F&#233;rec</snm><fnm>C</fnm></au><au><snm>Vasquez</snm><fnm>KM</fnm></au><au><snm>Kehrer-Sawatzki</snm><fnm>H</fnm></au><au><snm>Chen</snm><fnm>JM</fnm></au></aug><source>Hum Mutat</source><pubdate>2011</pubdate><volume>32</volume><fpage>1075</fpage><lpage>1099</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/humu.21557</pubid><pubid idtype="pmcid">3177966</pubid><pubid idtype="pmpid" link="fulltext">21853507</pubid></pubidlist></xrefbib></bibl><bibl id="B39"><title><p>Non-B DNA structure-induced genetic instability.</p></title><aug><au><snm>Wang</snm><fnm>G</fnm></au><au><snm>Vasquez</snm><fnm>KM</fnm></au></aug><source>Mutat Res</source><pubdate>2006</pubdate><volume>598</volume><fpage>103</fpage><lpage>119</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.mrfmmm.2006.01.019</pubid><pubid idtype="pmpid" link="fulltext">16516932</pubid></pubidlist></xrefbib></bibl><bibl id="B40"><title><p>Non-B DB: a database of predicted non-B DNA-forming motifs in mammalian genomes.</p></title><aug><au><snm>Cer</snm><fnm>RZ</fnm></au><au><snm>Bruce</snm><fnm>KH</fnm></au><au><snm>Mudunuri</snm><fnm>US</fnm></au><au><snm>Yi</snm><fnm>M</fnm></au><au><snm>Volfovsky</snm><fnm>N</fnm></au><au><snm>Luke</snm><fnm>BT</fnm></au><au><snm>Bacolla</snm><fnm>A</fnm></au><au><snm>Collins</snm><fnm>JR</fnm></au><au><snm>Stephens</snm><fnm>RM</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2011</pubdate><volume>39</volume><fpage>D383</fpage><lpage>391</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkq1170</pubid><pubid 
idtype="pmcid">3013731</pubid><pubid idtype="pmpid" link="fulltext">21097885</pubid></pubidlist></xrefbib></bibl><bibl id="B41"><title><p>De novo CNV formation in mouse embryonic stem cells occurs in the absence of Xrcc4-dependent nonhomologous end joining.</p></title><aug><au><snm>Arlt</snm><fnm>MF</fnm></au><au><snm>Rajendran</snm><fnm>S</fnm></au><au><snm>Birkeland</snm><fnm>SR</fnm></au><au><snm>Wilson</snm><fnm>TE</fnm></au><au><snm>Glover</snm><fnm>TW</fnm></au></aug><source>PLoS Genet</source><pubdate>2012</pubdate><volume>8</volume><fpage>e1002981</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1002981</pubid><pubid idtype="pmcid">3447954</pubid><pubid idtype="pmpid" link="fulltext">23028374</pubid></pubidlist></xrefbib></bibl><bibl id="B42"><title><p>Dual roles for DNA polymerase theta in alternative end-joining repair of double-strand breaks in Drosophila.</p></title><aug><au><snm>Chan</snm><fnm>SH</fnm></au><au><snm>Yu</snm><fnm>AM</fnm></au><au><snm>McVey</snm><fnm>M</fnm></au></aug><source>PLoS Genet</source><pubdate>2010</pubdate><volume>6</volume><fpage>e1001005</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1001005</pubid><pubid idtype="pmcid">2895639</pubid><pubid idtype="pmpid" link="fulltext">20617203</pubid></pubidlist></xrefbib></bibl><bibl id="B43"><title><p>Synthesis-dependent microhomology-mediated end joining accounts for multiple types of repair junctions.</p></title><aug><au><snm>Yu</snm><fnm>AM</fnm></au><au><snm>McVey</snm><fnm>M</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2010</pubdate><volume>38</volume><fpage>5706</fpage><lpage>5717</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkq379</pubid><pubid idtype="pmcid">2943611</pubid><pubid idtype="pmpid" link="fulltext">20460465</pubid></pubidlist></xrefbib></bibl><bibl id="B44"><title><p>Challenges in studying genomic structural variant formation mechanisms: the short-read dilemma and 
beyond.</p></title><aug><au><snm>Onishi-Seebacher</snm><fnm>M</fnm></au><au><snm>Korbel</snm><fnm>JO</fnm></au></aug><source>Bioessays</source><pubdate>2011</pubdate><volume>33</volume><fpage>840</fpage><lpage>850</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/bies.201100075</pubid><pubid idtype="pmpid" link="fulltext">21959584</pubid></pubidlist></xrefbib></bibl><bibl id="B45"><title><p>An initial map of insertion and deletion (INDEL) variation in the human genome.</p></title><aug><au><snm>Mills</snm><fnm>RE</fnm></au><au><snm>Luttig</snm><fnm>CT</fnm></au><au><snm>Larkins</snm><fnm>CE</fnm></au><au><snm>Beauchamp</snm><fnm>A</fnm></au><au><snm>Tsui</snm><fnm>C</fnm></au><au><snm>Pittard</snm><fnm>WS</fnm></au><au><snm>Devine</snm><fnm>SE</fnm></au></aug><source>Genome Res</source><pubdate>2006</pubdate><volume>16</volume><fpage>1182</fpage><lpage>1190</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.4565806</pubid><pubid idtype="pmcid">1557762</pubid><pubid idtype="pmpid" link="fulltext">16902084</pubid></pubidlist></xrefbib></bibl><bibl id="B46"><title><p>A decade's perspective on DNA sequencing technology.</p></title><aug><au><snm>Mardis</snm><fnm>ER</fnm></au></aug><source>Nature</source><pubdate>2011</pubdate><volume>470</volume><fpage>198</fpage><lpage>203</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature09796</pubid><pubid idtype="pmpid" link="fulltext">21307932</pubid></pubidlist></xrefbib></bibl><bibl id="B47"><title><p>FlyBase 101--the basics of navigating FlyBase.</p></title><aug><au><snm>McQuilton</snm><fnm>P</fnm></au><au><snm>St Pierre</snm><fnm>SE</fnm></au><au><snm>Thurmond</snm><fnm>J</fnm></au><au><cnm>FlyBase Consortium</cnm></au></aug><source>Nucleic Acids Res</source><pubdate>2012</pubdate><volume>40</volume><fpage>D706</fpage><lpage>714</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkr1030</pubid><pubid idtype="pmcid">3245098</pubid><pubid idtype="pmpid" 
link="fulltext">22127867</pubid></pubidlist></xrefbib></bibl><bibl id="B48"><title><p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</p></title><aug><au><snm>Altschul</snm><fnm>SF</fnm></au><au><snm>Madden</snm><fnm>TL</fnm></au><au><snm>Sch&#228;ffer</snm><fnm>AA</fnm></au><au><snm>Zhang</snm><fnm>J</fnm></au><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Miller</snm><fnm>W</fnm></au><au><snm>Lipman</snm><fnm>DJ</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>1997</pubdate><volume>25</volume><fpage>3389</fpage><lpage>3402</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/25.17.3389</pubid><pubid idtype="pmcid">146917</pubid><pubid idtype="pmpid" link="fulltext">9254694</pubid></pubidlist></xrefbib></bibl><bibl id="B49"><title><p>BEDTools: a flexible suite of utilities for comparing genomic features.</p></title><aug><au><snm>Quinlan</snm><fnm>AR</fnm></au><au><snm>Hall</snm><fnm>IM</fnm></au></aug><source>Bioinformatics</source><pubdate>2010</pubdate><volume>26</volume><fpage>841</fpage><lpage>842</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btq033</pubid><pubid idtype="pmcid">2832824</pubid><pubid idtype="pmpid" link="fulltext">20110278</pubid></pubidlist></xrefbib></bibl><bibl id="B50"><title><p>RepeatMasker Open-3.0.</p></title><aug><au><snm>Smit</snm><fnm>AFA</fnm></au><au><snm>Hubley</snm><fnm>R</fnm></au><au><snm>Green</snm><fnm>P</fnm></au></aug><pubdate>1996</pubdate><url>http://www.repeatmasker.org</url></bibl><bibl id="B51"><title><p>R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.</p></title><aug><au><cnm>R Development Core Team</cnm></au></aug><pubdate>2008</pubdate><url>http://www.R-project.org/</url></bibl></refgrp>
</bm>
</art>