Genome Biology

Open Access Research

High resolution discovery and confirmation of copy number variants in 90 Yoruba Nigerians

Hajime Matsuzaki, Pei-Hua Wang, Jing Hu, Rich Rava and Glenn K Fu*

Author Affiliations

Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA 95051, USA

Genome Biology 2009, 10:R125 doi:10.1186/gb-2009-10-11-r125

Published: 9 November 2009

Additional files

Additional data file 1:

Figure S1 is a description of the chip designs. Figure S1A: the sequential 49-mer probes against the genome were dispersed across the three chip designs. Probes with extraneous matches to the genome in the central 16 nucleotides were omitted from the designs. Figure S1B: probes on the CNV-typing design were organized into probe partitions corresponding to putative CNVs from the genome scan (in red), reported CNVs from whole-genome sequencing studies (Levy et al. [18] and Wheeler et al. [19]; in blue), and CNV regions in the DGV (November 2008) overlapping in at least two database records (in green). The five example partitions correspond to regions of varying length, and are represented by up to 50 probes each; regions shorter than 500 bp have fewer probes because probes are spaced at least 10 bp apart. A partition can map to more than one CNV; conversely, a CNV can be represented by one or more partitions.

Figure S2 shows regions with reported CNVs in proximity. Two example regions of width approximately 200 kb (Figure S2A) and approximately 20 kb (Figure S2B) are displayed in Nexus chromosome views (BioDiscovery), along with DGV browser views [51]. The Nexus views show the percentage of Yoruba samples with observed gains and losses in green and red, respectively. DGV records that were paired with putative CNVs are colored with blue stripes. In the first example (Figure S2A), the DGV records with red stripes more closely match the smaller CNVs.

Figure S3 shows cell line artifacts. The initial smoothed segmentation analysis, displayed in Nexus (BioDiscovery) drill-down views, showed disproportionately high gain events across chromosome 12 in Yoruba sample NA19193 (Figure S3A) and chromosome 9 in sample NA19208 (Figure S3B). The green and red bars along the chromosome pictograms mark regions with gains and losses, respectively.
These observations are consistent with previously reported lymphoblastoid cell-line artifacts, namely mosaic duplications, in these samples [12,44].

Figure S4 shows probe GC filtering and correction. Figure S4A: a handful of samples, including NA18870, showed disproportionately high numbers of events in the initial segmentation analysis when using all probes and without GC correction. The plots show log ratios across the range of probe GC content for a random sampling of 20,000 probes in chromosome 20 from the b-chip experiment run on sample NA18870. Before probe filtering and correction, there is a noticeable 'fishtail' of high log ratios corresponding to higher GC content, which manifests in artificially high numbers of gain events. Similarly, in samples with high numbers of loss events, tails of low log ratios were observed. Figure S4B: segmentation analysis results for chromosome 20 from the b-chip experiment run on sample NA18870. The average log ratios of the delineated segments are plotted against their lengths (in log scale). Before the filtering and correction, there is a noticeable bolus of segments with lengths between approximately 500 bp and approximately 5 kb that have average log ratios well above the threshold value of 0.25 for gains. There are also longer segments up to approximately 500 kb that have ratios above the 0.25 threshold. After filtering and correction, however, far fewer segments lie above the 0.25 threshold or below the -0.25 threshold, and the vast majority of segments have average ratios close to 0, indicative of non-events, and lengths close to approximately 200 kb, which is in line with the windows of 750 probes in the segmentation analysis.

Figure S5 is a receiver operating characteristic (ROC) analysis of chromosome X. The ROC curves show the tradeoff between false positives and sensitivity [52], based on comparing chromosome X probes in female and male samples.
Log2 ratios were calculated for all 90 samples at approximately 470,000 chromosome X probes in each chip design, using median signals based only on female samples. Ratios close to 0 are indicative of two copies of chromosome X (non-events), while lower ratios, particularly in male samples, are indicative of one copy (surrogate 'loss' events). Thresholds for the ratios were varied from -3.0 to 3.0 in 0.01 increments, and at each increment the cumulative fraction of probes below the threshold was determined separately for female and male samples. The ROC curves show sensitivity as the fraction of probes in males below the threshold, and the false positives as the log of the fraction of probes in females below the threshold. Consecutive probes were averaged to generate the family of smoothed curves. The ROC curve for the b-chip design is shown in panel A, while panel B shows the result of combining consecutive interdigitated probes from the three chip designs. At the segmentation threshold of -0.25 for loss events, the b-chip had sensitivity of 0.83 and false positives of 0.08 without smoothing (all in panel A), and sensitivity of 0.91 and false positives of 0.008 when smoothing with eight probes (smooth 8 in panel A). The sensitivity was 0.88 and false positives 0.05 when smoothing over three probes combined from the three designs (smooth 3 in panel B). Compared with the b-chip alone, these ROC measures for the combined probes suggested higher performance at the same effective resolution. However, the segmentation with the combined probes resulted in greater variation in the tallies of events in individual samples, compared to probes from each design separately. Because the ROC measures are based on aggregating the entire sample set, subtle variations that manifest at the individual sample level may not be apparent.

Figure S6 is a threshold titration curve. CBS event thresholds were titrated from 0.35 (most stringent) down to 0.10.
Sensitivity (y-axis) represents the proportion of empirically detected events on the CNV-typing array at 1,153 McCarroll CNVs and 6,578 putative CNVs, compared to reference events at the 732 McCarroll reference CNVs and validation PCRs (listed as REF-McCarroll-Sel and pcr-GS in Additional data file 8), respectively. Because of the possibility of false-positive calls in the McCarroll et al. [14] study, and the small sample size of the PCRs, the sensitivity estimates are approximations useful for comparing CBS thresholds, and not absolute measures. The x-axis, 1 - Specificity, represents the proportion of possibly false-positive events called on the CNV-typing array at McCarroll CNVs compared to reported diploid calls at the 234 non-event CNVs from the McCarroll et al. [14] study (listed as REF-NonPoly6papers in Additional data file 8). Any instances of false-negative events missed in the McCarroll et al. [14] study that were actually called in our survey will artificially lower the specificity estimates.

Figure S7 shows plots of event segment Log2 ratios. Log2 ratios of 97,953 event segments at the 6,368 confirmed CNVs were grouped based on rounded segment lengths (Figure S7A), or rounded numbers of probes (log2) in the segments (Figure S7B), and summarized in box-plots for either gain or loss events. Box-plots show medians and interquartile ranges, with whiskers extending to maximum or minimum values within 1.5 times the interquartile range of the 75th or 25th percentiles, respectively. The width of boxes is proportional to the number of events.

Table S3 shows breakpoint mapping. Table S3A: amplicon bands corresponding to 19 loss events at 16 regions were excised from gels and sequenced. Shown are the build 36 reference sequences 50 nucleotides upstream and downstream of the mapped breakpoints of the loss events. Differences from the reference sequence in individual Yoruba samples are in lower case. The actual lengths (len) based on the breakpoints are listed.
At putative CNV locus_ids 3262, 3689, and 5439, the non-event DNA in Yoruba pairs also had actual events at the same breakpoints as in the event DNAs. Table S3B: the 16 regions with successful breakpoint sequences are listed along with the closest matching records in the DGV (March 2009).

Table S5 is a summary of quantitative PCR results at 16 putative CNVs. DNAs were run in pairs with one having an observed gain event, and the other with no event on the genome-scan arrays. Cycle thresholds (Ct) were normalized against GAPDH PCRs, and compared in each DNA pair. In all cases, the event DNA had a lower Ct value. The status of each pair was marked based on differences in normalized Ct values: confirm or maybe (ambiguous). The differences in the Ct values were scaled, such that differences less than or equal to 0.6 are represented by one '+' symbol, and differences greater than 0.6 are represented by a proportionate number of two or more '+' symbols (Scaled_Diff_Ct). At four of the CNVs, the difference in Ct values was dramatic, indicative of homozygous losses in the non-event DNAs, rather than gains in the event DNAs.

Table S6 is a list of references cited in the DGV alongside the methods used in the studies. For the pair-wise comparison shown in Figure 3, methods from the cited references were classified into six categories. The numbers of overlapping confirmed CNVs from our work are also listed.

Table S7 is a comparison of Yoruba event calls among six studies. Reported events were compared in all possible pairs of six recent studies that included one or more Yoruba individuals. Just as in the comparisons shown in Table 2, for each Yoruba in common, events were matched based on the longest overlap, and agreement was determined by comparing loss versus gain events, and not integer copy numbers. The percentage of events that overlapped reflects the relative degree of missed events in either of the studies in the paired comparisons.
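The pairing rule described for Table S7 (match on the longest overlap, then compare gain versus loss rather than integer copy number) can be illustrated with a minimal sketch. The `Event` record, the function names, and the coordinates below are hypothetical and are not taken from the study; the sketch also assumes inclusive coordinates and breaks ties by taking the first candidate.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical event record for one individual."""
    chrom: str
    start: int  # inclusive coordinate (assumed)
    end: int    # inclusive coordinate (assumed)
    kind: str   # "gain" or "loss"


def overlap_length(a: Event, b: Event) -> int:
    """Number of bases shared by two events (0 if they do not overlap)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def best_match(event: Event, candidates: List[Event]) -> Optional[Event]:
    """Candidate with the longest overlap, or None if nothing overlaps."""
    best = max(candidates, key=lambda c: overlap_length(event, c), default=None)
    if best is None or overlap_length(event, best) == 0:
        return None
    return best


def compare_calls(study_a: List[Event], study_b: List[Event]) -> dict:
    """Tally events in study_a that overlap study_b and agree on direction."""
    matched = 0
    agreed = 0
    for event in study_a:
        partner = best_match(event, study_b)
        if partner is None:
            continue  # no overlapping event reported in the other study
        matched += 1
        if event.kind == partner.kind:  # loss versus gain, not copy number
            agreed += 1
    return {"total": len(study_a), "matched": matched, "agreed": agreed}


# Made-up events for one individual in two studies.
calls_a = [Event("1", 100, 200, "loss"),
           Event("1", 500, 600, "gain"),
           Event("2", 10, 20, "loss")]
calls_b = [Event("1", 150, 400, "loss"),
           Event("1", 190, 210, "gain"),
           Event("1", 550, 560, "loss")]
summary = compare_calls(calls_a, calls_b)
```

In this toy input, the first event in `calls_a` overlaps two events in `calls_b` and is paired with the one sharing more bases; the fraction `matched / total` corresponds to the percentage of overlapping events described above.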

Format: PDF Size: 852KB

This file can be viewed with: Adobe Acrobat Reader

Open Data

Additional data file 2:

Each putative CNV identified in the genome scan was assigned a unique identifier (locus_id). CNVs with locus_id numbers starting at 100,000 were from the smoothed segmentation analysis. Chromosome locations are on genome build 36. Confirmed CNVs had at least one Yoruba with an event on the CNV-typing array. For confirmed CNVs that overlapped at least one DGV record (March 2009), the closest matching record (variation_id) is listed along with its build 36 coordinates, length, cited reference, and discovery method. Regions were flagged as 'Complex' if both a loss and gain event were observed in the same individual.
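As an illustration of the 'Complex' flag described above, the following minimal sketch marks a region when any single individual shows both a gain and a loss there. The tuple layout, the function name, and the identifiers in the example are hypothetical; the actual file layout is not reproduced here.

```python
from collections import defaultdict


def flag_complex(events):
    """Return the locus_ids at which one individual has both a gain and a loss.

    `events` is an iterable of (locus_id, dna_id, kind) tuples, where kind is
    "gain" or "loss". This layout is an assumption made for the sketch.
    """
    kinds_seen = defaultdict(set)
    for locus_id, dna_id, kind in events:
        kinds_seen[(locus_id, dna_id)].add(kind)
    return {
        locus_id
        for (locus_id, _dna_id), kinds in kinds_seen.items()
        if {"gain", "loss"} <= kinds  # both directions in the same individual
    }


# Made-up identifiers: locus 1 is complex, locus 2 is not, because its gain
# and loss occur in different individuals.
example = [
    (1, "sample_A", "loss"),
    (1, "sample_A", "gain"),
    (2, "sample_B", "loss"),
    (2, "sample_C", "gain"),
]
complex_loci = flag_complex(example)
```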

Format: PDF Size: 784KB

Additional data file 3:

Gel images correspond to 4% agarose (E-gel), gradient polyacrylamide (PA gel), and 1% agarose (1% gel) electrophoresis gels. DNAs were run in pairs with one having an observed event (Event DNA_ID), and the other without an observed event (non-event DNA_ID). Confirmation calls (Call Lane 1 or 2) were made based on amplicon length differences in each DNA pair, and the status of each pair was marked as confirm, maybe (ambiguous), no (no evidence of event), or fail (PCR did not yield expected amplicons). At a subset of regions, amplicon bands were excised and sequenced (seq). The lengths of the putative CNVs are also listed.

Format: PDF Size: 278KB

Additional data file 4:

(A) Event calls at confirmed CNVs are compared against consensus references from the Wang et al. [15] and McCarroll et al. [14] studies. Calls in red are in disagreement with the reference, and calls in blue are cases of possible false-positive calls not in the reference. Missed gain and loss events are shown as blue and red boxes, respectively. Consensus among the references and agreement with the references were determined by comparing loss versus gain events, and not integer copy numbers. Trio_ids are detailed in (D). (B) Calls reported in the McCarroll et al. [14] study are compared against the consensus reference from our survey and the Wang et al. [15] study. (C) Calls reported in the Wang et al. [15] study are compared against the consensus reference from our survey and the McCarroll et al. [14] study. (D) Yoruba trios were arbitrarily assigned trio_ids. The DNA_ids of the 90 Yoruba are listed with the corresponding trio_ids.
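The categories used in this comparison can be sketched as a small classifier. This is only an illustration: the exact consensus rule is not spelled out here, so the sketch assumes that a consensus exists when the two reference studies report the same direction ("gain", "loss", or "none" for no event). The function names and return labels are hypothetical.

```python
def consensus(ref_a, ref_b):
    """Shared direction when both references agree, else None (no consensus).

    Assumption for this sketch: each reference call is "gain", "loss", or "none".
    """
    return ref_a if ref_a == ref_b else None


def classify(call, reference):
    """Label one call against a consensus reference by direction only."""
    if reference is None:
        return "no consensus"
    if call == reference:
        return "agree"
    if reference == "none":
        return "possible false positive"  # event called, none in the reference
    if call == "none":
        return "missed"                   # reference event not called
    return "disagree"                     # gain versus loss conflict


labels = [
    classify("loss", consensus("loss", "loss")),
    classify("gain", consensus("none", "none")),
    classify("none", consensus("gain", "gain")),
    classify("gain", consensus("loss", "loss")),
    classify("gain", consensus("loss", "gain")),
]
```

Because only the direction is compared, a reported copy number of 0 and a reported copy number of 1 would both count as a loss and therefore agree.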

Format: PDF Size: 84KB

Additional data file 5:

Primer sequences, along with sizes of the expected amplicons.

Format: PDF Size: 39KB

Additional data file 6:

Log2 ratios of the event segments are also listed, along with event coordinates on genome build 36.

Format: TXT Size: 6.6MB

Additional data file 7:

Observed events on the CNV-typing array in the 90 Yoruba at 1,153 CNVs reported in the McCarroll et al. [14] study (listed as chp-McCarroll2008) and at regions from the Levy et al. [18] and Wheeler et al. [19] studies as summarized in Table 4 (listed as chp-LevyWheel, chp-LevyOnly, and chp-WheelerOnly).

Format: TXT Size: 5.1MB

Additional data file 8:

When available, event calls were listed as integer copy numbers from 0 to 4; reported copy numbers > 4 were listed as 4, and no-calls were listed as -1. In papers that reported only loss (deletion) or gain, the calls were listed as 1 or 3, respectively. For papers with genome positions in build 35, the liftOver utility at UCSC [53] was used to map coordinates to build 36. Also listed are diploid calls in Yoruba from the McCarroll et al. [14] study (listed as REF-NonPoly6papers), and event calls based on PCR (listed as pcr-GS).
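The encoding rules above can be written as a short sketch. The function names, the accepted text labels, and the use of `None` for a no-call are assumptions made for illustration. The `direction` helper further assumes a diploid baseline of two copies, which is not stated in this file description.

```python
NO_CALL = -1


def encode_call(reported):
    """Map a reported call to the 0-4 / -1 scheme described above.

    `reported` is an integer copy number, the text "loss", "deletion" or
    "gain", or None for a no-call (input conventions assumed for this sketch).
    """
    if reported is None:
        return NO_CALL
    if isinstance(reported, str):
        label = reported.strip().lower()
        if label in ("loss", "deletion"):
            return 1
        if label == "gain":
            return 3
        raise ValueError(f"unrecognized call label: {reported!r}")
    if reported < 0:
        raise ValueError("copy number cannot be negative")
    return min(reported, 4)  # copy numbers above 4 are listed as 4


def direction(code):
    """Collapse an encoded call to loss/gain, assuming a baseline of 2 copies."""
    if code == NO_CALL:
        return "no-call"
    if code < 2:
        return "loss"
    if code > 2:
        return "gain"
    return "none"


encoded = [encode_call(x) for x in (0, 2, 7, "deletion", "gain", None)]
directions = [direction(c) for c in encoded]
```

Collapsing to a direction in this way is what allows studies that report integer copy numbers to be compared with studies that report only loss or gain.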

Format: TXT Size: 3.4MB