Open Access Highly Accessed Method

Sniper: improved SNP discovery by multiply mapping deep sequenced reads

Daniel F Simola1,2 and Junhyong Kim1,3*

Author Affiliations

1 Department of Biology, University of Pennsylvania, 433 S. University Ave, Philadelphia, PA 19104, USA

2 Department of Cell and Developmental Biology, University of Pennsylvania, 421 Curie Blvd, Philadelphia, PA 19104, USA

3 Penn Genome Frontiers Institute, University of Pennsylvania, 433 S. University Ave, Philadelphia, PA 19104, USA

Genome Biology 2011, 12:R55  doi:10.1186/gb-2011-12-6-r55

Published: 20 June 2011

Additional files

Additional file 1:

Texts S1 to S4. Text S1: description of the analysis of repetitive elements contributing to non-unique alignments. Text S2: details of our Bayesian probability model for SNP detection. Text S3: performance estimates based on comparison to the Sanger-validated data set reported in Harismendy et al. [9]. Text S4: performance estimates obtained when the expected base-call sequencing error rate parameter is varied relative to the actual error rate.

Format: PDF Size: 1.1MB

Additional file 2:

Figure S1 - simulated paired-end read multiplicity distributions for the human genome. The number of valid alignments against the Homo sapiens genome is reported as a proportion of the 2 × 10^6 randomly sampled PE reads used in alignment, varying fragment length (250, 500, 750, or 1,000 nucleotides) and read length (30, 60, 90, or 120 nucleotides). The proportion of read multiplicity averaged over reads of the same length is shown in each figure.

Format: PDF Size: 353KB

Additional file 3:

Table S1 - redundancy structure analysis for human NGS data. This Excel file describes the number of reads and number of alignments for each read map type (unique (UNI), best-guess (BEST), and total max-d (ALL)) associated with the human genome.

Format: XLS Size: 46KB

Additional file 4:

Figure S2 - examples of false read mapping. False read mapping occurs when a sequenced read is incorrectly aligned to its reference sequence (that is, to the incorrect location in the genome). This is most likely to occur in the presence of closely related sequences existing in replicate in the reference sequence and can result from either (a) SNP occurrence or (b) base-call sequencing error in the sample genome, such that the similarity between reads containing a variant (or false) allele and the reference genome decreases at one locus and increases at another (false) locus. Instances of false mapping consequently decrease the chance of a SNP call at a true locus and increase the chance of a false SNP call at the wrong locus.

Format: PDF Size: 320KB

Additional file 5:

Table S2 - complete spurious mapping statistics. This Excel file contains read map statistics for reads that overlap SNP loci.

Format: XLS Size: 81KB

Additional file 6:

Figure S3 - relationship between posterior probability and read degeneracy. Box and whisker plots showing the distribution of posterior probabilities Q (stringency) for all SNPs identified in each of five replicates for two different simulations (ribosomal protein loci (RPL) and 2 × RPL + 10%), grouped by per-locus degeneracy. For example, the 0 ≤ d < 0.1 group contains loci with at least a 1/10 ratio of alignments at another locus versus alignments overlapping the locus of interest. Box plots represent the entire distribution of Q values for each degeneracy bin, where the red line indicates the median and the box indicates the 25th and 75th percentiles. SNPs were obtained using the 25-fold coverage simulations allowing k ≤ 1 mismatch.

Format: PDF Size: 99KB

Additional file 7:

Figure S4 - per-locus distributions of degeneracy. Cumulative distributions of per-locus degeneracy are shown for each of the six reference genomic DNA templates used in this study. Degeneracy is defined as the ratio of d, the number of alignments for a read that overlap loci other than the locus of interest, to the read depth at a locus of interest. (Alternatively, Alignments/Reads - 1.) For example, a ratio of 1 indicates that every read overlapping the locus of interest has two valid alignments in the reference genome. Loci are binned into 12 groups and the cumulative frequency of all loci is reported at each degeneracy group. Estimates are shown using the ALL read map with k = 1, 2, and 3 mismatches.
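The alternative form of the degeneracy definition above (Alignments/Reads - 1) can be illustrated with a minimal Python sketch. The function name and its two arguments are hypothetical and are not taken from the Sniper software; the sketch assumes that the total number of valid alignments for the reads covering a locus, and the read depth at that locus, are already known.

```python
def per_locus_degeneracy(total_alignments: int, read_depth: int) -> float:
    """Illustrative helper (hypothetical name): Alignments/Reads - 1.

    total_alignments: all valid alignments, anywhere in the reference,
                      of the reads that overlap the locus of interest.
    read_depth:       number of reads overlapping the locus of interest.
    """
    if read_depth <= 0:
        raise ValueError("read_depth must be positive")
    if total_alignments < read_depth:
        raise ValueError("each overlapping read contributes at least one alignment")
    return total_alignments / read_depth - 1.0


# Ten reads overlap the locus and each has two valid alignments in the
# reference, so the degeneracy ratio is 1, as in the caption's example.
example = per_locus_degeneracy(total_alignments=20, read_depth=10)
```

A locus covered only by uniquely mapping reads has a degeneracy of 0 under this definition.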

Format: PDF Size: 420KB

Additional file 8:

Table S3 - complete statistics for human genotyping performance comparison. This file contains performance statistics based on the Harismendy et al. [9] human data set for Sniper, Maq, and SOAPsnp at different stringency levels for ALL, UNI, and BEST read map types; total SNP loci identified; and putative novel SNPs identified by Sniper. For each program and read map approach, genotypes for four individuals were compared to those determined by ABI Sanger sequencing. The benchmark set (Sanger \ NGS) contains 253 SNPs. True positive rates (TPRs) and false discovery rates (FDRs) were estimated from these comparisons (top rows). In parentheses, from left to right: matching genotypes; positions identified as SNP but differing in genotype; SNPs not identified by Sanger; Sanger SNPs not identified by program. Sniper SNPs are reported using a stringency threshold on the phred-like posterior probability: Q = -10 log10(1 - P), where Q ≥ 13 (P < 0.05). Maq SNPs were generated using Q ≥ 13 minimum consensus quality, Q_adj ≥ 13 minimum adjacent quality, and prior probability of SNP P_SNP = 0.001, with default settings otherwise. SOAPsnp SNPs were generated using default settings, '-t -u -n', a 2:1 transition:transversion ratio for prior probability, and P_SNP = 0.001. Four SNPs are predicted by Sniper, Maq, and SOAPsnp for k = 2 mismatches, identified by comparison to Sanger benchmark data (Additional file 8). Statistics were generated using Sniper.
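The phred-like stringency score quoted above, Q = -10 log10(1 - P), can be worked through in a few lines of Python. This is only a sketch of the stated formula, with P taken to be the posterior probability of a SNP call; the function names are hypothetical and are not part of the Sniper distribution.

```python
import math


def phred_like_q(posterior: float) -> float:
    """Convert a posterior probability P into Q = -10 * log10(1 - P)."""
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    if posterior == 1.0:
        return math.inf  # 1 - P is zero, so Q is unbounded
    return -10.0 * math.log10(1.0 - posterior)


def posterior_from_q(q: float) -> float:
    """Invert the formula: P = 1 - 10 ** (-Q / 10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


# The stringency cutoffs used in the figures (Q >= 13, 40, 90), expressed
# as the residual probability 1 - P that each cutoff allows.
residuals = {q: 10.0 ** (-q / 10.0) for q in (13, 40, 90)}
```

Under this reading, Q ≥ 13 corresponds to 1 - P of roughly 0.05, and Q ≥ 40 to 1 - P of 0.0001.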

Format: XLS Size: 39KB

Additional file 9:

Table S4 - read map statistics for the Harismendy et al. NGS data set. This Excel file details our read maps for the Harismendy et al. [9] human NGS data.

Format: XLS Size: 21KB

Additional file 10:

Figure S5 - human genotyping performance across coverage levels. Accuracy bar charts and receiver operating characteristic (ROC)-style curves for human SNP identification at six coverage levels. Reads from one individual (NA17156) were subsampled randomly from the complete approximately 188-fold coverage data in five replicates. Each subsampled read set was independently aligned to the human genome using ALL, UNI, or BEST maps with k = 1, 2, or 3 mismatches and genotyped using Sniper. (a) Bar charts reporting genotyping accuracy for each condition. Error bars show ± standard error of the mean. (b) ROC-style curves are shown as 1 - accuracy versus sensitivity. Three stringency levels (Q ≥ 13, 40, 90) are shown for each curve.

Format: PDF Size: 530KB

Additional file 11:

Figure S6 - simulation negative control. (a) Accuracy bar charts and (b) ROC-style curves showing performance of ALL, UNI, and BEST maps on our negative control synthetic DNA template (2 × RPL + 0%). Five unknown sample genomes were generated from each reference template by adding SNPs randomly to a proportion of 0.001. Read sets were sampled from each sample genome to one of four coverage levels (25-fold, 50-fold, 100-fold, 200-fold). Read sets were independently aligned to their respective reference genome using ALL, UNI, or BEST maps with k = 1, 2, or 3 mismatches and genotyped using Sniper. (a) Bar charts report genotyping accuracy for each condition. Error bars show ± standard error of the mean over five replicates. (b) ROC-style curves are shown as the number of false positive calls versus sensitivity. Three stringency levels for Q (13, 40, 90) are shown for each curve.

Format: PDF Size: 412KB

Additional file 12:

Figure S7 - genotyping accuracy for simulated data. Bar charts are shown reporting SNP identification accuracy on four synthetic genomic DNA templates (RPL, 2 × RPL + 2%, 2 × RPL + 5%, 2 × RPL + 10%). Five unknown sample genomes were generated from each reference template by adding SNPs randomly to a proportion of 0.001. Read sets were sampled from each sample genome to one of nine coverage levels (4-fold, 10-fold, 25-fold, 32-fold, 50-fold, 75-fold, 100-fold, 150-fold, 200-fold). Read sets were independently aligned to their respective reference genome using ALL, UNI, or BEST maps with k = 1, 2, or 3 mismatches and genotyped using Sniper. Error bars show ± standard error of the mean over five replicates.

Format: PDF Size: 764KB

Additional file 13:

Figure S8 - genotyping receiver operating characteristic curves for simulated data sets. Receiver operating characteristic (ROC)-style curves are shown reporting SNP identification performance on four synthetic genomic DNA templates (RPL, 2 × RPL + 2%, 2 × RPL + 5%, 2 × RPL + 10%). Five unknown sample genomes were generated from each reference template by adding SNPs randomly to a proportion of 0.001. Read sets were sampled from each sample genome to one of nine coverage levels (4-fold, 10-fold, 25-fold, 32-fold, 50-fold, 75-fold, 100-fold, 150-fold, 200-fold). Read sets were independently aligned to their respective reference genome using ALL, UNI, or BEST maps with k = 1, 2, or 3 mismatches and genotyped using Sniper. Plots show the number of false positive SNPs plus the number of calls with no read coverage versus sensitivity.

Format: PDF Size: 1.4MB

Additional file 14:

Figure S9 - genotyping performance for simulated data. Estimates of true positive rate, false positive rate, and false discovery rate are provided for our four synthetic templates for all mapping strategies and mismatch conditions, based on 50× simulated read sets and genotyped using a Q ≥ 40 stringency cutoff. Estimates for the BESTNO-Q strategy (best no-guess mapping using read quality values for mapping) were based on default Bowtie settings (-n mode with -l 28 -e 70).

Format: PDF Size: 405KB

Additional file 15:

Table S5 - performance estimates under variable sequencing error rates. This Excel file provides performance estimates as described in Text S4 in Additional file 1.

Format: XLS Size: 23KB

Additional file 16:

Table S6 - comparison of novel Sniper SNPs with the HapMap collection. This Excel file contains SNP calls and statistics for a HapMap-intersecting subset of the 454 Sniper SNPs not identified by GATK and the 412 SNPs differing in genotype from GATK.

Format: XLSX Size: 50KB

Additional file 17:

Python software implementation of Sniper (version 1.5.8).

Format: ZIP Size: 191KB