Open Access Highly Accessed Research

Characterizing and measuring bias in sequence data

Michael G Ross*, Carsten Russ, Maura Costello, Andrew Hollinger, Niall J Lennon, Ryan Hegarty, Chad Nusbaum and David B Jaffe

Author Affiliations

The Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA

Genome Biology 2013, 14:R51  doi:10.1186/gb-2013-14-5-r51

Published: 29 May 2013

Additional files

Additional file 1:

The 'bad promoters' list for Human assembly 19 (GRCh37), as described in the main text and in Materials and methods, computed from HiSeq v2 data set A2. Intervals are annotated with gene names and the coverage ratios used to select them.

Format: TXT; Size: 44 KB

Open Data

Additional file 2:

The 'bad promoters' list for Human assembly 19 (GRCh37), as described in the main text and in Materials and methods, computed from HiSeq v3 data set A3. Intervals are annotated with gene names and the coverage ratios used to select them.

Format: TXT; Size: 43 KB

Additional file 3:

The supplementary tables referred to in the text.

Format: DOCX; Size: 89 KB

Additional file 4:

Figure S1 - Human error rates as a function of GC composition and reference. Each graph shows mismatch (light blue), deletion (dark blue), and insertion (maroon) rates (y-axis) as a function of GC composition (x-axis). Data are shown for the human NA12878 sample sequenced by Illumina HiSeq (Table 2, data set 14) and Ion Torrent PGM (Table 2, data set 15) aligned both to the standard Human assembly 19 (GRCh37) reference and to the NA12878-specific diploid reference created by the Gerstein lab [37]. Error rates are only plotted for GC percentages for which there are at least 1,000 100-base windows in Human assembly 19.

Format: PDF; Size: 144 KB

Additional file 5:

Figure S2 - Human error rates as a function of homopolymer length and reference. Each graph shows mismatch (light blue), deletion (dark blue), and insertion (maroon) rates (y-axis) within homopolymers of various lengths (x-axis). Data are plotted from human sample NA12878 as sequenced by Illumina HiSeq (Table 2, data set 14) and Ion Torrent PGM (Table 2, data set 15) and aligned both to the standard Human assembly 19 (GRCh37) reference and to the NA12878-specific diploid reference created by the Gerstein lab [37].

Format: PDF; Size: 63 KB

Additional file 6:

The intervals of the human reference that had relative coverage below 0.1 in data set 14 and could not be categorized either as biological variation or as similar to known bias motifs. The GC content fraction and homopolymer N50 are also given for each interval.

Format: CSV; Size: 1.2 MB

Additional file 7:

The SRA accession numbers for all Illumina, Ion Torrent, and Pacific Biosciences data used in the paper.

Format: XLSX; Size: 123 KB
