Table 2

Identifying low-quality reads and their contribution to the error rate

| Data selection | Percent of reads | Error rate |
|---|---|---|
| All reads | 100.0% | 0.49% |
| Reads with no Ns | 94.4% | 0.24% |
| Reads with one or more Ns | 5.6% | 4.7% |
| Reads with length ≥81 and ≤108 | 98.8% | 0.33% |
| Reads with length <81 or >108 | 1.2% | 18.9% |
| Reads with no Ns and length ≥81 and ≤108 | 93.3% | 0.20% |
| Reads with no proximal errors | 97.0% | 0.45% |
| Reads with fewer than three proximal errors | >99.99% | 0.48% |
| Reads with more than three proximal errors | <0.01% | 12.2% |
| Reads with no Ns and length ≥81 and ≤108 and no proximal errors | 90.6% | 0.16% |


Removing reads with Ns is the most effective means we found of removing low-quality data and improving the error rate: excluding the 5.6% of reads that contain one or more Ns lowers the error rate from 0.49% to 0.24%. Reads that are shorter or longer than expected, falling outside the main peak of the read-length distribution (length <81 or >108), are also strongly associated with errors.
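As an illustration only, the two strongest filters in the table (no Ns, and length between 81 and 108 inclusive) could be applied with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the function names `passes_filter` and `filter_reads`, the `(read_id, sequence)` input format, and the example reads are all assumptions made for the example. The proximal-error filter is omitted because it depends on details that this table does not give.

```python
from typing import Iterable, Iterator, Tuple

MIN_LEN = 81   # lower length bound taken from Table 2
MAX_LEN = 108  # upper length bound taken from Table 2


def passes_filter(seq: str, min_len: int = MIN_LEN, max_len: int = MAX_LEN) -> bool:
    """Return True if the read has no ambiguous bases (N) and an in-range length.

    Hypothetical helper for illustration; not from the original paper.
    """
    has_n = "N" in seq.upper()
    in_range = min_len <= len(seq) <= max_len
    return not has_n and in_range


def filter_reads(reads: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield the (read_id, sequence) pairs that pass both checks."""
    for read_id, seq in reads:
        if passes_filter(seq):
            yield read_id, seq


# Made-up example reads, not data from the study.
reads = [
    ("read1", "ACGT" * 25),                      # 100 long, no Ns: kept
    ("read2", "ACGT" * 12 + "N" + "ACGT" * 12),  # 97 long, contains an N: dropped
    ("read3", "ACGT" * 10),                      # 40 long, too short: dropped
    ("read4", "ACGT" * 30),                      # 120 long, too long: dropped
]
kept = [read_id for read_id, _ in filter_reads(reads)]
print(kept)  # ['read1']
```

Keeping the bounds as parameters makes it easy to adapt the same check to a data set whose length distribution peaks elsewhere.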

Huse et al. Genome Biology 2007 8:R143   doi:10.1186/gb-2007-8-7-r143
