Table 2

Identifying low-quality reads and their contribution to the error rate

Data selection | Percent of reads | Error rate
All reads | 100.0% | 0.49%
Reads with no Ns | 94.4% | 0.24%
Reads with one or more Ns | 5.6% | 4.7%
Reads with length ≥81 and ≤108 | 98.8% | 0.33%
Reads with length <81 or >108 | 1.2% | 18.9%
Reads with no Ns and length ≥81 and ≤108 | 93.3% | 0.20%
Reads with no proximal errors | 97.0% | 0.45%
Reads with fewer than three proximal errors | >99.99% | 0.48%
Reads with more than three proximal errors | <0.01% | 12.2%
Reads with no Ns and length ≥81 and ≤108 and no proximal errors | 90.6% | 0.16%

Removing reads with Ns is the most effective means we found of removing low-quality data and lowering the error rate. Reads that are longer or shorter than expected, falling outside the main peak of read lengths, also correlate strongly with incorrect reads.
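
The two simplest filters in the table (no Ns, and length between 81 and 108) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the function names `passes_quality_filter` and `filter_reads` are hypothetical, reads are assumed to be plain sequence strings, and the proximal-error filter is left out because it depends on primer information not given here.

```python
def passes_quality_filter(seq, min_len=81, max_len=108):
    """Return True if a read has no ambiguous bases and an expected length.

    The default thresholds are the length bounds listed in the table;
    they are assumptions for any other data set.
    """
    has_n = "N" in seq.upper()                  # any ambiguous base call
    in_range = min_len <= len(seq) <= max_len   # inclusive length window
    return (not has_n) and in_range


def filter_reads(reads, min_len=81, max_len=108):
    """Keep reads that pass both filters and report the fraction retained."""
    kept = [r for r in reads if passes_quality_filter(r, min_len, max_len)]
    fraction_kept = len(kept) / len(reads) if reads else 0.0
    return kept, fraction_kept
```

Applied to a list of reads, `filter_reads` returns the retained reads together with the fraction kept, which can be compared against the "Percent of reads" column for the combined no-Ns and length filter.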

Huse et al. Genome Biology 2007 8:R143   doi:10.1186/gb-2007-8-7-r143