Identifying low-quality reads and their contribution to the error rate

| Data selection | Percent of reads | Error rate |
|---|---|---|
| All reads | 100.0% | 0.49% |
| Reads with no Ns | 94.4% | 0.24% |
| Reads with one or more Ns | 5.6% | 4.7% |
| Reads with length ≥81 and ≤108 | 98.8% | 0.33% |
| Reads with length <81 or >108 | 1.2% | 18.9% |
| Reads with no Ns and length ≥81 and ≤108 | 93.3% | 0.20% |
| Reads with no proximal errors | 97.0% | 0.45% |
| Reads with fewer than three proximal errors | >99.99% | 0.48% |
| Reads with more than three proximal errors | <0.01% | 12.2% |
| Reads with no Ns and length ≥81 and ≤108 and no proximal errors | 90.6% | 0.16% |

Removing reads with Ns is the most effective means we found of removing low-quality data and improving the error rate. Reads that are longer or shorter than expected, with lengths outside the peak of common read lengths, also correlate strongly with incorrect reads.
Huse et al. Genome Biology 2007 8:R143 doi:10.1186/gb-2007-8-7-r143
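
The two most effective selections in the table (no Ns, and length between 81 and 108) can be expressed as a simple per-read filter. The following is a minimal sketch, not the authors' pipeline: the function names `passes_filter` and `filter_reads` are hypothetical, reads are assumed to be plain sequence strings, and the proximal-error criterion is left out because it requires comparing each read against its primer.

```python
from typing import Iterable, List

MIN_LEN = 81   # lower length bound, taken from the table
MAX_LEN = 108  # upper length bound, taken from the table


def passes_filter(seq: str, min_len: int = MIN_LEN, max_len: int = MAX_LEN) -> bool:
    """Return True if the read has no ambiguous bases (N) and an expected length."""
    has_n = "N" in seq.upper()
    return (not has_n) and (min_len <= len(seq) <= max_len)


def filter_reads(reads: Iterable[str]) -> List[str]:
    """Keep only the reads that pass both the N filter and the length filter."""
    return [seq for seq in reads if passes_filter(seq)]


if __name__ == "__main__":
    # Made-up example reads, for illustration only.
    example_reads = [
        "ACGT" * 25,                      # 100 bases, no Ns: kept
        "ACGT" * 12 + "N" + "ACGT" * 12,  # 97 bases with one N: dropped
        "ACGT" * 10,                      # 40 bases, too short: dropped
        "ACGT" * 30,                      # 120 bases, too long: dropped
    ]
    kept = filter_reads(example_reads)
    print(f"kept {len(kept)} of {len(example_reads)} reads")
```

The length bounds are passed as parameters because they are specific to the amplicon in this data set; a different target region would need its own expected length range.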