|
Figure 1.
Receiver operating characteristic (ROC) curves for two non-coding RNA prediction algorithms,
ClosingBp (Bradley RK, Uzilov AV, Skinner M, Bendaña YR, Barquist L and Holmes I,
submitted) and EVOFOLD [39] (implemented using XRATE), using GSIMULATOR and SIMGENOME models to estimate the
false positive discovery rate. These curves illustrate the general principle that
the more realistic a simulation model, the higher the estimated false positive rate
(FPR). This trend is independent of the gene-prediction algorithm used. The upper
panes show results for GSIMULATOR: more complex indel length distributions
(N) and, in particular, context-dependence (K) both increase the FPR. The lower panes show results for SIMGENOME and component
models, where the FPR is increased by including gaps (which amplify fluctuations in
information content, due to their typically being treated as 'missing information')
and genomic features (some of which evolve at a slower rate than neutral sequence).
The asymptotic sensitivity is less than 1.0 because our benchmark
used a sliding-window approach, predicting at most one non-coding RNA (ncRNA) in each
window. Our set of real ncRNAs was taken from multi-genome Drosophila alignments produced by the PECAN program [50]; in each case, to ensure a fair comparison, we took a window of the PECAN alignment
surrounding the annotated ncRNA, with the size of this window matching the size of
the sliding-window that was used on the simulated null data. Some of the positive
ncRNAs in these PECAN-aligned windows score so poorly under the gene prediction model
(for example, due to inaccuracies in the PECAN alignment of that window) that the
predicted ncRNA is consistently placed in the wrong location within the window. These
real ncRNAs are, therefore, never detected, no matter how low the scoring threshold,
setting an upper limit on the achievable sensitivity.
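
The sensitivity ceiling described above can be illustrated with a minimal sketch. This is not the benchmark code used for the figure: the function `roc_points`, its parameters, and the toy scores are all hypothetical, and the sketch assumes one top-scoring prediction per window, with a positive window counted as detected only if that prediction lands on the annotated ncRNA.

```python
from typing import List, Sequence, Tuple


def roc_points(
    positive_scores: Sequence[float],
    positive_located: Sequence[bool],
    null_scores: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, sensitivity) triples, highest threshold first.

    positive_scores  : top prediction score in each real-ncRNA window
    positive_located : True if that prediction overlaps the annotated ncRNA
    null_scores      : top prediction score in each simulated null window
    (All three inputs are hypothetical stand-ins for benchmark output.)
    """
    if len(positive_scores) != len(positive_located):
        raise ValueError("one 'located' flag is needed per positive window")
    if not positive_scores or not null_scores:
        raise ValueError("both positive and null windows are required")

    # Every distinct observed score is a candidate threshold.
    thresholds = sorted(set(positive_scores) | set(null_scores), reverse=True)

    points = []
    for t in thresholds:
        # A true positive must pass the threshold AND be in the right place.
        tp = sum(
            1 for s, ok in zip(positive_scores, positive_located) if ok and s >= t
        )
        # Any prediction in a simulated null window is a false positive.
        fp = sum(1 for s in null_scores if s >= t)
        points.append((t, fp / len(null_scores), tp / len(positive_scores)))
    return points


# Toy data: the fourth real ncRNA is always predicted in the wrong location.
points = roc_points(
    positive_scores=[9.0, 7.5, 6.0, 2.0],
    positive_located=[True, True, True, False],
    null_scores=[8.0, 5.0, 3.0, 1.0, 0.5],
)
for threshold, fpr, sensitivity in points:
    print(f"threshold={threshold:4.1f}  FPR={fpr:.2f}  sensitivity={sensitivity:.2f}")
```

In this toy case the sensitivity levels off at 0.75 even when the threshold is low enough for the FPR to reach 1.0, because the mislocated ncRNA can never be counted as detected.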
Varadarajan et al. Genome Biology 2008 9:R147 doi:10.1186/gb-2008-9-10-r147 |