|
Figure 1.
Modeling pipeline. Genes longer than 4,100 bp were extended and divided into 81 bins. The chromatin
feature density in each bin is logarithm-transformed and then used to determine the
best bin (the bin that has the strongest correlation with the expression values).
To avoid log2(0), a pseudocount is added to each bin; the pseudocount is optimized using
one-third of the genes in each dataset (D1) and then applied to the other two-thirds of
the genes (D2) for the rest of the analysis. D2 was divided into a training
set (TR) and a testing set (TS) by ten-fold cross-validation. A two-step model
was built using the training set. First, a classification model C(X) was learned to
discriminate the 'on' and 'off' genes, followed by a regression model R(X) for predicting
the expression levels of the 'on' genes. Finally, the correlation between the predicted
expression values for the testing set, C(TS_X)*R(TS_X), and the measured expression values
of the testing set (TS_Y) was used to measure the overall performance of the model. TSS,
transcription start site; TTS, transcription termination site; RMSE, root-mean-square
error.
Dong et al. Genome Biology 2012 13:R53 doi:10.1186/gb-2012-13-9-r53 |
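
The pipeline in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The caption does not name the learners, so a decision stump stands in for C(X) and a one-feature least-squares line for R(X). The fixed pseudocount of 1, the 'on' rule (expression > 0), the use of absolute correlation to pick the best bin, the single hold-out split in place of ten-fold cross-validation, and all function names and toy data are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def log_bin(density, b, pseudocount):
    """log2-transformed density of bin b for every gene."""
    return [math.log2(row[b] + pseudocount) for row in density]

def best_bin(density, expression, pseudocount):
    """Bin whose transformed density correlates most strongly with expression
    (absolute correlation is an assumption of this sketch)."""
    n_bins = len(density[0])
    return max(range(n_bins),
               key=lambda b: abs(pearson(log_bin(density, b, pseudocount), expression)))

def fit_classifier(x, is_on):
    """Stand-in for C(X): a decision stump, assuming 'on' genes have higher x."""
    values = sorted(set(x))
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    cut = max(cuts, key=lambda c: sum((xi > c) == on for xi, on in zip(x, is_on)))
    return lambda xi: 1 if xi > cut else 0

def fit_regression(x, y):
    """Stand-in for R(X): least-squares line fitted on the 'on' genes only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return lambda xi: (my - slope * mx) + slope * xi

def demo():
    # Toy data: 30 genes, 3 bins; bin 2 carries the signal by construction.
    expression = [0] * 10 + list(range(1, 21))
    density = [[5, (i % 2) * 4, 2 ** y - 1] for i, y in enumerate(expression)]
    pseudocount = 1  # fixed here; the caption optimizes it on D1
    b = best_bin(density, expression, pseudocount)
    x = log_bin(density, b, pseudocount)

    # Single hold-out split (the caption uses ten-fold cross-validation).
    train = [i for i in range(len(x)) if i % 3 != 0]
    test = [i for i in range(len(x)) if i % 3 == 0]

    c = fit_classifier([x[i] for i in train], [expression[i] > 0 for i in train])
    on = [i for i in train if expression[i] > 0]
    r = fit_regression([x[i] for i in on], [expression[i] for i in on])

    # Overall prediction is C(TS_X) * R(TS_X), scored against TS_Y.
    predicted = [c(x[i]) * r(x[i]) for i in test]
    return b, pearson(predicted, [expression[i] for i in test])

if __name__ == "__main__":
    print(demo())
```

Multiplying the two outputs sets the prediction to zero for genes classified as 'off', so the regression only has to fit the genes that are expressed.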