## Figure 2.
Quantitative relationship between chromatin features and expression. (a) Scatter plot of expression values predicted by the two-step model (random
forests classification model and linear regression model) versus the expression of PolyA+
cytosolic RNA from K562 cells measured by CAGE. Each blue dot represents one gene.
The red dashed line indicates the linear fit between measured and predicted expression
values, which are highly correlated (Pearson's correlation coefficient [PCC] r = 0.9, P-value < 2.2 × 10^{-16}), indicating a quantitative relationship between chromatin features and expression
levels. The accuracy of the overall model is given by the root-mean-square error (RMSE),
which is 1.9. The accuracy of the classification model is given by the area under the
ROC curve (AUC), which is 0.95. The accuracy of the regression model is r = 0.77 (RMSE = 2.3).
(b) The relative importance of chromatin features in the two-step model. The most important
features for the classifier (upper panel) include H3K9ac, H3K4me3, and DNase I hypersensitivity,
while the most important features for the regressor (bottom panel) include H3K79me2,
H3K36me3, and DNase I hypersensitivity.
(c) Summary of overall prediction accuracy across 78 expression experiments on whole-cell,
cytosolic, or nuclear RNA from seven cell lines. The bars are sorted by correlation
coefficient in decreasing order for each high-throughput technique (CAGE, RNA-PET,
and RNA-Seq). Each bar is composed of several colors, corresponding to the relative
contribution of each feature in the regression model. The red dashed line represents
the median PCC (r = 0.83). Code for cell lines: K, K562; G, GM12878; 1, H1-hESC; H, HepG2; E, HeLa-S3;
N, NHEK; U, HUVEC. Code for RNA extraction: +, PolyA+; -, PolyA-. Code for cell compartment:
W, whole cell; C, cytosol; N, nucleus.
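
The three accuracy measures quoted in the caption (PCC, RMSE, and AUC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `pearson_r`, `rmse`, and `roc_auc` are hypothetical, the toy values are made up for illustration, and the AUC is computed with the standard pairwise-ranking definition rather than any particular library routine.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted values."""
    n = len(measured)
    squared_error = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared_error / n)


def roc_auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive example receives the higher score; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (assumed, not taken from the paper).
measured = [0.0, 1.0, 2.0, 3.0]        # e.g. measured expression per gene
predicted = [0.5, 1.5, 1.5, 3.5]       # e.g. model predictions per gene
expressed = [0, 0, 1, 1]               # binary labels for the classifier step
class_scores = [0.1, 0.4, 0.35, 0.8]   # classifier scores for the same genes

print("PCC r:", pearson_r(measured, predicted))
print("RMSE: ", rmse(measured, predicted))
print("AUC:  ", roc_auc(expressed, class_scores))
```

In this framing, PCC and RMSE apply to continuous predictions (the overall model and the regression step), while AUC applies only to the binary expressed-versus-not-expressed scores from the classification step.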
Dong |