Open Access Research

A simple metric of promoter architecture robustly predicts expression breadth of human genes suggesting that most transcription factors are positive regulators

Laurence D Hurst1, Oxana Sachenkova2,3, Carsten Daub3, Alistair RR Forrest4,8, the FANTOM consortium and Lukasz Huminiecki2,3,5,6,7*

Author Affiliations

1 Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK

2 Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden

3 Science for Life Laboratory, SciLifeLab, Stockholm, Sweden

4 RIKEN Omics Science Center, Yokohama, Japan

5 Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden

6 BILS bioinformatics infrastructure for life sciences, Stockholm, Sweden

7 Department of Immunology Genetics and Pathology, Uppsala University, Uppsala, Sweden

8 Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Yokohama, Kanagawa, Japan


Genome Biology 2014, 15:413  doi:10.1186/s13059-014-0413-3

Published: 31 July 2014



Conventional wisdom holds that, owing to the dominance of features such as chromatin-level control, the expression of a gene cannot be readily predicted from knowledge of promoter architecture. This is reflected, for example, in a weak or absent correlation between promoter divergence and expression divergence between paralogs. However, an inability to predict may instead reflect inaccurate measurement or the use of the wrong parameters. Here we address this issue by integrating two exceptional resources: ENCODE data on transcription factor binding and the FANTOM5 high-resolution expression atlas.


Consistent with the notion that in eukaryotes most transcription factors are activating, the number of transcription factors binding a promoter is a strong predictor of expression breadth. In addition, evolutionarily young duplicates have fewer transcription factor binders and narrower expression. Nonetheless, we find several binders and cooperative sets that are disproportionately associated with broad expression, indicating that models more complex than simple correlations should hold more predictive power. Indeed, a machine learning approach improves fit to the data compared with a simple correlation. Machine learning could, at best, moderately predict the tissue of expression of tissue-specific genes.
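As a rough illustration of the simple metric described above, the sketch below counts the distinct transcription factors bound at each promoter, computes expression breadth per gene, and rank-correlates the two. It is a minimal sketch under stated assumptions, not the authors' pipeline: the gene and factor names, the toy values, the expression threshold of 1.0, the definition of breadth as the fraction of samples above that threshold, and the choice of Spearman correlation are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def tf_count(binding):
    """Number of distinct transcription factors bound at each promoter."""
    return {gene: len(set(tfs)) for gene, tfs in binding.items()}


def expression_breadth(expression, threshold=1.0):
    """Fraction of samples in which a gene exceeds the threshold.

    Both the threshold and this definition of breadth are assumptions
    made for illustration only.
    """
    return {
        gene: sum(value > threshold for value in values) / len(values)
        for gene, values in expression.items()
    }


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: factors bound per promoter, and expression
# values of each gene across four samples.
binding = {
    "geneA": ["TF1", "TF2", "TF3", "TF4"],
    "geneB": ["TF1", "TF2"],
    "geneC": ["TF3"],
    "geneD": [],
}
expression = {
    "geneA": [5.0, 6.0, 7.0, 8.0],
    "geneB": [5.0, 0.0, 3.0, 0.0],
    "geneC": [0.0, 0.0, 4.0, 0.0],
    "geneD": [0.0, 0.0, 0.0, 0.0],
}

counts = tf_count(binding)
breadth = expression_breadth(expression)
genes = sorted(binding)
rho = spearman([counts[g] for g in genes], [breadth[g] for g in genes])
print(f"Spearman rho between binder count and breadth: {rho:.3f}")
```

In this toy data the gene with the most binders is also the most broadly expressed, so the rank correlation is strongly positive. On real data the same calculation would take ENCODE-style binding calls and FANTOM5-style expression values as inputs, and a rank correlation is used here only because it does not assume a linear relationship.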


We find robust evidence that some expression parameters, and paralog expression divergence, are strongly predictable from knowledge of the transcription factor binding repertoire. While some cooperative complexes can be identified, a simple metric, the number of transcription factors bound at a promoter, is a robust predictor of expression breadth, consistent with the notion that most eukaryotic transcription factors are activating.