Genome Biology


Open Access, Highly Accessed, Research

Text-mining assisted regulatory annotation

Stein Aerts1,2*, Maximilian Haeussler3, Steven van Vooren4, Obi L Griffith5, Paco Hulpiau6, Steven JM Jones5, Stephen B Montgomery7, Casey M Bergman8* and The Open Regulatory Annotation Consortium

Author Affiliations

1 Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, Leuven, B-3000, Belgium

2 Department of Human Genetics, Katholieke Universiteit Leuven School of Medicine, Herestraat, Leuven, B-3000, Belgium

3 Institut de Neurosciences A Fessard, Centre National de la Recherche Scientifique, Gif-sur-Yvette, 91 198, France

4 Department of Electrical Engineering, Katholieke Universiteit Leuven, Heverlee, B-3001, Belgium

5 Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada

6 VIB Department for Molecular Biomedical Research, Ghent University, Ghent, 9052, Belgium

7 Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK

8 Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester, M13 9PT, UK


Genome Biology 2008, 9:R31 doi:10.1186/gb-2008-9-2-r31

Published: 13 February 2008

Abstract

Background

Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.

Results

We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process.
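The two steps described above, ranking abstracts by relevance in a vector space model and pulling candidate DNA sequences out of text, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The TF-IDF weighting, the cosine similarity measure, the 15-nucleotide minimum length, and all function names (`rank_abstracts`, `extract_dna_sequences`, and the helpers) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_abstracts(query, abstracts):
    """Return abstract indices ordered from most to least similar to the query.

    Assumption: smoothed TF-IDF weights and cosine similarity stand in for
    whatever weighting scheme a production system would use.
    """
    tokenized = [tokenize(a) for a in abstracts]
    n_docs = len(tokenized)
    # Document frequency: number of abstracts containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    def vectorize(tokens):
        # Terms never seen in the collection carry no weight.
        return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: (-scores[i], i))


def extract_dna_sequences(text, min_length=15):
    """Find runs of A/C/G/T at least min_length long (hypothetical threshold)."""
    pattern = r"\b[ACGTacgt]{%d,}\b" % min_length
    return [match.upper() for match in re.findall(pattern, text)]
```

In a pipeline of this kind, sequences returned by `extract_dna_sequences` would then be aligned to a genome with a separate tool to recover the location, organism and target gene; that mapping step is outside the scope of this sketch.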

Conclusion

Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.