Open Access, Highly Accessed, Research

Text-mining assisted regulatory annotation

Stein Aerts1,2, Maximilian Haeussler3, Steven van Vooren4, Obi L Griffith5, Paco Hulpiau6, Steven JM Jones5, Stephen B Montgomery7, Casey M Bergman8 and The Open Regulatory Annotation Consortium

1Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, Leuven, B-3000, Belgium

2Department of Human Genetics, Katholieke Universiteit Leuven School of Medicine, Herestraat, Leuven, B-3000, Belgium

3Institut de Neurosciences A Fessard, Centre National de la Recherche Scientifique, Gif-sur-Yvette, 91 198, France

4Department of Electrical Engineering, Katholieke Universiteit Leuven, Heverlee, B-3001, Belgium

5Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada

6VIB Department for Molecular Biomedical Research, Ghent University, Ghent, 9052, Belgium

7Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK

8Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester, M13 9PT, UK

Genome Biology 2008, 9:R31 doi:10.1186/gb-2008-9-2-r31

Published: 13 February 2008

Subject areas: Genome studies, Bioinformatics

Abstract

Background

Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.

Results

We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process.
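The two steps summarized above, ranking abstracts with a vector space model and pulling DNA sequences out of article text, can be illustrated with a minimal sketch. This is not the authors' implementation: the TF-IDF weighting, the centroid-of-seed-abstracts scoring, the 15-base minimum length, and every function name here are assumptions chosen for illustration. Mapping the extracted sequences to a genome is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_relevance(seed_abstracts, candidate_abstracts):
    """Rank candidates by similarity to the centroid of known-relevant seeds.

    Returns (candidate_index, score) pairs, most similar first.
    """
    if not seed_abstracts:
        raise ValueError("at least one seed abstract is required")
    vectors = tfidf_vectors(seed_abstracts + candidate_abstracts)
    seed_vecs = vectors[:len(seed_abstracts)]
    cand_vecs = vectors[len(seed_abstracts):]
    centroid = Counter()
    for vec in seed_vecs:
        for term, weight in vec.items():
            centroid[term] += weight / len(seed_vecs)
    scores = [(i, cosine(centroid, vec)) for i, vec in enumerate(cand_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def extract_dna_sequences(text, min_length=15):
    """Find uninterrupted runs of uppercase A/C/G/T of at least min_length.

    The length cutoff is an assumed heuristic to skip short tokens such
    as codons; sequences broken up by spaces or line breaks are missed.
    """
    pattern = re.compile(r"[ACGT]{%d,}" % min_length)
    return pattern.findall(text)
```

In this sketch, abstracts already known to describe cis-regulatory elements act as the seed set, and unlabelled abstracts are ordered by their cosine similarity to the seeds' average vector, so that a curator could read the highest-scoring papers first.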

Conclusion

Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.

