This article is part of the supplement: The BioCreative II - Critical Assessment for Information Extraction in Biology Challenge

Open Access Research

Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Alaa Abi-Haidar1,2, Jasleen Kaur1, Ana Maguitman3, Predrag Radivojac1, Andreas Rechtsteiner4, Karin Verspoor5, Zhiping Wang6 and Luis M Rocha1,2*

Author Affiliations

1 School of Informatics, Indiana University, 107 S. Indiana Ave. Bloomington, IN 47405, USA

2 FLAD (Fundação Luso-Americana para o Desenvolvimento) Computational Biology Collaboratorium, Instituto Gulbenkian de Ciência, Rua da Quinta Grande, 6 P-2780-156 Oeiras, Portugal

3 Departamento de Ciencias e Ingeniería de la Computación, Universidad Nacional del Sur, Avenida Alem 1253, Bahía Blanca, Buenos Aires, Argentina

4 Center for Genomics and Bioinformatics, Indiana University, 107 S. Indiana Ave. Bloomington, IN 47405, USA

5 Modeling, Algorithms and Informatics Group, Los Alamos National Laboratory, 1350 Central, MS C330 Los Alamos, NM 87545, USA

6 Biostatistics, School of Medicine, Indiana University, 107 S. Indiana Ave. Bloomington, IN 47405, USA

Genome Biology 2008, 9(Suppl 2):S11  doi:10.1186/gb-2008-9-s2-s11

Published: 1 September 2008

Abstract

Background:

We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.
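As a rough illustration of what a lightweight, spam-filter-style linear classifier for abstracts can look like, the sketch below scores each word by how much more often it appears in relevant than in irrelevant abstracts, then sums the scores of a new abstract's words and compares the sum with a threshold. This is not the model described in the article: the scoring rule, the function names (`word_scores`, `classify`), the threshold, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter


def word_scores(pos_docs, neg_docs):
    """Score each word by its document frequency in relevant documents
    minus its document frequency in irrelevant ones (assumed rule)."""
    pos_counts = Counter(w for doc in pos_docs for w in set(doc.lower().split()))
    neg_counts = Counter(w for doc in neg_docs for w in set(doc.lower().split()))
    vocab = set(pos_counts) | set(neg_counts)
    return {
        w: pos_counts[w] / len(pos_docs) - neg_counts[w] / len(neg_docs)
        for w in vocab
    }


def classify(doc, scores, threshold=0.0):
    """Sum the scores of the distinct words in doc; unseen words score 0.
    Returns (is_relevant, total_score)."""
    total = sum(scores.get(w, 0.0) for w in set(doc.lower().split()))
    return total > threshold, total


# Toy, invented training data (not taken from the BioCreative corpus).
relevant = ["protein a binds protein b", "kinase interacts with receptor"]
irrelevant = ["gene expression in liver tissue", "patients received the drug"]

scores = word_scores(relevant, irrelevant)
label, total = classify("the receptor binds the kinase", scores)
```

Because the decision is a single weighted sum, a model of this kind is cheap to train and easy to inspect: the highest-scoring words show directly which terms drive a positive classification.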

Results:

Our approach to the abstract classification task (IAS) ranked among the top submissions on the performance measures used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks achieved one of the highest recall rates and one of the highest mean reciprocal ranks of correct passages.
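For readers unfamiliar with two of the measures named above, the following sketch computes the balanced F-score and the mean reciprocal rank from their standard textbook definitions. The challenge evaluation may have used variants of these definitions; the function names and the toy inputs are assumptions made for this example.

```python
def f_score(tp, fp, fn):
    """Balanced F-score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranked_lists, correct_sets):
    """Average, over queries, of 1 / rank of the first correct item.
    A query with no correct item in its ranking contributes 0."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Toy, invented inputs: 8 true positives, 2 false positives, 2 false negatives.
f1 = f_score(tp=8, fp=2, fn=2)

# Two queries: the first correct passage is at rank 2 and rank 1 respectively.
mrr = mean_reciprocal_rank(
    [["p1", "p2", "p3"], ["p4", "p5"]],
    [{"p2"}, {"p4"}],
)
```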

Conclusion:

Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a rather lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, the expansion of word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.
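To make the idea of feature expansion with a word proximity network concrete, here is a minimal sketch. It assumes that the network links two words whenever they occur within a few tokens of each other, weighted by how often that happens, and that expansion adds the strongly linked neighbours of a seed word set. The abstract does not specify how the authors build or weight their networks, so the window size, the weighting, the `min_weight` cutoff, and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def proximity_network(sentences, window=2):
    """Count, for each unordered word pair, how often the two words
    occur within `window` tokens of each other (assumed weighting)."""
    weights = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, j in combinations(range(len(tokens)), 2):
            if j - i <= window and tokens[i] != tokens[j]:
                weights[frozenset((tokens[i], tokens[j]))] += 1
    return weights


def expand_features(seed_words, weights, min_weight=2):
    """Add every word linked to a seed word by an edge of weight >= min_weight
    (one-hop expansion only)."""
    seeds = set(seed_words)
    expanded = set(seeds)
    for pair, weight in weights.items():
        if weight >= min_weight and pair & seeds:
            expanded |= pair
    return expanded


# Toy, invented sentences (not taken from the challenge documents).
sentences = [
    "rad51 binds brca2",
    "brca2 binds rad51 directly",
    "cells were grown overnight",
]
network = proximity_network(sentences, window=2)
features = expand_features({"binds"}, network, min_weight=2)
```

In this toy case the seed word "binds" pulls in the two protein names that repeatedly appear next to it, while words that co-occur with it only once are left out.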