

This article is part of the supplement: The BioCreative II - Critical Assessment for Information Extraction in Biology Challenge.

Open Access Research

Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Alaa Abi-Haidar [1,2], Jasleen Kaur [1], Ana Maguitman [3], Predrag Radivojac [1], Andreas Rechtsteiner [4], Karin Verspoor [5], Zhiping Wang [6] and Luis M Rocha [1,2]

[1] School of Informatics, Indiana University, 107 S. Indiana Ave. Bloomington, IN 47405, USA

[2] FLAD (Fundação Luso-Americana para o Desenvolvimento) Computational Biology Collaboratorium, Instituto Gulbenkian de Ciência, Rua da Quinta Grande, 6 P-2780-156 Oeiras, Portugal

[3] Departamento de Ciencias e Ingeniería de la Computación, Universidad Nacional del Sur, Avenida Alem 1253, Bahía Blanca, Buenos Aires, Argentina

[4] Center for Genomics and Bioinformatics, Indiana University, 107 S. Indiana Ave. Bloomington, IN 47405, USA

[5] Modeling, Algorithms and Informatics Group, Los Alamos National Laboratory, 1350 Central, MS C330 Los Alamos, NM 87545, USA

[6] Biostatistics, School of Medicine, Indiana University, 107 S. Indiana Ave. Bloomington, IN 47405, USA


Genome Biology 2008, 9(Suppl 2):S11. doi:10.1186/gb-2008-9-s2-s11

Published: 1 September 2008

Abstract

Background:

We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.

Results:

Our approach to the abstract classification task (IAS) was among the top submissions for this task on the performance measures used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks achieved one of the highest recall rates and one of the highest mean reciprocal ranks of correct passages.
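For readers unfamiliar with the measures named above, the following is a minimal sketch of their standard textbook definitions. It is not the challenge's official scoring code: the function names, the choice of beta = 1 for the F-score, and the toy inputs are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_score(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1 is an assumption)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_reciprocal_rank(first_correct_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank of the first correct passage per query.
    A query with no correct passage (None) contributes 0."""
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)


if __name__ == "__main__":
    # Toy labels and scores, invented purely for illustration.
    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 0, 0, 1, 1]
    scores = [0.9, 0.2, 0.4, 0.8, 0.5]
    print(accuracy(y_true, y_pred))
    print(f_score(y_true, y_pred))
    print(auc(y_true, scores))
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

The first three functions apply to a binary task such as abstract classification, while mean reciprocal rank applies to ranked outputs such as candidate passages, where each query is scored by the position of its first correct result.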

Conclusion:

Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, expanding word features with word proximity networks proved useful, although we discuss improvements that are still needed.

