Table 6

Sources of errors for the gene mention normalization

| n | Cause | Evidence or examples |
|---|---|---|
| | False negatives | Evidence from abstract/closest lexicon entry |
| 24 | Polluting tokens | spectrin betaIV/spectrin beta non-erythrocytic |
| 35 | Unrecognized variations (orthographic, lexical, structural, morphological) | DCoHm/DCOHM; prothrombin/thrombin |
| 4 | Segmentation of name failed | hOBP (IIb)/hOBPIIb |
| 2 | Syntactically unrelated | polycomblike/PHD finger protein |
| 66 | Removed by filtering step | |
| | False positives | Examples, with EntrezGene ID |
| 30 | Triggered by wrong name boundary | type II IL-1 receptor |
| 30 | Context filtering (reference to cell etc.) | CD4+ |
| 22 | TF*IDF filter | five EGF-like domains; ARC complex |
| 11 | Disambiguation picked wrong gene | Nup358 (440872 instead of 5903) |
| 8 | Abbreviation resolution failed | Wolf-Hirschhorn syndrome (WHS) |
| 4 | Wrong species | Notch1 (...) murine tissues |
| 2 | Overlap of names not recognized | |
| 2 | NER missed correct ID | TR2 (8740 instead of 10587) |
| 26 | Multiple identifiers for one name | |
| 40 | Other | |

Analysis of errors that occurred during gene identification, false negatives and false positives, and examples of errors. Words in italics are the parts recognized in longer compound names. NER, named entity recognition.

Hakenberg et al. Genome Biology 2008 9(Suppl 2):S14   doi:10.1186/gb-2008-9-s2-s14
