Table 6. Sources of errors for the gene mention normalization

| n | Cause | Evidence or examples |
|---|---|---|
| | False negatives | Evidence from abstract/closest lexicon entry |
| 24 | Polluting tokens | spectrin betaIV/spectrin beta non-erythrocytic |
| 35 | Unrecognized variations (orthographic, lexical, structural, morphological) | DCoHm/DCOHM; prothrombin/thrombin |
| 4 | Segmentation of name failed | hOBP (IIb)/hOBPIIb |
| 2 | Syntactically unrelated | polycomblike/PHD finger protein |
| 66 | Removed by filtering step | |
| | False positives | Examples, with EntrezGene ID |
| 30 | Triggered by wrong name boundary | type II IL-1 receptor |
| 30 | Context filtering (reference to cell etc.) | CD4+ |
| 22 | TF*IDF filter | five EGF-like domains; ARC complex |
| 11 | Disambiguation picked wrong gene | Nup358 (440872 instead of 5903) |
| 8 | Abbreviation resolution failed | Wolf-Hirschhorn syndrome (WHS) |
| 4 | Wrong species | Notch1 (...) murine tissues |
| 2 | Overlap of names not recognized | |
| 2 | NER missed correct ID | TR2 (8740 instead of 10587) |
| 26 | Multiple identifiers for one name | |
| 40 | Other | |

Analysis of errors that occurred during gene identification (false negatives and false positives), with examples of each error type. Words in italics are the parts recognized in longer compound names. NER, named entity recognition.

Hakenberg et al. Genome Biology 2008 9(Suppl 2):S14 doi:10.1186/gb-2008-9-s2-s14