Table 4

GN results: performance impact of the seven heuristics used to normalize gene names on the development data.

| Rule | Heuristic | Example | Precision (P) | Recall (R) | F-measure (F) |
|---|---|---|---|---|---|
| 0 | (none) | | 0.783 | 0.469 | 0.586 |
| 1 | Substitution: Roman numerals → Arabic numerals | carbonic anhydrase XI to carbonic anhydrase 11 | 0.778 | 0.492 | 0.603 |
| 2 | Substitution: Greek letters → single letters | AP-2alpha to AP-2a | 0.779 | 0.497 | 0.607 |
| 3 | Normalization of case | CAMK2A to camk2a | 0.787 | 0.619 | 0.693 |
| 4 | Removal: parenthesized materials | sialyltransferase 1 (beta-galactoside alpha-2,6-sialyltransferase) to sialyltransferase 1 | 0.782 | 0.623 | 0.694 |
| 5 | Removal: punctuation | VLA-2 to VLA2 | 0.768 | 0.667 | 0.714 |
| 6 | Removal: spaces | calcineurin B to calcineurinB | 0.784 | 0.742 | 0.762 |
| 7 | Removal: strings < 2 characters | P | 0.827 | 0.727 | 0.774 |

The seven heuristics were used to normalize gene names both during lexicon construction and during processing of the gene tagger output. Performance on the development data is reported after each successive step was applied. GN, gene normalization.
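The seven heuristics in Table 4 can be read as a small string-processing pipeline applied in order. The following is a minimal sketch of that reading, not the authors' implementation: the function names, the Greek-letter mapping, and the regular expression used to spot Roman numerals (only I, V and X are handled) are all assumptions made for illustration.

```python
import re
import string
from typing import Optional

# Assumed mapping for rule 2; the source does not list the letters covered.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g",
         "delta": "d", "epsilon": "e", "kappa": "k"}
_GREEK_RE = re.compile("|".join(GREEK), re.IGNORECASE)

# Assumption for rule 1: a Roman numeral is an upper-case token of I, V, X.
_ROMAN_RE = re.compile(r"\b[IVX]+\b")
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}


def roman_to_int(token: str) -> int:
    """Convert a small Roman numeral (I, V, X only) to an integer."""
    total = 0
    for ch, nxt in zip(token, token[1:] + " "):
        value = _ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one.
        total += -value if _ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def normalize(name: str) -> Optional[str]:
    """Apply rules 1-7 in order; return None if the name is discarded."""
    # Rule 1: Roman numerals to Arabic numerals.
    name = _ROMAN_RE.sub(lambda m: str(roman_to_int(m.group(0))), name)
    # Rule 2: Greek letter names to single letters.
    name = _GREEK_RE.sub(lambda m: GREEK[m.group(0).lower()], name)
    # Rule 3: case normalization.
    name = name.lower()
    # Rule 4: remove parenthesized material.
    name = re.sub(r"\([^)]*\)", "", name)
    # Rule 5: remove punctuation.
    name = name.translate(str.maketrans("", "", string.punctuation))
    # Rule 6: remove whitespace.
    name = re.sub(r"\s+", "", name)
    # Rule 7: discard strings shorter than 2 characters.
    return name if len(name) >= 2 else None


if __name__ == "__main__":
    for example in ["carbonic anhydrase XI", "AP-2alpha", "VLA-2",
                    "calcineurin B", "P"]:
        print(repr(example), "->", repr(normalize(example)))
```

Because the same function would be applied to both lexicon entries and tagger output, a lookup reduces to exact string matching on the normalized forms. Note that the examples in the table show the effect of each rule in isolation, whereas this sketch returns the result after all seven rules, so "VLA-2" becomes "vla2" rather than "VLA2".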

Baumgartner et al. Genome Biology 2008 9(Suppl 2):S9   doi:10.1186/gb-2008-9-s2-s9
