Table 1

The 44 selected sequences within the ENCODE region



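| Sequence set | Manual picks | Random picks, low mouse homology | Random picks, medium mouse homology | Random picks, high mouse homology | Gene density |
|---|---|---|---|---|---|
| Training | ENm006 | ENr132 | ENr231, ENr232 | ENr333, ENr334 | High |
| Training | ENm004 | - | ENr222, ENr223 | ENr323, ENr324 | Medium |
| Training | - | ENr111, ENr114 | - | - | Low |
| Test | ENm002, ENm005, ENm007, ENm008, ENm009, ENm010, ENm011 | ENr131, ENr133 | ENr233 | ENr331, ENr332 | High |
| Test | ENm001, ENm003, ENm012, ENm013, ENm014 | ENr121, ENr122, ENr123 | ENr221 | ENr321, ENr322 | Medium |
| Test | - | ENr112, ENr113 | ENr211, ENr212, ENr213 | ENr311, ENr312, ENr313 | Low |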

ENCODE sequences were assigned to either the training or the test set based on the availability of annotation data (see the section 'The EGASP experiment'). Only the test set sequences were used for the performance evaluation. The three digits in the name of each randomly picked sequence correspond to the non-exonic conservation with the mouse genome, the density of previously identified genes, and the sequence number, respectively; the first two digits range from 1 (low) to 3 (high). Manually selected sequences range in size from 500 kbp to 2 Mbp, while random regions are 500 kbp. The selection and stratification criteria for all the sequences are described at the ENCODE project web site [34].
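The naming scheme described in the legend can be illustrated with a short Python sketch. It assumes the digit order exactly as stated above (mouse homology, then gene density, then sequence number); the names `RandomPick` and `decode_random_pick` are hypothetical and are not part of any ENCODE or EGASP software.

```python
import re
from typing import NamedTuple

# Stratification levels as given in the table legend: 1 (low) to 3 (high).
LEVELS = {1: "low", 2: "medium", 3: "high"}


class RandomPick(NamedTuple):
    """Decoded fields of a randomly picked sequence name (hypothetical type)."""
    mouse_homology: str
    gene_density: str
    sequence_number: int


# "ENr" followed by three digits; the first two must be stratification levels.
_NAME = re.compile(r"ENr([1-3])([1-3])(\d)")


def decode_random_pick(name: str) -> RandomPick:
    """Split a name such as 'ENr132' into its three coded fields.

    Assumes the digit order stated in the legend: mouse homology,
    gene density, sequence number.
    """
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a randomly picked sequence name: {name!r}")
    homology, density, number = (int(digit) for digit in match.groups())
    return RandomPick(LEVELS[homology], LEVELS[density], number)


if __name__ == "__main__":
    print(decode_random_pick("ENr132"))
```

For example, `decode_random_pick("ENr132")` yields low mouse homology, high gene density, and sequence number 2, which matches the position of ENr132 in the table. Manually picked names such as ENm006 carry no such code, so the sketch rejects them with a `ValueError`.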

Guigó et al. Genome Biology 2006 7(Suppl 1):S2   doi:10.1186/gb-2006-7-s1-s2