Table 4

Steps in the LeFE algorithm

Step

Details


A

The gene expression matrix, signature vector, and gene categories (not shown in Figure 1) are entered

B

For each category Ei, the genes in the microarray (X) are split into those in Ei and those not in Ei (denoted, respectively, by two blue and 12 green rows of the gene expression matrix)

C

A negative control set consisting of C × ni genes in X but not in Ei is selected at random. Those genes are noted as elevated rows in Figure 1, which was arbitrarily drawn for C = 3. The integer constant C is used to mitigate issues of statistical imprecision associated with small categories. The default value C = 6 creates better run-to-run reproducibility and was used in the actual LeFE calculations

D

A composite matrix of gene expression Xiter is assembled by selecting the rows in X that correspond to the ni genes in Ei along with the negative control genes selected in step C. The resulting data structure, Xiter, has dimension (C × ni) + ni rows by ns columns

E

A random forest with 400 trees is built on Xiter to model the signature vector, using the default value for the random forest parameter mTry. The random forest is given no information for distinguishing between category genes and negative control genes

F

A vector I of standardized importance scores of each gene in Xiter is computed internally from the random forest. I is then divided into two sets of importance scores, IE and InotE, for the genes in Ei and the negative control set, respectively

G

The statistical significance of the gene expression evidence for rejection of the null hypothesis that the mean of IE is less than or equal to the mean of InotE (that the category genes are, on average, more important to the Random Forest model than the negative control genes) is determined by a one-sided permutation t-test. Because that test compares observed t-statistics with a null distribution of t-statistics of permuted data, it avoids using the parameterized t-distribution and is therefore nonparametric. For statistical robustness, steps C to G are repeated nr times with different, randomly selected sets of negative control genes. The covariate structures of the E and notE genes are likely to differ, but the negative control genes, selected at random from notE, are unbiased with respect to the ranking of categories in the next step and with respect to the calculation of false discovery rates

H

Applying the above procedure to a single gene category creates three outputs. The first (denoted i, in Figure 1) is a median importance score for each of the ni genes in the category. Because the median importance scores are computed within the category's multivariate Random Forest models, they reflect the genes' importance within its complex biological context. The second output (denoted ii) is the entire category's median P value from the nr permutation t-tests. The third output (denoted iii) is an importance plot, which compares the distributions of importance scores of the genes in the category and the negative control genes


Shown are the steps in the Learner of Functional Enrichment (LeFE) algorithm, with the letters in the left-hand column corresponding to those in Figure 1.

Eichler et al. Genome Biology 2007 8:R187   doi:10.1186/gb-2007-8-9-r187

Open Data