Email updates

Keep up to date with the latest news and content from Genome Biology and BioMed Central.

Open Access Method

Identification of co-regulated transcripts affecting male body size in Drosophila

Cynthia J Coffman12, Marta L Wayne3, Sergey V Nuzhdin4, Laura A Higgins3 and Lauren M McIntyre25*

Author Affiliations

1 Health Services Research and Development Biostatistics Unit, Durham VA Medical Center (152), Durham, NC 27705, USA

2 Duke University Medical Center, Department of Biostatistics and Bioinformatics, Durham, NC 27710, USA

3 Department of Zoology, University of Florida, Gainesville, FL 32611, USA

4 Department Ecology and Evolution, University of California at Davis, Davis, CA 95616, USA

5 Department of Agronomy, Purdue University, West Lafayette, IN 47907, USA

For all author emails, please log on.

Genome Biology 2005, 6:R53  doi:10.1186/gb-2005-6-6-r53


The electronic version of this article is the complete one and can be found online at: http://genomebiology.com/2005/6/6/R53


Received:20 January 2005
Revisions received:21 February 2005
Accepted:9 May 2005
Published:1 June 2005

© 2005 Coffman et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Factor analysis is an analytic approach that describes the covariation among a set of genes through the estimation of 'factors', which may be, for example, transcription factors, microRNAs (miRNAs), and so on, by which the genes are co-regulated. Factor analysis gives a direct mechanism by which to relate gene networks to complex traits. Using simulated data, we found that factor analysis clearly identifies the number and structure of factors and outperforms hierarchical cluster analysis. Noise genes, genes that are not correlated with any factor, can be distinguished even when factor structure is complex. Applied to body size in Drosophila simulans, an evolutionarily important complex trait, a factor was directly associated with body size.

Background

Unraveling complex traits requires an understanding of how genetic variation results in variation among transcript levels, proteins, and metabolites, and how this variation generates phenotypic variation. These distinct levels in the biological system are interdependent. The ability to model interactions among loci at each of these levels, and relationships between levels, is key to providing insight into complex traits. The promise of genomic and proteomic technology is in capturing variation for thousands of loci simultaneously. This affords an unprecedented opportunity to understand the consequences of genetic variation. Many studies have exploited this ability through the use of mutant analysis applied to whole-genome transcript arrays. Mutant analysis provides insight into the impact of a mutation on a gene network and whole-genome studies of transcription have revealed misexpression due to gene knockouts and have established redundancy and specificity of transcriptional regulation [1]. Cluster analysis has been successfully combined with tests of differential expression to study whole-genome response to mutation in order to develop hypotheses about co-regulation and coordinated expression [2,3].

However, the consequences of such strong perturbations are difficult to apply to pathways in non-mutant individuals. In addition, the mutations chosen usually cause a severe alteration in a single gene, such as a knockout. Natural variants introduce smaller changes in pathways [4] and natural variants may exhibit allelic differences at several loci. Natural variation in the transcriptome as a consequence of genetic variation has been demonstrated [5,6]. Natural genotypes can also be mated in a deliberate manner and the progeny of such matings can be used to estimate the genetic architecture of individual traits [7,8], and to link traits across different levels of the biological system [9,10]. We focus here on providing insight into how coordinated gene expression affects phenotype. Links between transcript abundance and phenotypic variation have been established [11-15]. What is needed now is an analytic approach that allows interpretation of the relationships among transcript levels and modeling of the link between transcript level and complex trait.

Factor analysis is an analytic approach that describes the covariation among a set of genes through the estimation of factors. One may interpret the factor as the mechanism, for example transcription factors, microRNAs (miRNAs), and so on, by which genes are co-regulated. The resulting factor model represents sets of coordinately expressed genes. Genes may participate in multiple factors. Principal components analysis, spectral map analysis and correspondence analysis are alternative multivariate techniques for microarray analysis [16,17] that can all be used in this capacity. Factor analysis, however, provides a convenient representation of the gene network by describing each gene's association with the factor as a load (between -1 and 1), where the strength of the load indicates how much influence the transcript level of that gene has upon the factor. The factor can then be examined for associations with complex traits [18]. Factor analysis is the extension of Sewell Wright's work on the correspondence among traits [19], and as such is perfectly suited for modeling the relationships among transcript levels for a set of crosses. The high dimensionality of genome-wide expression data presents special challenges. This challenge, primarily the ill-conditioned matrices resulting from such studies, has been well described and explicitly acknowledged in much of the literature on the analysis of gene-expression data [20-22]. If thousands of genes belonging to dozens of networks are simultaneously considered as current theory indicates, spurious associations may emerge and/or true associations may be obscured [23,24]. Previous applications of factor analyses to array data [25,26] dealt with this issue by an initial reduction of dimensionality through the use of cluster analysis.

Using simulation studies, we evaluate the utility of factor analysis for identifying covariation in gene-expression data and identifying underlying factors. We compare the performance of factor analysis to hierarchical clustering and tight clustering [27]. We then test the estimation of factors on a set of Drosophila lines for genes involved in the immune pathway. The immune system provides a relatively well understood set of interactions and as such allows a real data check on the applicability of factor analysis to microarray data.

A logical next step is to use factor analysis to relate variation among transcript levels to phenotypic variation, a step not possible in a cluster analysis. For Drosophila, body size is a complex trait where latitudinal clines in body size have been repeatedly demonstrated across ectotherms [28]. In D. subobscura, a body-size cline evolved in 12 decades, thus ranking body size in flies among the fastest-evolving morphological traits ever observed in natural populations [29]. The proximate reasons for these clines are complex, especially given that body size in flies is positively correlated with mating success in males [30-32]. Of further interest are data suggesting that the same genomic regions are involved in adaptation in two of these clines, South America and Australia [33]. In contrast to the immune system, there is little a priori information on how the candidates genes are related to one another. In addition, identification of factors associated with variation in body size in natural populations of Drosophila is a question of great evolutionary interest.

Results

Simulations

In the initial scenario, a sample size of 100 individuals was examined. This sample size is large for a microarray experiment, but is in the low range of the minimum sample size suggested in factor analysis methodology [34]. We simulated a high degree of correlation among genes within a factor (ρ = 0.80), three factors with a manageable number of genes associated within each factor (correlated genes: 30), and some genes not associated with any factor (noise genes: 100). We assume genetically variable lines for which differences in transcript abundance among lines was moderate within each of the three factors [35]. Factor analysis on these data was performed. Factors were identified by examining the eigenvalues of the correlation matrix [23]. The first five eigenvalues were 25.3, 21.3, 18.3, 4.3, and 4.1. The substantial drop between the third and fourth eigenvalues (from 18.3 to 4.3) indicates that three factors (the number simulated) are clearly identified, explaining 34% of the variation. We then set the number of factors in the analysis to three, and estimated factor loadings in order to examine the structure of the factor. All (100%, n = 90) of the correlated genes loaded [36] on the correct factor, with none of the noise genes loading on any factor (see Table 1). Reducing the correlation among factors, and reducing the effect size do not affect the ability of factor analysis to identify the correct underlying structure (Table 1).

Table 1. Gene expression simulations

Results of a hierarchical cluster analysis found that the three groups of genes clustered together with the noise genes which formed two distinct clusters. However, discriminating the true clusters from the noise clusters was not obvious using standard approaches. Tight clustering [27], where a resampling strategy is used to separate noise genes from signal, on these data was interesting. If the number of clusters is set to the true value of three, all 190 genes are identified as noise. If the number of clusters is set to five, 45 of the 100 noise genes are correctly identified as noise. All of the correlated genes are placed into the correct clusters. The remaining 55 noise genes are placed into clusters.

For the case with lower effect size and lower correlation, the dendrogram resulting from hierarchical cluster analysis is given in Figure 1. As in the factor analysis, the three groups of genes clustered together well, although not perfectly. Once again, however, statistics for determining the appropriate number of clusters did not clearly identify the correct number of clusters. The noise genes also seem to follow discernible clustering patterns. In tight clustering, when the correct number of clusters are specified and the number of extra clusters (k0) is set to 6-7, 23 of the 90 correlated genes are identified as noise and all of the noise genes are correctly identified. Setting the number of clusters higher results in clusters of noise genes. In this simple case, factor analysis clearly outperforms both traditional hierarchical cluster analysis and tight clustering, as it is easily able to discern the correct number of underlying clusters.

thumbnailFigure 1. Hierarchical cluster plot of simulation with two genotypes, 100 noise genes, and three factors. ρ = 0.40, effect size = 0.2, 0.4, and 0.6. Blue, noise genes; green, genes from underlying factor 1; red, genes from underlying factor 2; black, genes from underlying factor 3.

We then increased the number of genes from a total of 190 (90 in the three networks and 100 noise genes) to 1,900 (900 in the three factors and 1,000 noise genes). Using factor analysis, we easily identified the correct number of factors and 100% of the genes in each factor loaded on the correct factor (see Table 1). Lowering the correlation among genes in a factor to ρ = 0.4 resulted in the reduction of the explanatory power of the factor analysis. The number of underlying factors was correctly identified although, as expected, the total variation explained by the factors was reduced. Of the correlated genes, 66% loaded on the correct factors and only one noise gene (out of 1,000) was mistakenly placed into a factor. Given the reasonable fractions identified when the number of genes in factors differs by an order of 10 (190 versus 1,900), and the fact that our recovery of the structure was virtually unchanged, it is apparent that the number of genes in a factor does not impact on the ability of factor analysis to recover the factor structure. In contrast, hierarchical cluster analysis performs less well as the number of noise genes increases, with the noise genes increasing in their dispersion among clusters.

In a set of simulations to match our Drosophila experimental design, 10 genotypes with three replicates per genotype for a total of 30 samples (chips) were simulated. Averaging transcript abundance within each genotype removed uninteresting variation and increased resolution (data not shown). We began with three factors of 30 genes each, and 100 noise genes. The maximum difference in transcript abundance among genotypes was large (1, 2 and 3). In this case, the three factors were not clearly identified. To determine whether genes would be correctly placed inside factors, the number of factors was set to three. When difference in transcript abundance among genotypes was lowest, fewer genes were placed in the correct factor (13/30). On the other hand, where difference in transcript abundance among genotypes was highest and 100% of the correlated genes loaded on the correct factor, 29 of the genes from the other two factors and 29 noise genes were erroneously identified as a part of this factor. When correlation among genes within a factor was reduced to ρ = 0.40, the difficulty in identifying genes on the correct factor increased.

We hypothesized that the presence of many noise genes, coincident with a sample size of ten, was responsible for the difficulty in identifying the correct number of gene factors. Without the noise genes, when correlation was ρ = 0.8, each of the underlying factors was clearly identified. All (100%) of the genes were correctly placed in the corresponding factors, although several genes were placed in multiple factors. When correlation among genes in a factor was reduced to ρ = 0.40, the number of factors was underestimated and many of the genes were identified in multiple factors. This indicates a complex interplay between the difference in transcript abundance among genotypes, the sample size, the number of noise genes and the correlation structure in the identification of factors.

We also examined the impact of increasing the number of factors beyond the number of genotypes. We simulated a set of 600 genes belonging to 20 factors, with 30 genes in each factor and 100 noise genes. In an analysis with nine factors (the maximum estimable with ten genotypes), factor 1 seemed to capture the majority of all the genes in the 20 factors. When effect size was large, the majority of the noise genes (58%) were identified by their failure to load highly (greater than 0.7) on any factor, and the majority of genes that were correlated did load highly on at least one factor (96%). Lowering the correlation lowers the ability to identify correlated genes as loading highly. In this case, 30% are identified and approximately the same fraction of noise genes (60%) are identified. Consistent with previous factor analysis theory [23,36], when the number of factors is larger than the sample size the number and composition of the factors cannot be estimated.

We then wanted to determine whether hierarchical cluster analysis would resolve this structure more clearly. We plotted the results in a dendrogram where each of the simulated factors are plotted with a separate color (Figure 2). No clear pattern of clustering was found. However, the clusters do form some 'kernels', so that biological knowledge of pathways could potentially be applied to interpret some of the groupings, as is common practice. However, the clear presence of noise genes throughout the cluster structure clearly makes interpretation difficult in cases where the true structure is unknown.

thumbnailFigure 2. Hierarchical cluster plot of simulation with ten genotypes, 100 noise genes, and 20 factors. ρ = 0.4, effect size = 1, 2,...20. Blue, noise genes; other colors represent genes that should cluster together.

We then applied tight clustering to these data. When the correct number of clusters is specified, tight clustering does identify the structure of the 20 clusters. Each cluster consists of a subset of genes that belong to that cluster. Notably, it does not erroneously place noise genes into clusters. It performs less well in identifying the correct number of correlated genes failing to place 50% into clusters, instead classifying these correlated genes as noise. When a larger number of clusters (25) is specified, then there are some 'extra' clusters of noise genes and the clusters of genes are themselves not as distinct, that is 40 genes of the 600 are incorrectly grouped and 221 or 37% of the genes which should be in a cluster are designated as noise.

Overall, the simulation results indicate that focusing on a manageable number of possible factors with a measurable amount of difference in transcript abundance among lines can result in successful identification of factor structure, even when the number of genes examined is large relative to the sample size. We also find that the factor loadings can distinguish noise genes even in complex cases, although the factor structure can not be resolved clearly in those cases.

Data analysis

Loci showing evidence for transcript variation

The GeneChip Drosophila Genome Array was used for this study and of the approximately 13,500 genes on the array, 7,886 showed expression on at least one array, and of these, 4,667 showed evidence for variation among genotypes. As we are studying covariation, we restricted our examination to this list of 4,667 loci (see Additional data file 1 [Supplementary Table 1]).

The immune pathway

To provide an assessment of the performance of factor analysis on a set of well characterized genes, FlyBase [37] was queried for candidate genes involved in the immune pathway [38]. We compared the candidate genes to the list of 4,667 transcripts, and 54 genes were identified. Factor analysis on these transcript levels resulted in the identification of three factors (see Figure 3). Notably, the first factor contained all the lysogen genes present in the study, and the second factor contained all cecropins (Table 2). Factor-analysis groups co-regulated genes in a manner consistent with our understanding of the immune pathway. Hierarchical cluster analysis was also performed on this set of genes. As with the factor analysis, hierarchical cluster analysis found that the lysogen genes clustered together, as did the cecropins. However, determining the appropriate number of clusters was problematic and did not lead to any clear interpretation of the appropriate number of clusters. The final factor model for the immune genes included three factors, therefore we examined a k-means clustering analysis with three clusters for comparison. We found that 17/24 (71%) of the genes that loaded high on factor 1 were on the same cluster. This cluster also includes a few genes that loaded on factors 2 and 3. Genes that loaded high on the second and third factor were distributed among the remaining two clusters. The genes in these groups that did not load significantly on any factor were distributed among the three clusters.

thumbnailFigure 3. SCREE plots. The x-axis is the ordinal number of the eigenvalue and the y-axis is the magnitude of the eigenvalue. The number to the right of the plotted point indicates the cumulative variance explained as each factor is added. The dotted line indicates the cutoff point in the SCREE plot where there is a sharp drop off in the magnitude of the eigenvalues. The number of factors above the dotted line are the number retained for the factor analysis. (a) Body size, 92 genes; four factors are selected. (b) Immune, 53 genes; three factors are selected [36].

Table 2. Factor analysis for candidate genes for immune function

Candidate loci for body size

We again queried FlyBase for a list of genes involved in body size determination and found 92 body size candidates in our list of 4,667 transcripts. Four factors were identified (Figure 3). The identification of covarying genes in the same factor was intriguing. Of particular note is the presence of loci that are contained within quantitative trait loci (QTL) for body size on factor 1: Cdk4, trx, akt1, fru, Dr, mask, khc, and InR [33].

In our analysis, we regressed transcript level for individual candidate genes on the body size phenotype. We found 2,892 genes significant at a nominal level of 0.05 (false discovery rate (FDR), 16%, see Table 3). At a more stringent nominal threshold of 0.01, there were 14 candidate genes for body size which showed significant association between transcript abundance and phenotypic variation for male body size (FDR of 7%). Of the QTL candidates, only InR showed significant association with the phenotype.

Table 3. Factor analysis for candidate genes for body size

We then tested the hypothesis that the estimated factors were directly related to phenotypic variation in body size. In our analysis, we regressed the estimated factor (latent variable) on the phenotype body size for each genotype. The regression of factor 1 on body size showed evidence of an association between the factor and the phenotype of body size (P = 0.04, Figure 4a).

thumbnailFigure 4. Regression plots. (a) A plot of factor 1 from candidate genes for body size on the x-axis and measured male body size on the y-axis. The solid line is the regression of factor 1 on measured male body size with an estimated slope of -0.021y with a standard deviation of 0.008 and is significantly different from zero (P = 0.04). Line crosses: open square 1136; open circle (top left), 611; open triangle 3743; plus sign, 4361; multiplication sign, 6177; open diamond, g785; inverted open triangle 8599; star 99105; solid circle, 1056; open circle (middle), 3637. (b) A plot of factor 3 from candidate genes for miRNA on the x-axis and measured male body size on the y-axis. The solid line is the regression of factor 3 on measured male body size with an estimated slope of -0.020y with a standard deviation of 0.008 and is significantly different from zero (P = 0.04). Symbols as in (a).

Targets of miRNAs

Of 535 putative miRNA targets [39], 203 were contained in our set of 4,667 gene transcripts. Factor analysis resulted in the identification of four gene factors (Table 4). The second factor contained four of the same genes as factor 1 for body size (puc, Eh, mys, bon) with 76 additional genes contained in this factor (loaded at 0.40 or greater). However, this factor was not associated with body size (P = 0.55). While some of the QTL candidates are also putative targets of miRNA regulation, (Cdk4, trx, Dr) these genes did not participate in this factor but were common to the factor identified by the third factor (see Table 4), for which 69 additional genes loaded. This third factor was negatively correlated with body size (P = 0.04, see Figure 4b). (Cdk4, trx, Dr) were not associated with body size in regressions between these individual genes and body size.

Table 4. Factor analysis for putative targets of miRNAs

Discussion

We applied factor analysis to high-dimensional microarray data. Using simulated data to estimate factors, we found that when correlation among genes is strong, the number of factors and their structure can be estimated, even in the case where genes unrelated to the factor structure (noise genes) are included. We also found that when noise genes were included hierarchical cluster analysis was unable to separate the noise genes from the signal, or to correctly identify the number of clusters. When the number of genes is large relative to the sample size, as is common in array studies, the number of factors and the genes belonging to each factor can still be identified, as long as the number of factors is less than the sample size. In contrast, hierarchical cluster analysis did not identify the number of clusters even when the number of clusters was smaller than the sample size.

We found that if the number of factors is larger than the number of genotypes, the majority (58%) of noise genes still do not load on any factor, while all but 26 of the 600 correlated genes do load on at least one factor. However, the correct association between individual genes and factors is lost. We conclude that while factor analysis is effective at separating the signal from the noise, the structure of the signal is not estimable. This is consistent with reports in the literature [23,36]. Using hierarchical clustering, the number and the structure of clusters is not recovered and noise genes are scattered throughout the cluster structure. Kernels of tightly correlated genes were visible, however, indicating that kernel identification is possible in cases where biological knowledge is present. Any separation of signal from noise is purely serendipitous.

These results are not unexpected as the mathematical properties of gene-expression data or the high dimensionality of the data lead to problems for any analysis. When the number of columns (in this case observations) is less than the number of rows (or variables), the matrix is considered ill conditioned [40]. Ill-conditioned matrices can cause problems for many types of statistical analysis and can lead to overfitting of predictive models, among other problems [41,42]. Some of the effects of the ill-conditioned matrices can be mitigated; however, the problem of an overdetermined system will always exist. An example is in multiple regression, where models with more variables than we have data can not be fit [43].

Tight clustering [27] represents a significant advance over hierarchical clustering in the estimation of cluster structure for microarray data. It provides a reasonable way of identifying most noise genes. In simple cases, however, the algorithm needs to have some flexibility when specifying starting values; that is, more rather than fewer clusters improve the chances of correctly identifying true clusters. The algorithm requires that the number of clusters be specified a priori. In contrast, in many simple cases the number and structure of factors can be recovered precisely using a factor analysis. In the most complex case examined (20 factors, 600 genes and 100 noise genes), when the number of clusters is correctly specified, tight clustering identifies 100% of the noise genes and the 20 clusters with correct genes within each cluster. However, it also incorrectly identifies 50% of genes with signal as noise genes. If too many clusters are specified (25), then the number of genes identified in clusters increases to 63%, although the correct structure is no longer maintained and 20% of the noise genes are incorrectly clustered. In contrast, factor analysis separates the signal from the noise, correctly identifying 96% of the noise genes as noise and 58% of the genes as having signal. In this complex case it is difficult to say whether factor analysis or tight clustering is 'better'. Both of these approaches clearly outperform hierarchical clustering, and they are complementary in their approach. If the primary goal is to separate the correlated genes from the uncorrelated genes then factor analysis performs better than tight clustering. If determining the number and structure of factors (coexpressed genes) is the goal, using both approaches and comparing findings will be reasonable.

Clear network structures can be identified with factor analysis and can then be followed up experimentally. Furthermore, unlike cluster analysis (hierarchical or tight), which provides no summary of the clusters into a single variable, the factor loading values are directly interpretable as the degree of participation of that locus in a factor. In contrast to cluster analysis, the factor loadings for genes give us information on the relative strength of a gene on a factor. We can use factor loadings to identify the most 'significant' genes on a factor and we can use the factor loadings to remove genes that are not contributing to any factor. As such, genes that all load highly on the same factor can be said to be coordinately expressed and putatively co-regulated, and noise genes can be identified as they will not load high on any factor. Unlike clustering, factor analysis allows genes to participate in several factors, thus reflecting biological reality more accurately.

Extensions of factor analysis have been developed and include allowing for nonlinearity [44,45] and the application of Bayesian approaches [46]. Factor analysis is itself a subclass of modeling techniques known as structural equation models [47] and the ongoing theoretical interest in these approaches allows for expansion of consideration to more than the present circumstance.

We focus our Drosophila analyses on a set of genotypes that are mated according to a round-robin design where the parental lines are natural variants. Analyses of such lines allows inferences about the underlying genetic contribution to variation and inferences about variation in pathways that occurs as a result of such natural genetic variation. By using candidate loci from reverse genetic and mutational projects, the importance of these loci in a broad context can be assessed. The factor analysis for the list of genes annotated as body size candidates resulted in the estimation of four factors. Factor 1 is of great interest as several of the genes that load on this factor - Cdk4, trx, akt1, fru, Dr, mask, woc, Khc, and InR - are contained in QTL for body size [33]. This overlap is exciting as the data from this QTL analyses are independent of our factor analysis. The identification of the same set of loci lends weight to the evidence that these loci are involved with the formation of body size in a natural population. Of these loci, only InR is directly correlated with body size. Given our limited knowledge of pathways for body size, it was exciting to note that two of the genes in this factor - trx and M(2)21AB - have been shown to interact [48]. In addition, the early sex-determination cascade genes Sxl, tra, and tra2 are all associated with a different factor (factor 2).

If the expression of multiple genes is regulated by a common underlying factor such as a transcription factor or a miRNA, and this regulation exhibits genetic variation, then we expect that the gene expression among these genes in a set of genotypes from a mating design will also be correlated. Factor analysis on putative targets of miRNA control revealed that putative targets of the same miRNA often occurred on the same factor. The number of targets for each miRNA was small, and so we cannot conclude that the coexpression is greater than we would expect by chance. Nevertheless, this is the first opportunity to examine the correlation structure of these putative targets of miRNAs. We found that of the 203 genes identified as miRNA targets, there was evidence for 188 participating in at least one of the four factors. While four genes on factor 1 for body size (puc, Eh, mys, and bon), and nine additional candidate body size genes (Sh, Abd-B, trx, fng, qm, woc, Dr, and Cdk4) are putative targets of miRNAs [39], the resulting miRNA factors are uncorrelated with the factors for body size. One of the miRNA factors is associated with the body size phenotype.

In summary, factor analysis, a technique developed to discover and model underlying mechanisms in complex social and psychiatric situations, seems to offer a reasonable middle ground for gaining understanding of coordinated gene expression. Our simulations show that when the number of underlying factors is larger than the sample size, it is not straightforward to recover the structure of the simulated data, although signal can be separated from noise. In contrast, target groups, even when the number of genes is large, can be used to identify several underlying regulatory mechanisms. In the case of body size for Drosophila, factor analysis offers an exciting opportunity to estimate gene networks, as relatively little is known about how genes involved in body size work together.

Materials and methods

Simulations

In trying to estimate the structure of gene networks, one of the first questions to be addressed is whether the number of genes contributing to each network affects the ability of the analysis to determine the structure of the network. Accordingly, we varied the number of genes examined in our simulations to explore this process (see Table 1). Gene networks were simulated as a set of correlated expression values. These networks represent genes affected by some common underlying biological process and the correlation structure that results from this biological connectedness is the concrete evidence of the network. In our simulations, for each network, a mean from a gamma distribution (γ) is drawn. A gamma distribution was chosen for the mean values across genes as this distribution has been shown to fit the distribution of transcript levels seen from genes on an array [49]. Genes in the network were simulated according to a multivariate normal distribution with correlation among genes indicated by the parameter (ρ). We looked at two levels of correlation ρ = 0.40 (weak) and ρ = 0.80 (strong). Genetic variation for individual genes within a network was simulated as a linear combination of the multivariate normal mean, a fixed genotypic effect and random noise. Genotypic effects for different lines differed in magnitude. The standard deviation was set to 1, and the standard normal was used, so that differences between lines can be directly interpreted as the size of the effect, that is the effect sizes are absolute and not relative to the magnitude of the gene expression (or amount of variation) for a particular gene. When the maximum difference among genotypes is less than 0.2 the effects are small, when differences are between 0.2 and 0.6 the effects are medium and when effects are larger than 0.8 they are relatively large [35]. Between networks the maximum difference of transcript abundance (effect size) among lines was allowed to vary.

Genes not participating in any network, were also simulated. These genes, uncorrelated to each other, are hereafter referred to as noise genes. For noise genes, the mean value for gene expression was drawn from a gamma distribution and variation from that mean was random and without regard to the genotype.

Concurrently, factor analysis on the data was performed and oblique rotations were used for estimating factor loadings [36]. We used the eigenvalues to determine the number of factors according to standard factor analytic approaches [36]. The plot of the eigenvalues, against the ordinal number for the eigenvalue (SCREE plot) is examined for the last substantial drop in the magnitude of the eigenvalues and a model with the same number of factors as the number of eigenvalues before the last substantial drop are retained [23,36]. Hierarchical cluster analysis was performed using a Ward distance. Tight clustering, a resampling-based approach to cluster analysis, was performed using software provided by [27].

Drosophila lines

Isogenic lines of Drosophila simulans were made from flies caught in Wolfskill orchard [50] and crossed in a round-robin mating design as follows: the ten parental stocks (randomly sampled and independently derived from a large natural reference population) were crossed together in ten combinations to create heterozygous lines such that each parent was present twice, once as a dam and once as a sire. Crosses were (dam × sire): SIM6 × SIM11, SIM11 × SIM36, SIM36 × SIM 37, SIM37 × SIM43, SIM43 × SIM61, SIM61 × SIM77, SIM77 × SIM85, SIM85 × SIM99, SIM99 × SIM105, and SIM105 × SIM6. The resulting heterozygous genotypes were used for the experiment as follows: RNA from males was extracted and labeled 4-7 days post-eclosion [6,50]. Transcript level was estimated the average difference using MAS 5 [51]. Genes without positive signal on at least one array were removed from further consideration. The remaining genes were normalized to the chip median and log transformed. Genes lacking variation among genotypes cannot be meaningfully assessed for covariation. Accordingly, any gene that showed evidence (P ≤ 0.2) for variation in transcript level across lines [50] or evidence for additive genetic effect (P ≤ 0.05) [6] was considered in further analyses. Adult male flies were measured for body size [52].

Drosophila data analysis

Candidate gene lists were developed using FlyBase queries. The miRNA targets were taken from Enright and colleagues [39]. The resulting lists were matched against the set of loci for which we had evidence of genetic variation in transcript abundance among lines.

In the first step of the data analysis, regression of transcript abundance for individual genes that were candidates for body size, or putative miRNA targets, was conducted to test the hypothesis that individual genes transcript levels were associated with body size. The average transcript abundance for each gene i within genotype j (gij) was regressed on the average male body size (Yj) for each genotype j as follows:

Yj = μ + gij + εij.     (1)

The mean of all genotypes is μ and the random error is ε. Cluster analysis was performed on the set of immune genes. This was done using a hierarchical cluster analysis on the standardized values of gene expression. The results were plotted in a dendrogram.

Factor analysis on standardized values of gene expression, for each of the three lists (genes in the immune pathway, candidates for body size and putative miRNA targets) was conducted separately and factor loadings were estimated using an oblique rotation. The resulting set of eigenvalues was plotted in a SCREE plot, and the number of factors chosen such that the drop between eigenvalues was apparent, and a reasonable proportion of the variation was explained [36]. Each factor identified represents a gene network. Once the number of factors was identified, factor analysis was repeated for that fixed number of factors and loading values (the correlation between individual genes and the estimated factor structure) were estimated. All factor analyses were conducted in SAS (PROC FACTOR) using standard options, no special coding was required.

The effect of the factor upon the genotype is also estimable in the form of a factor value. Factor values were computed as a linear combination of the factor loading times the standardized value for the gene expression [18,23]. The factor value for genotype j (GNj) is an estimate of the impact of the network upon the genotype and is estimated as

where gij is the expression value for gene i within genotype j and li is the loading value estimated from the factor analysis for gene i across all genotypes. Differences in factor values represent differences between genotypes for the factor.

As all the genotypes are related in a mating design, we expect that differences in networks due to genetic variation which result in phenotypic variation should be detectable. Accordingly, the factor values were then regressed on the phenotype where

Yj = μ + GNj + εj     (2)

where GNj is the factor value for genotype j.

Additional data files

The following additional data is available with the online version of this paper. Additional data file 1 lists the median values of gene expression for each of the ten lines (after normalization) for the 4,667 genes found to have some evidence of a line effect. Some basic annotation information from Affymetrix is also provided along with the feature number.

Additional File 1. Line 1 through Line 10 medians for the 4667 genes considered in this analysis, after normalization (line1median) and standardization (line1median_st). Some basic annotation is also included as well as the Affymetrix feature number http://www.Flybase.com webcite, Fall 2004). The median values of gene expression for each of the ten lines (after normalization) are provided for the 4,667 genes found to have some evidence of a line effect. These are labeled as line1_median-line10_median. Standardized values are also provided and labeled as line1_median _st-line1-_median_st. Some basic annotation information from Affymetrix is also provided along with the feature number. The chromosome number was inferred based upon the cytogenetic map position.

Format: CSV Size: 1.6MB Download fileOpen Data

Acknowledgements

This work is supported by US Department of Agriculture-IFAFS N0014-94-1-0318 (L.M.M., C.J.C.), NIH GLUE Grant R24-GM65513 (S.V.N., M.L.W., L.M.M.), NIH-NIAID 5R01AI059111-02 (L.M.M.), the Purdue Agricultural Research Station, the University of Florida Microarray Core Facility, and the UC Davis Microarray Core Facility. We thank Hayden Bosworth (Duke University) for introducing us to factor analysis and Chien-Cheng (George) Tseng for helping us to implement tight clustering.

References

  1. Delneri D: The use of yeast mutant collections in genome profiling and large-scale functional analysis.

    Curr Genomics 2004, 5:59-65. Publisher Full Text OpenURL

  2. Singh A, McIntyre L, Sherman L: Microarray analysis of the genome-wide response to iron deficiency and iron reconstitution in the cyanobacterium Synechocystis sp. PCC68030.

    Plant Physiol 2003, 132:1825-1839. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Caldo R, Nettleton D, Wise R: Interaction-dependent gene expression in Mla-specified response to barley powdery mildew.

    Plant Cell 2004, 16:2514-2528. PubMed Abstract | Publisher Full Text OpenURL

  4. Stern D: Perspective: Evolutionary developmental biology and the problem of variation.

    Evolution 2000, 54:1079-1091. PubMed Abstract OpenURL

  5. Gibson G, Riley-Berger R, Harshman LG, Kopp A, Vacha S, Nuzhdin SV, Wayne ML: Extensive sex-specific non-additivity in gene expression in Drosophila melanogaster.

    Genetics 2004, 167:179-1799. Publisher Full Text OpenURL

  6. Wayne ML, Pan YJ, Nuzhdin S, McIntyre L: Additivity and trans-acting effects on gene expression in male Drosophila simulans.

    Genetics 2004, 168:1413-1420. PubMed Abstract | Publisher Full Text OpenURL

  7. Falconer , Mackay T: Introduction to Quantitative Genetics. Harlow, UK: Longman; 1996. OpenURL

  8. Lynch M, Walsh B: Genetics and Analysis of Quantitative Traits. Sunderland, MA: Sinauer; 1998. OpenURL

  9. Jansen R: Studying complex biological systems using multifactorial perturbation.

    Nat Rev Genet 2003, 4:145-151. PubMed Abstract | Publisher Full Text OpenURL

  10. Mackay T: The genetic architecture of quantitative traits: lessons from Drosophila.

    Curr Opin Genet Dev 2004, 14:253-257. PubMed Abstract | Publisher Full Text OpenURL

  11. Adams M, Sekelesky J: From sequence to phenotype: reverse genetics in Drosophila melanogaster.

    Nat Rev Genet 2002, 3:189-198. PubMed Abstract | Publisher Full Text OpenURL

  12. Kalidas S, Smith D: Novel genomic cDNA hybrids produce effective RNA interference in adult Drosophila.

    Neuron 2002, 33(2):177-184. PubMed Abstract | Publisher Full Text OpenURL

  13. Goto A, Blandin S, Royet J, Reichhart J, Levashina E: Silencing of Toll pathway components by direct injection of double-stranded RNA into Drosophila adult flies.

    Nucleic Acids Res 2003, 31:6619-6623. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Johnston DS: The art and design of genetic screens: Drosophila melanogaster.

    Nat Rev Genet 2002, 3:176-188. PubMed Abstract | Publisher Full Text OpenURL

  15. Hammond S, Caudy A, Hannon G: Post-transcriptional gene silencing by double-stranded RNA.

    Nat Rev Genet 2001, 2:110-119. PubMed Abstract | Publisher Full Text OpenURL

  16. Fellenberg K, Hauser NC, Brors B, Neutzner A, Hoheisel JD, Vingron M: Correspondence analysis applied to microarray data.

    Proc Natl Acad Sci USA 2001, 98:10781-10786. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Wouters L, Gohlmann H, Bijnens L, Kass S, Mohlenbergs G, Lewi P: Graphical exploration of gene expression data: a comparative study of three gene expression methods.

    Biometrics 2003, 59:1131-1139. PubMed Abstract | Publisher Full Text OpenURL

  18. Hatcher L: A Step-by-Step Approach to Using the SAS System for Factor Analysis and Structural Equation Modeling. Cary, NC: SAS; 1994. OpenURL

  19. Wright S: The method of path coefficients.

    Math Stat 1934, 5:161-215. OpenURL

  20. Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for the classification of tumors using gene expression data.

    J Am Stat Assoc 2002, 97:77-87. Publisher Full Text OpenURL

  21. Tibshirani R, Hastie T, Narasimhan B, Eisen M, Sherlock G, Brown P, Botstein D: Exploratory screening of genes and cluster from microarray experiments.

    Stat Sin 2002, 12:47-59. OpenURL

  22. Parmigiani G, Garrett E, Irizarry R, Zeger S: The analysis of gene expression data: an overview of methods and software. In The Analysis of Gene Expression Data. Edited by Parmigiani G, Garrett E, Irizarry R, Zeger S. New York, NY: Springer; 2003:1-36. OpenURL

  23. Fabrigar L, MacCallum R, Wegener D, Stahan E: Evaluating the use of exploratory factor analysis in psychological research.

    Psychol Methods 1999, 4:272-299. Publisher Full Text OpenURL

  24. Preacher K, MacCallum R: Exploratory factor analysis in behavior genetics research: factor recovery with small sample sizes.

    Behav Genet 2002, 32:153-161. PubMed Abstract | Publisher Full Text OpenURL

  25. Peterson LE: Factor analysis of cluster-specific gene expression levels from cDNA microarrays.

    Comput Methods Programs Biomed 2002, 69:179-188. PubMed Abstract | Publisher Full Text OpenURL

  26. Peterson LE: CLUSFAVOR 5.0: hierarchical cluster and principal-component analysis of microarray-based transcriptional profiles.

    Genome Biol 2002, 3:software0002.1-0002.8. BioMed Central Full Text OpenURL

  27. Tseng G, Wong W: Tight clustering: a resampling-based approach for identifying stable and tight patterns in data.

    Biometrics 2005, 61:10-16. PubMed Abstract | Publisher Full Text OpenURL

  28. Partridge L, French V: Thermal evolution of ectotherm body size: why get big in the cold. In Animals and Temperature: Phenotypic and Evolutionary Adaptation. Edited by Johnston I, Bennett A. Cambridge, UK; Cambridge University Press; 1996:265-296. OpenURL

  29. Huey RB, Gilchrist GW, Carlson ML, Berrigan D, Serra L: Rapid evolution of a geographic cline in size in an introduced fly.

    Science 2000, 287:308-309. PubMed Abstract | Publisher Full Text OpenURL

  30. Partridge L, Farquhar M: Lifetime mating success of male fruitflies Drosophila melanogaster is related to their size.

    Animal Behav 1983, 31:871-877. OpenURL

  31. Partridge L, Mackay T, Aitken S: Male mating success and fertility in Drosophila melanogaster.

    Genet Res 1985, 46:279-285. OpenURL

  32. Partridge L, Hoffmann A, Jones JS: Male size and mating success in Drosophila melanogaster and D. pseudoobscura under field conditions.

    Animal Behav 1987, 35:468-476. OpenURL

  33. Caboli F, Kennington W, Partridge L: QTL mapping reveals a striking coincidence in the positions of genomic regions associated with adaptive variation in body size in parallel clines of Drosophila melanogaster on different continents.

    Evolution Int J Org Evolution 2003, 57:2653-2658. PubMed Abstract OpenURL

  34. Gorsuch R: Factor Analysis. 2nd edition. Hillsdale, NJ: Lawrence Erlbaum Associates; 1983. OpenURL

  35. Cohen J: Statistical Power Analysis for the Behavioral Sciences. 2nd edition. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988. OpenURL

  36. Stevens J: Applied Multivariate Statistics for the Social Sciences. Lawrence Erlbaum Associates; 1996. OpenURL

  37. FlyBase [http://www.flybase.org] webcite

  38. Andersson M: Sexual Selection. Princeton, NJ: Princeton University Press; 1994. OpenURL

  39. Enright A, John B, Gaul U, Tuschl T, Sander C, Marks D: MicroRNA targets in Drosophila.

    Genome Biol 2003, 5:R1. PubMed Abstract | BioMed Central Full Text OpenURL

  40. Williams G: Linear Algebra with Applications. Boston, MA: Jones and Bartlett; 2005. OpenURL

  41. Olshen A, Jain A: Deriving quantitative conclusions from microarray expression data.

    Bioinformatics 2002, 18:961-970. PubMed Abstract | Publisher Full Text OpenURL

  42. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001. OpenURL

  43. Neter J, Wasserman W, Kutner M: Applied Linear Statistical Models. New York, NY: McGraw-Hill/Irwin; 1996. OpenURL

  44. McDonald R: Factor interaction in nonlinear factor analysis.

    Br J Math Stat Psychol 1967, 20:205-15. PubMed Abstract OpenURL

  45. Molenaar PC, Boomsma D: Application of nonlinear factor analysis to genotype-environment interaction.

    Behav Genet 1987, 17:71-80. PubMed Abstract | Publisher Full Text OpenURL

  46. Fokoue E, Titterington D: Mixtures of factor analyses. Bayesian estimation and inference by stochastic simulation.

    Machine Learning 2003, 50:73-94. Publisher Full Text OpenURL

  47. Lee S, Zhu H: Maximum likelihood estimation of nonlinear structural equation models.

    Psychometrika 2002, 67:189-210. OpenURL

  48. Gildea J, Lopez R, Shearn A: A screen for new trithorax group genes identified little imaginal discs, the Drosophila melanogaster homologue of human reinoblastoma binding protein 2.

    Genetics 2000, 156:645-663. PubMed Abstract | Publisher Full Text OpenURL

  49. Black M, Doerge R: Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments.

    Bioinformatics 2002, 18:1609-1616. PubMed Abstract | Publisher Full Text OpenURL

  50. Nuzhdin S, Wayne M, Harmon K, McIntyre L: Common patterns of evolution of gene expression level and protein sequence in Drosophila.

    Mol Biol Evol 2004, 21:1308-1317. PubMed Abstract | Publisher Full Text OpenURL

  51. Affymetrix: Affymetrix GeneChip Expression Analysis Technical Manual. Santa Clara, CA: Affymetrix; 2000. OpenURL

  52. Wayne M, Hackett J, Mackay T: Quantitative genetics of ovariole number in Drosophila melanogaster. I. Segregating variation.

    Evolution 1997, 51:1156-1163. OpenURL