As a service to the research community, Genome Biology used to publish non-peer-reviewed articles in a 'preprint' depository to which any research can be submitted and which all individuals can access free of charge. From January 2006 Genome Biology no longer publishes new articles in this section. Any article could be submitted by authors, who have sole responsibility for the article's content. The only screening process is to ensure relevance of the preprint to Genome Biology's scope and to avoid abusive, libellous or indecent articles. Articles in this section of the journal have not been peer-reviewed. Each preprint has a permanent URL, by which it can be cited. Research submitted to the preprint depository may be simultaneously or subsequently submitted to Genome Biology or any other publication for peer review; the only requirement is an explicit citation of, and link to, the preprint in the article that is eventually published. If possible, Genome Biology will provide a reciprocal link from the preprint depository to the published article.

Deposited research article

Observation of intermittency in gene expression on cDNA microarrays

1 Departments of Medicine, Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
2 Department of Physics, University of Houston, Texas 77204, USA
Genome Biology 2002, 3:preprint0005.1-0005.6
doi:10.1186/gb-2002-3-7-preprint0005

This is the first version of this article to be made available publicly, and no other version is available at present.

Subject areas: Bioinformatics, Methods, Genome studies

The electronic version of this article is the complete one and can be found online at: http://genomebiology.com/2002/3/7/preprint/0005
© 2002 BioMed Central Ltd

Abstract

We used scaled factorial moments to search for intermittency in the log expression ratios (LERs) for thousands of genes spotted on cDNA microarrays (gene chips). Results indicate varying levels of intermittency in gene expression. The observation of intermittency in the data analyzed provides a complementary handle on moderately expressed genes, generally not tackled by conventional techniques.

PACS: 87.10.+e

Deposited research article

Scaled factorial moments have found widespread use in high-energy physics for detecting intermittency in particle production [1,2,3,4,5,6,7,8,9,10]. The presence of jet-like structures and perhaps a quark-gluon plasma phase in particle production can result in clustering of data in bins, leading to holes and spikes in the rapidity distribution. This investigation is based on the type of intermittency defined as nonstatistical fluctuations invariant over the scale of resolution of particle rapidity [11,12,13,14,15,16,17,18]. We do not consider, for example, the type of intermittency found in turbulence, which produces non-Gaussian tails in temperature distributions [19].

In high-energy physics, Bialas and Peschanski [10] reported that the true bin probabilities can only be observed with infinite statistics. With a finite number of particles, the observed distribution of particles Q(p1, ..., pM) smears out the Bernoulli component, as shown in (5.2) of [10]. To overcome this, scaled factorial moments of the observed data are used to measure the scaled moments of the true distribution. If only statistical fluctuations are present in the rapidity distribution of particles, then there will be no intermittency. The added value of scaled factorial moments is that they also remove the Poissonian noise to reveal any dynamical fluctuations that may be present.
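The noise-removal property described above rests on a standard identity for Poisson counts, sketched here as a short derivation. The symbol λ_m (the expected count in bin m) is notation introduced only for this illustration and does not appear in the original text.

```latex
% Assumption for illustration: the count n_m in bin m is Poisson-distributed
% with mean \lambda_m (notation introduced here, not taken from the article).
\begin{align}
\langle n_m (n_m - 1) \cdots (n_m - q + 1) \rangle
  &= \sum_{n=q}^{\infty} \frac{n!}{(n-q)!} \,
     \frac{\lambda_m^{n} e^{-\lambda_m}}{n!} \\
  &= \lambda_m^{q} \, e^{-\lambda_m} \sum_{k=0}^{\infty} \frac{\lambda_m^{k}}{k!}
   = \lambda_m^{q}
\end{align}
```

Under this assumption the factorial moment of the observed count equals the ordinary qth power of the underlying mean, so the Poisson scatter drops out. For a flat distribution, where every bin has the same expected count, the scaled moment then stays close to 1 at every bin size.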
This paper describes the use of scaled factorial moments to search for intermittency in gene expression on complementary DNA (cDNA) microarrays (gene chips), which contain simultaneous expression levels for thousands of genes. Intermittency in this context implies that we are searching for local abundances of gene expression values within the Gaussian-like distribution of expression. If there is no intermittency, then we would expect smooth Gaussian-like distributions of expression with only statistical variation.

During nucleic acid transcription, each "coding" gene is transcribed into a messenger ribonucleic acid (mRNA) from its deoxyribonucleic acid (DNA) template. The mRNA transits from the cell nucleus to the ribosome, which resides in the cellular cytoplasm. At the ribosome, mRNAs are translated into proteins consisting of amino acids. For each gene, there are usually multiple copies of mRNA found in the cytoplasm.

In the laboratory, mRNA is extracted from treated (diseased) and control (normal) cells. Reverse transcription (RT) is then used to generate a complementary copy of DNA (cDNA) from each RNA. During RT, cDNAs from experimental cells are labelled with Cy5 dye, which fluoresces in the red wavelength, and cDNAs from normal cells are labelled with Cy3 dye, which fluoresces in the green wavelength. The labelled cDNAs are then aliquoted onto a cDNA microarray which has been spotted with DNA targets that are complementary to the cDNA. During hybridization, the red- and green-labelled cDNAs competitively bind with the spotted DNAs. Following drying, excimer laser scanning and computer image processing of a microarray, the pixel-averaged intensity of red and green signals for each spot (DNA) reveals the level of expression of a particular gene in the treated and normal cells. The logarithm of the ratio of intensities is known as the log expression ratio (LER), which compares gene expression in treated tissue with that in normal tissue.
(Spot intensities are normalized for the total treated and normal mRNA used for the hybridization.) Positive values of LER indicate greater gene expression (upregulation) in treated (or diseased) cells, whereas negative values of LER indicate lower expression (downregulation) in treated cells when compared with normal cells. In summary, cDNA microarrays use competitive hybridization to compare concentration levels of thousands of genes simultaneously expressed in treated and normal cells.

Let N represent the total number of genes spotted on a cDNA microarray. Let the range of LERs (y) on a microarray be Δy = ymax - ymin. Consider M non-overlapping, equally spaced bins of width δy = Δy/M. The qth (q = 2, ..., 5) scaled factorial moment is

$$F_q = M^{q-1} \sum_{m=1}^{M} \frac{n_m (n_m - 1) \cdots (n_m - q + 1)}{N (N - 1) \cdots (N - q + 1)}$$
where M is the total number of bins, nm is the number of genes whose LER value falls within bin m, and N = Σm nm is the total number of genes on the array.

We considered two sets of data for our analysis. The first was based on expression of 2,466 genes in the yeast S. cerevisiae at different times following various experimental treatments (79 arrays), available at the web site http://genome-www4.stanford.edu/MicroArray/SMD/publications [20]. These data reflect gene expression of S. cerevisiae during experimental treatment with alpha factor arrest ("alpha"), centrifugal elutriation ("elu"), temperature-sensitive mutation ("cdc15"), sporulation ("spo"), high temperature ("heat"), the reducing agent dithiothreitol ("dtt"), low temperature ("cold"), and diauxic shift ("diau"). The second data set consisted of expression for 9,706 genes in 60 cancer cell lines, available at the web site http://discover.nci.nih.gov/nature2000 [21]. Cancers represented are melanoma ("ME"), lung ("LC"), central nervous system ("CNS"), colorectal ("CO"), leukemia ("LE"), ovarian ("OV"), renal ("RE"), prostate ("PR"), and breast ("BR").

Calculations began by first determining "base" bin counts for the maximum number of bins possible for each array, Mmax = Δy/0.01, where 0.01 was the precision of the data. We observed that F2 increases rapidly when the bin size is smaller than 0.01, most likely due to round-off error in the creation of the LER values from the raw data. Round-off error can create artificial holes and spikes in the data. We therefore only consider bin sizes larger than 0.01. Fq was calculated for observed and simulated LERs at total bin numbers M = Mmax/L (L = 2, 3, ..., Mmax/2). A lower bound of 30 was used for M in all calculations. Bin counts nm for observed LERs were tabulated using M equally spaced bins of width δy.
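The binning scheme and the scaled factorial moment described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, M = Mmax/L is rounded down to an integer, and the data are re-histogrammed at each M rather than merged from the "base" bin counts.

```python
import numpy as np

def bin_counts(ler, n_bins):
    """Count LERs in n_bins equal-width bins spanning [min(ler), max(ler)]."""
    ler = np.asarray(ler, dtype=float)
    counts, _ = np.histogram(ler, bins=n_bins, range=(ler.min(), ler.max()))
    return counts

def scaled_factorial_moment(counts, q):
    """F_q = M^(q-1) * sum_m n_m(n_m-1)...(n_m-q+1) / [N(N-1)...(N-q+1)]."""
    n = np.asarray(counts, dtype=float)
    M, N = n.size, n.sum()
    num = np.ones_like(n)
    den = 1.0
    for k in range(q):
        num *= n - k   # falling factorial; it vanishes whenever n_m < q
        den *= N - k
    return M ** (q - 1) * num.sum() / den

def moments_over_scales(ler, q_values=(2, 3, 4, 5), precision=0.01, min_bins=30):
    """F_q at M = Mmax // L for L = 2, 3, ..., stopping once M < min_bins.

    Assumption: integer division is used for Mmax / L; the text does not
    say how non-integer bin numbers were handled.
    """
    ler = np.asarray(ler, dtype=float)
    m_max = int(round((ler.max() - ler.min()) / precision))
    results = {}
    for L in range(2, m_max // 2 + 1):
        M = m_max // L
        if M < min_bins:
            break
        counts = bin_counts(ler, M)
        results[M] = {q: scaled_factorial_moment(counts, q) for q in q_values}
    return results
```

Calling `moments_over_scales(ler)` on an array of LERs returns a dictionary keyed by the bin number M, from which ln Fq can be plotted against ln M.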
In the simulation, f(m) is the simulated bin count for the mth bin, N is the total number of LERs, h = 1.06σN^-0.2 is the bandwidth, u = (LERi - ym)/h, LERi is the value of each LER, and ym is the lower bound of the mth bin. The simulation is essentially a smoothed function of the data with attendant statistical fluctuation.

A second round of Fq calculations was made after subtracting 0.1 from ymin and ymax, redetermining bin cutoffs, and recalculating nm and f(m) in order to shift the scale of the phase space. This allowed us to look more closely at statistical fluctuations and also provided twice as many values of Fq for observed and simulated LERs. Plots of ln Fq vs. ln M were constructed for each array and each value of q. The difference between slopes for observed and simulated data was based on fitting a linear model of ln Fq on ln M separately for observed and simulated data, where Δ

Positive values of Δ

Figure 1 shows the frequency histogram for N = 2,402 LERs binned in M = 106 bins of width
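The comparison of observed and simulated slopes described in the methods can be sketched as follows. Several pieces are assumptions, because the original smoothing formula did not survive on this page: a Gaussian kernel is swapped in, the bandwidth uses the sample standard deviation for σ, the "attendant statistical fluctuation" is modelled as a multinomial draw from the smoothed bin probabilities, and all function names are hypothetical.

```python
import numpy as np

def scaled_factorial_moment(counts, q):
    """F_q = M^(q-1) * sum_m n_m(n_m-1)...(n_m-q+1) / [N(N-1)...(N-q+1)]."""
    n = np.asarray(counts, dtype=float)
    M, N = n.size, n.sum()
    num, den = np.ones_like(n), 1.0
    for k in range(q):
        num *= n - k
        den *= N - k
    return M ** (q - 1) * num.sum() / den

def smoothed_bin_probs(ler, edges):
    """Bin probabilities from a kernel smooth of the data.

    Assumptions: Gaussian kernel; h = 1.06 * sd * N^-0.2; the kernel is
    evaluated at u = (LER_i - y_m) / h with y_m the lower bound of bin m.
    """
    ler = np.asarray(ler, dtype=float)
    h = 1.06 * ler.std(ddof=1) * ler.size ** -0.2
    u = (ler[None, :] - edges[:-1, None]) / h
    dens = np.exp(-0.5 * u ** 2).sum(axis=1)
    return dens / dens.sum()

def ln_fq_slopes(ler, bin_numbers, q, rng=None):
    """Return (observed slope, simulated slope) of ln F_q versus ln M."""
    rng = np.random.default_rng() if rng is None else rng
    ler = np.asarray(ler, dtype=float)
    ln_m, ln_obs, ln_sim = [], [], []
    for M in bin_numbers:
        edges = np.linspace(ler.min(), ler.max(), M + 1)
        obs, _ = np.histogram(ler, bins=edges)
        # Simulated counts: smoothed shape plus statistical fluctuation.
        sim = rng.multinomial(ler.size, smoothed_bin_probs(ler, edges))
        ln_m.append(np.log(M))
        ln_obs.append(np.log(scaled_factorial_moment(obs, q)))
        ln_sim.append(np.log(scaled_factorial_moment(sim, q)))
    slope_obs = np.polyfit(ln_m, ln_obs, 1)[0]
    slope_sim = np.polyfit(ln_m, ln_sim, 1)[0]
    return slope_obs, slope_sim
```

In this sketch, an observed slope larger than the simulated one means the observed counts cluster more strongly than the smoothed distribution alone would produce as the bins shrink.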
Figure 3 shows values of Δ
Figure 4 shows the Δ
The observed intermittency in the data considered may suggest correlations in the abundances of expression levels within the Gaussian-like distributions of LERs. In the S. cerevisiae sporulation experiments, a majority of genes whose LERs fell within the spikes in Figure 1 had dramatically altered expression values later on in the sporulation experiments. Chu et al. [24] reported temporal changes in expression among a large number of genes throughout the sporulation process. In the cancer cell lines whose LER distributions were investigated, changes in intermittency over the arrays are likely due to cancer-specific alterations in cell-cycle control, DNA repair, oncogenesis, tumor suppression, apoptosis, and angiogenesis, all of which affect tumor growth, severity and evasion from attack by the immune system [25]. Cancers vary in their cause and severity, and there may be a wide range of unknown gene-gene and gene-environment interactions which impact gene expression.

Errors in reproducibility among the LERs considered were not provided by the groups that generated the data. However, several recent reports [26,27,28,29,30] give errors from various sources (probe preparation, spot size variability, scanning errors, software sophistication, etc.). Wildsmith et al. [26] reported a 28% standard error of the common logarithm of expression based on 64 replicate arrays containing 1248 duplicate spots. Lee et al. [28] reported a maximum misclassification of 9% based on three replicate arrays containing 288 genes. In this study, the spikes in Figure 1, for example, are not likely due to misclassification. Error, on the other hand, affects the bin size, and we have seen the effect throughout the bulk of the bin-size range.

This study was a first step to search for intermittency without establishing biological relevance. There is a growing literature on the identification of microarray-based regulatory gene networks [31,32,33,34].
We have already begun looking at individual genes and their contribution to F2 on a single array. Our current effort to develop microarray-based promoter models for co-expressed genes based on the Werner approach [33] will facilitate our understanding of regulatory control of genes with high contribution to F2. Progress in this effort is limited by the rate at which we can manually select genes with high contribution to F2, exon map the genes, fetch their upstream DNA promoter sequences, and then search for common transcription factor binding sites among the multiple promoter sequences in order to infer coregulation.

The observation of intermittency in the data analyzed provides a complementary handle on moderately expressed genes, generally not tackled by conventional techniques. Biologists often focus on strongly downregulated or upregulated genes, which are characterized by large negative and positive LERs. Our method of looking at intermittency in gene expression focused on the clustering of LERs independent of their absolute expression value. Thus, we were able to detect large density fluctuations among small LERs. As an example, spikes near the center of the binned distribution in Figure 1, whose LER values were low, greatly increased the factorial moments. Therefore, fold-change analysis, which focuses on large negative and positive LERs, or other multivariate statistical methods such as hierarchical cluster and principal component analyses, can miss unique density fluctuations at low LER values which are detected by factorial moments.

L.E.P. acknowledges the support of grant CA-78199-04 of the National Cancer Institute, and thanks C. Aime, F. Kun, A. Loctionov, M. Patra, R. Peschanski, and E. Sarkisyan-Grinbaum for helpful discussions on intermittency. K.L. is supported in part by the U.S. Department of Energy, Grant no. DE-FG03-96ER41004, and in part by the Texas Advanced Research Program, Grant no. 3652-0023-1999.


Figure 1.
Figure 2.
Figure 3.
Figure 4.