<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2007-8-7-r145</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Global analysis of patterns of gene expression during <it>Drosophila </it>embryogenesis</p>
         </title>
         <aug>
            <au id="A1" ce="yes">
               <snm>Tomancak</snm>
               <fnm>Pavel</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <email>tomancak@mpi-cbg.de</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Berman</snm>
               <mi>P</mi>
               <fnm>Benjamin</fnm>
               <insr iid="I1"/>
               <insr iid="I4"/>
               <email>bberman@usc.edu</email>
            </au>
            <au id="A3">
               <snm>Beaton</snm>
               <fnm>Amy</fnm>
               <insr iid="I1"/>
               <insr iid="I5"/>
               <email>beaton@fruitfly.org</email>
            </au>
            <au id="A4">
               <snm>Weiszmann</snm>
               <fnm>Richard</fnm>
               <insr iid="I5"/>
               <email>RWeiszmann@lbl.gov</email>
            </au>
            <au id="A5">
               <snm>Kwan</snm>
               <fnm>Elaine</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>ekwan@berkeley.edu</email>
            </au>
            <au id="A6">
               <snm>Hartenstein</snm>
               <fnm>Volker</fnm>
               <insr iid="I6"/>
               <email>volkerh@mcdb.ucla.edu</email>
            </au>
            <au id="A7" ca="yes">
               <snm>Celniker</snm>
               <mi>E</mi>
               <fnm>Susan</fnm>
               <insr iid="I5"/>
               <email>celniker@bdgp.lbl.gov</email>
            </au>
            <au id="A8">
               <snm>Rubin</snm>
               <mi>M</mi>
               <fnm>Gerald</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <insr iid="I7"/>
               <email>rubing@janelia.hhmi.org</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA</p>
            </ins>
            <ins id="I2">
               <p>Howard Hughes Medical Institute, Cyclotron Road, Berkeley, CA 94720, USA</p>
            </ins>
            <ins id="I3">
               <p>Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr., Dresden, D-01307, Germany</p>
            </ins>
            <ins id="I4">
               <p>Department of Preventive Medicine, Keck School of Medicine of USC, Eastlake Ave, Los Angeles, CA 90033, USA</p>
            </ins>
            <ins id="I5">
               <p>Lawrence Berkeley National Laboratory, Cyclotron Road, Berkeley, CA 94720, USA</p>
            </ins>
            <ins id="I6">
               <p>Department of Molecular Cell and Developmental Biology, University of California Los Angeles, Los Angeles, CA 90095, USA</p>
            </ins>
            <ins id="I7">
               <p>Janelia Farm Research Campus, HHMI, Helix Drive, Ashburn, VA 20147, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>7</issue>
         <fpage>R145</fpage>
         <url>http://genomebiology.com/2007/8/7/R145</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17645804</pubid>
               <pubid idtype="doi">10.1186/gb-2007-8-7-r145</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>8</day>
               <month>3</month>
               <year>2007</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>5</day>
               <month>6</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>23</day>
               <month>7</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>23</day>
               <month>07</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Tomancak et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>Gene expression during <it>Drosophila </it>embryogenesis</p>
      </shorttitle>
      <shortabs>
         <p>Embryonic expression patterns for 6,003 (44%) of the 13,659 protein-coding genes identified in the <it>Drosophila melanogaster </it>genome were documented, of which 40% show tissue-restricted expression.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Cell- and tissue-specific gene expression is a defining feature of embryonic development in multi-cellular organisms. However, the range of gene expression patterns, the extent of the correlation of expression with function, and the classes of genes whose spatial expression is tightly regulated have been unclear due to the lack of an unbiased, genome-wide survey of gene expression patterns.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We determined and documented embryonic expression patterns for 6,003 (44%) of the 13,659 protein-coding genes identified in the <it>Drosophila melanogaster </it>genome with over 70,000 images and controlled vocabulary annotations. Individual expression patterns are extraordinarily diverse, but by supplementing qualitative <it>in situ </it>hybridization data with quantitative microarray time-course data using a hybrid clustering strategy, we identify groups of genes with similar expression. Of 4,496 genes with detectable expression in the embryo, 2,549 (57%) fall into 10 clusters representing broad expression patterns. The remaining 1,947 (43%) genes fall into 29 clusters representing restricted expression, 20% patterned as early as blastoderm, with the majority restricted to differentiated cell types, such as epithelia, nervous system, or muscle. We investigate the relationship between expression clusters and known molecular and cellular-physiological functions.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Nearly 60% of the genes with detectable expression exhibit broad patterns reflecting quantitative rather than qualitative differences between tissues. The other 40% show tissue-restricted expression; the expression patterns of over 1,500 of these genes are documented here for the first time. Within each of these categories, we identified clusters of genes associated with particular cellular and developmental functions.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010005">Development</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>A defining feature of multi-cellular organisms is their ability to differentially utilize the information contained in their genomes to generate morphologically and functionally specialized cell types during development. Regulation of gene expression in time and space is a major driving force of this process.</p>
         <p>A gene's expression pattern can be defined as a series of differential accumulations of its products in subsets of cells as development progresses. Patterns of mRNA expression are studied by two principal methods - microarray analysis <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> and <it>in situ </it>hybridization <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. Microarray analysis provides both a quantitative measure of gene expression and an overview of the temporal dynamics of gene expression regulation <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. A major limitation of microarray analysis is that obtaining spatial information depends on the dissection or cell-sorting of specific tissues or cell types <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. RNA <it>in situ </it>hybridization has the potential to reveal both spatial and temporal aspects of gene expression during development. However, RNA <it>in situ </it>hybridization is not quantitative <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. For these reasons, we have used both methods in parallel and integrated the analysis of the resultant datasets.</p>
         <p>There are several reasons for choosing <it>Drosophila melanogaster </it>as an organism for the global study of gene expression during embryonic development. Genetic and molecular analyses have led to a deep understanding of many embryonic processes in this animal <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Classical embryology has provided a solid framework for the anatomical description of embryonic stages <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and robust high-throughput methods for assaying gene expression by whole mount <it>in situ </it>hybridization have been developed <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. In many cases, the wild-type gene expression pattern has informed the interpretation of the phenotype produced by its mutation <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Such studies have provided unprecedented insights into animal development; the process that governs the early embryonic patterning of the <it>Drosophila </it>body plan is now the best understood example of a complex cascade of transcriptional regulation during development <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>.</p>
         <p>We have assembled an atlas of gene expression patterns during <it>Drosophila </it>embryogenesis. Taking advantage of non-redundant gene collections <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>, we performed an unbiased survey of gene expression by using RNA <it>in situ </it>hybridization of gene specific probes to fixed <it>Drosophila </it>embryos <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> and documented the patterns with a set of digital photographs. We describe the tissue specificity of gene expression at each stage range using selected terms from a controlled vocabulary (CV) for embryo anatomy <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The CV integrates the spatial and temporal dimensions of the gene expression patterns by linking together intermediate tissues that develop from one another. It also integrates morphological and molecular description of development by allowing for structures that are morphologically indistinguishable and can be defined only on the basis of gene expression. We show that the genes sampled, representing 44% of the <it>Drosophila </it>genes, are largely representative of the genome as a whole, allowing the global analysis of gene expression during the embryonic development of a multicellular organism. We organized the complex gene expression space by a hybrid fuzzy-clustering approach that uses microarray profiles to supplement the CV annotation of <it>in situ </it>patterns. We divided the resulting clusters into two categories, broad and restricted. Broad patterns are characterized by quantitative enrichment in tissues that are related by specific cellular states. Restricted patterns are highly diverse and provide a basis for defining gene sets expressed in related tissues and with related predicted functions.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Annotation dataset</p>
            </st>
            <p>The starting point for our analyses is a collection of 6,003 genes whose embryonic expression patterns we have assayed by <it>in situ </it>hybridization and systematically annotated with CVs (Release 2.0). The number of genes in the dataset has more than doubled from Release 1 <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, from 2,179 to 6,003, and the accuracy of the annotation has been significantly enhanced by performing a full re-evaluation of every gene by a second, independent curator (Materials and methods; Additional data file 1). Release 2.0, including 74,833 staged embryo images and accompanying CV annotations and microarray data, is publicly available via a searchable database <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, providing a convenient way to mine the dataset for particular expression patterns. To determine how representative our sample is, we compared the distribution of selected Gene Ontology (GO) functional annotations (generic GO slim <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>) between the 6,003 genes in our subset and the 14,586 genes in the Release 4.3 genome (Additional data file 2). No major biases for a specific molecular function, component or process were detected. Our dataset is slightly enriched for genes with known or inferred GO functions, and is, therefore, slightly deficient for genes with unknown assignment. Genes in this category lack conserved sequence features that would relate them to genes in other organisms, and may be expressed at very low levels, leading to a relative under-representation in expressed sequence tag (EST) collections. We conclude that our dataset contains a largely representative sample of gene expression patterns in the <it>Drosophila </it>genome.</p>
            <p>To annotate gene expression patterns, we used a set of 314 anatomical terms selected from the broad <it>Drosophila </it>Controlled Vocabulary for Anatomy maintained by FlyBase <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. We grouped developmental structures into 16 color-coded organ systems, and reduced the full 314-term CV to 145 terms by collapsing rarely used or difficult to distinguish sub-terms to their corresponding parent term (Materials and methods; Additional data files 3-5). In order to compare the gene expression properties for a set of related genes, we created a representation of the hierarchical CV that fits on a single line, which we call an 'anatomical signature', or 'anatogram'. Figure <figr fid="F1">1</figr> shows an anatogram for the set of 3,334 genes showing maternal expression. The relative enrichment or under-representation of CV annotations in this set of genes is indicated by the direction and height of the bar corresponding to each term, while the width of the bar indicates the genome-wide frequency of the term. Thus, commonly used annotation terms such as 'brain' (Figure <figr fid="F1">1</figr>, red asterisk) have wider bars than rare terms such as 'amnioserosa' (Figure <figr fid="F1">1</figr>, green asterisk). We used the anatomical signature to summarize groups of genes in this paper and in the accompanying supplementary online material <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Normalized anatomical signature - the anatogram</p>
               </caption>
               <text>
                  <p>Normalized anatomical signature - the anatogram. A linear representation of the CV is used to show the enrichment of annotations within the set of all 3,334 maternally expressed genes versus the entire dataset of 4,759 genes expressed in the embryo. A vertical black line delimits stages, and each colored bar represents an individual CV term (an expanded color key is shown in Additional data file 3). The width of each bar is proportional to the number of times a term was used in our entire dataset, and the height represents the relative enrichment of the given term within the particular gene set (in this case, all maternally expressed genes). Enrichment is given in units of standard deviation above or below the expected sample count based on the background frequencies (z-score). Terms with bars below the zero line are under-represented in the sample. The green asterisk corresponds to the 'amnioserosa' term, while the red asterisk corresponds to the 'brain' term. On the web supplement [21], the user can place the mouse pointer over any bar in the anatomical signature (arrow on the midgut bar in stage range 13-16) and obtain the gene count for the term in the entire dataset, the gene count within the particular set of genes under study, and a <it>p </it>value for statistical over- or under-representation within the set (shown in the black bordered lavender box).</p>
               </text>
               <graphic file="gb-2007-8-7-r145-1"/>
            </fig>
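            <p>As a rough illustration of the enrichment measure shown in the anatogram, the following Python sketch computes a z-score for one CV term by comparing the observed count in a gene set with the count expected from the background frequency. The binomial variance model, the function name <it>term_z_score </it>and the example counts are assumptions made for illustration only; the text above does not specify the variance model that was used.</p>
            <p><![CDATA[
```python
import math

def term_z_score(term_in_set, set_size, term_in_background, background_size):
    """Z-score of a CV term count in a gene set versus the background.

    Assumes a binomial model in which each gene in the set carries the
    term with the background frequency (an illustrative choice).
    """
    p = term_in_background / background_size      # background frequency
    expected = set_size * p                       # expected count in the set
    sd = math.sqrt(set_size * p * (1.0 - p))      # binomial standard deviation
    if sd == 0.0:
        return 0.0                                # term absent or universal
    return (term_in_set - expected) / sd

# Hypothetical counts: 100 of 400 genes in the set carry the term,
# versus 500 of 4,000 genes in the whole dataset.
z = term_z_score(100, 400, 500, 4000)
```
]]></p>
            <p>A positive value corresponds to a bar above the zero line (enrichment) and a negative value to a bar below it (under-representation).</p>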
         </sec>
         <sec>
            <st>
               <p>Organization of gene expression data using a hybrid clustering approach</p>
            </st>
            <p>Of the 6,003 genes annotated, 4,759 (79%) showed detectable expression in the embryo, while the remaining 1,244 (21%) were annotated with only the 'No staining' CV term. By grouping genes with identical annotations, the 4,759 genes with detectable expression in the embryo were subdivided into 205 multi-gene groups and 2,335 'singleton' groups (that is, groups consisting of a single uniquely annotated gene). By relaxing the criteria and grouping genes that had at least 75% of their annotation terms in common, we identified 393 multi-gene groups and 1,804 singletons. If we consider each of the multi-gene groups and each of the singleton groups to represent a distinct expression pattern, this method suggests that there are up to 2,197 distinct patterns within our dataset (Additional data file 6).</p>
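            <p>The two grouping steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the gene names and annotations are hypothetical, the fraction of shared terms is measured relative to the larger of the two annotation sets, and the relaxed grouping is a simple greedy pass. None of these choices is taken from the analysis itself.</p>
            <p><![CDATA[
```python
from collections import defaultdict

def group_identical(annotations):
    """Group genes whose sets of CV terms are exactly equal."""
    groups = defaultdict(list)
    for gene, terms in annotations.items():
        groups[frozenset(terms)].append(gene)
    return list(groups.values())

def shared_fraction(a, b):
    """Fraction of terms in common, relative to the larger annotation set.

    This definition of 'terms in common' is an assumption.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / max(len(a), len(b))

def group_relaxed(annotations, threshold=0.75):
    """Greedy single-pass grouping: a gene joins the first group whose
    seed gene shares at least `threshold` of its terms."""
    groups = []  # list of (seed_terms, member_genes)
    for gene, terms in annotations.items():
        terms = frozenset(terms)
        for seed, members in groups:
            if shared_fraction(seed, terms) >= threshold:
                members.append(gene)
                break
        else:
            groups.append((terms, [gene]))
    return [members for _, members in groups]

# Hypothetical gene names and annotations.
example = {
    "geneA": {"brain", "midgut", "muscle", "hindgut"},
    "geneB": {"brain", "midgut", "muscle", "hindgut"},
    "geneC": {"brain", "midgut", "muscle"},
    "geneD": {"amnioserosa"},
}
strict = group_identical(example)
relaxed = group_relaxed(example)
```
]]></p>
            <p>In this toy example, strict grouping leaves two singletons, whereas the relaxed criterion merges the third gene into the multi-gene group and leaves one singleton.</p>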
            <p>To further refine the number of expression categories, we developed a clustering strategy that allowed us to incorporate the quantitative temporal expression data obtained from the microarray experiments together with the qualitative, but spatially rich, data on expression patterns from the CV annotations. We implemented this approach within the framework of fuzzy c-means clustering <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> and developed a similarity metric that assigns different weights to the contribution of the microarray and annotation data (Materials and methods). Our goal was to find a proper balance between the contributions of annotation similarity versus microarray similarity to the overall similarity score. We desired a score that would minimize the contribution of microarray similarity for cases like those genes in Figure <figr fid="F2">2a</figr>, which have almost identical array profiles but incompatible annotation profiles. On the other hand, we wanted a score that would use array similarity to improve the reliability of clustering of broadly expressed genes that had similar but not identical annotation profiles, such as those in Figure <figr fid="F2">2b,c</figr>. We therefore used an asymmetric mixture function that varied the contribution of microarray data based on the similarity of the annotation data (Additional data file 7). Similarity for microarray profiles was calculated using a simple correlation metric, while similarity for <it>in situ </it>annotation profiles was calculated using a custom metric that independently weighted the contribution of each developmental stage range (Materials and methods).</p>
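            <p>One way to picture such an asymmetric mixture is the Python sketch below, in which the weight given to the microarray correlation grows with the annotation similarity. The linear weighting, the parameter <it>max_array_weight </it>and the function names are hypothetical and are not the function given in Additional data file 7; the sketch only shows how nearly identical array profiles can be prevented from rescuing incompatible annotations.</p>
            <p><![CDATA[
```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # flat profile: correlation undefined, treat as none
    return cov / math.sqrt(vx * vy)

def hybrid_similarity(annot_sim, array_corr, max_array_weight=0.5):
    """Combine annotation similarity (0..1) with array correlation (-1..1).

    The array weight is proportional to the annotation similarity, so the
    array data contribute nothing when annotations are incompatible.
    Both the linear form and max_array_weight are illustrative assumptions.
    """
    w = max_array_weight * annot_sim
    array_sim = max(array_corr, 0.0)  # ignore anti-correlated profiles
    return (1.0 - w) * annot_sim + w * array_sim

# Hypothetical profiles with perfectly correlated array data.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(profile_a, profile_b)
incompatible = hybrid_similarity(0.0, r)  # annotations share nothing
similar = hybrid_similarity(0.8, r)       # annotations mostly agree
```
]]></p>
            <p>With these toy inputs the pair with incompatible annotations keeps a score of zero despite identical array profiles, while the pair with similar annotations is raised above its annotation-only score.</p>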
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Microarray data can supplement, but not supplant, <it>in situ </it>gene expression patterns</p>
               </caption>
               <text>
                  <p>Microarray data can supplement, but not supplant, <it>in situ </it>gene expression patterns. Microarray data and the CV annotations are shown for genes <b>(a) </b>restricted to particular tissues late in embryogenesis, and <b>(b,c) </b>for broadly expressed genes encoding basic cellular protein complexes. Genes in (a) show strikingly similar array profiles but are expressed in quite diverse tissues. Late in embryogenesis half resolve to the epidermis (*e), and the other half are expressed in muscle (*m), fat body (*fb), and nervous system (*n). The genes of the DNA replication complexes, origin recognition complex and minichromosome maintenance complex display a characteristic pattern with peak expression at hour 5 (stage 10) and late expression in CNS (b). Similarly, the mitochondrial ribosomal genes decline during early embryogenesis but begin to rise around hour 10 (stage 13), with <it>in situ </it>hybridization most common in the midgut and muscle (c). For these broadly expressed gene classes the similarity of the microarray profiles is useful for supplementing the description of the <it>in situ </it>hybridization patterns using the CV annotations.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-2"/>
            </fig>
            <p>The fuzzy c-means algorithm is fuzzy in that each gene is assigned to one or more clusters <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. As multiple independent regulatory elements can drive the expression of a single gene in different tissues or at different times in development, this is a desirable property for this particular clustering problem. However, despite extensive experimentation with different clustering parameters, the large diversity of expression patterns led to clusters with ambiguous boundaries. Replication experiments using random initialization variables <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> resulted in clusters that were qualitatively similar but with numerous genes redistributed between neighboring clusters <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Therefore, each gene was assigned a score for each cluster. This score was used to rank the members of each cluster from the most prototypical to the most ambiguous, and genes with high scores in multiple independent clusters were assigned to each of those clusters. This scoring allowed us to define a cutoff and determine the set of 'core' genes belonging most unambiguously to one and only one cluster (Materials and methods).</p>
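            <p>For readers unfamiliar with fuzzy clustering, the sketch below shows the standard fuzzy c-means membership formula and how per-cluster scores could be turned into multi-cluster assignments and a 'core' flag. The fuzzifier <it>m </it>= 2, the two cutoffs and the example distances are assumptions chosen for illustration; they are not the values or the scoring scheme used in this study.</p>
            <p><![CDATA[
```python
def memberships(distances, m=2.0):
    """Fuzzy c-means membership of one gene in each cluster.

    distances: distance from the gene to each cluster centre.
    m: fuzzifier (m > 1); 2.0 is a common default, assumed here.
    """
    # A gene sitting exactly on a centre belongs fully to that cluster.
    zeros = [d == 0.0 for d in distances]
    if any(zeros):
        hits = sum(zeros)
        return [1.0 / hits if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in distances]
    total = sum(inv)
    return [v / total for v in inv]  # memberships sum to 1

def assign(distances, core_cutoff=0.7, multi_cutoff=0.3):
    """Return (clusters, is_core) for one gene under hypothetical cutoffs."""
    u = memberships(distances)
    clusters = [i for i, v in enumerate(u) if v >= multi_cutoff]
    is_core = max(u) >= core_cutoff
    return clusters, is_core

# Hypothetical distances to three cluster centres.
clear = assign([0.5, 2.0, 4.0])      # close to one centre only
ambiguous = assign([1.0, 1.0, 4.0])  # equally close to two centres
```
]]></p>
            <p>The first gene is a core member of a single cluster, whereas the second is assigned to two clusters and is not core to either.</p>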
            <p>Of 4,759 genes expressed in the embryo, we had microarray expression data for 4,496. The best fuzzy c-means run grouped these genes into 39 clusters, and each cluster was designated as either 'broad' or 'restricted'. Clusters containing a significant fraction of genes annotated as 'ubiquitous' were designated as broad, as were clusters containing primarily genes with unrestricted maternal only expression (Materials and methods). We also decided to include as broad those clusters of genes exhibiting maternal expression early and midgut-only expression late. Many genes annotated in this way (Figure <figr fid="F2">2c</figr>) encode the mitochondrial ribosomal proteins and other presumably ubiquitous mitochondrial proteins. Using these criteria, 10 of the 39 clusters (Figure <figr fid="F3">3</figr>, 1B-10B) were designated broad, and 2,549 (56.7%) genes were assigned to these clusters. The remaining 1,947 (43.3%) genes exhibited highly restricted patterns and were assigned to 29 clusters designated restricted (Table <tblr tid="T1">1</tblr>) <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Clustered gene expression data for broadly expressed genes</p>
               </caption>
               <text>
                  <p>Clustered gene expression data for broadly expressed genes. We divided broadly expressed genes into 10 clusters labeled 1B-10B, each cluster separated by a horizontal black bar. From the left, we show normalized eisengrams [43] representing microarray data for 13 one-hour time points (yellow relative high expression, blue relative low expression), followed by annotation matrices split by stage range and color-coded according to organ systems. On the right is a magnified view of clusters 2B and 4B highlighting the diversity of annotations for subsets of genes.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-3"/>
            </fig>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Division of clustering results into broad and restricted expression patterns</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="left">
                        <p>Clusters assigned</p>
                     </c>
                     <c ca="center">
                        <p>One</p>
                     </c>
                     <c ca="center">
                        <p>Two</p>
                     </c>
                     <c ca="center">
                        <p>Three or more</p>
                     </c>
                     <c ca="center">
                        <p>Total</p>
                     </c>
                     <c ca="center">
                        <p>Percent</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>No expression</p>
                     </c>
                     <c ca="center">
                        <p>1,064</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1,064</p>
                     </c>
                     <c ca="center">
                        <p>19%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Broad</p>
                     </c>
                     <c ca="center">
                        <p>1,959</p>
                     </c>
                     <c ca="center">
                        <p>401</p>
                     </c>
                     <c ca="center">
                        <p>189</p>
                     </c>
                     <c ca="center">
                        <p>2,549</p>
                     </c>
                     <c ca="center">
                        <p>46%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Restricted</p>
                     </c>
                     <c ca="center">
                        <p>1,152</p>
                     </c>
                     <c ca="center">
                        <p>606</p>
                     </c>
                     <c ca="center">
                        <p>189</p>
                     </c>
                     <c ca="center">
                        <p>1,947</p>
                     </c>
                     <c ca="center">
                        <p>35%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total</p>
                     </c>
                     <c ca="center">
                        <p>4,175</p>
                     </c>
                     <c ca="center">
                        <p>1,007</p>
                     </c>
                     <c ca="center">
                        <p>378</p>
                     </c>
                     <c ca="center">
                        <p>5,560*</p>
                     </c>
                     <c ca="center">
                        <p>100%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*Number of genes with valid microarray values for all time points. Genes assigned to both a broad cluster and any other cluster are counted only as broad.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Broadly expressed genes</p>
            </st>
            <p>The ten clusters encompassing broadly expressed genes have relatively similar array profiles, but the diversity of annotations makes the boundaries between these clusters somewhat arbitrary (Figure <figr fid="F3">3</figr>). While there is significant ambiguity in determining the borders of these clusters, each has a distinguishing expression profile. All broad clusters (Figure <figr fid="F4">4a-h</figr>) have maternal expression followed by ubiquitous or broad expression. Genes within these clusters have stereotypical cellular functions, which reveal the physiological and cell biological states of different domains in the embryo during development.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Overview of broad expression patterns</p>
               </caption>
               <text>
                  <p>Overview of broad expression patterns. For the core genes in each broad cluster, we summarize the array profile, the annotation profile (anatogram), the number of total and core genes in the cluster and show one image for each stage of embryogenesis for a single representative gene. Array plots show the distribution of scaled intensity scores: the blue line indicates the median value while the gray box gives the inter-quartile range. The green rectangle shows that staining patterns of all broad genes are remarkably similar immediately after gastrulation. The representative late stage embryos (boxed in red) illustrate the relative diversity into which each of these homogenous early patterns resolve.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-4"/>
            </fig>
            <p>Cluster 1B is one of several broad clusters characterized by peak microarray expression around hours 4-5 (stage 10; Figure <figr fid="F4">4a</figr>). <it>In situ </it>hybridization showed continued ubiquitous staining throughout embryogenesis, with the heaviest staining resolving to the differentiated midgut, muscle, hindgut, foregut, and anal pads. Genes within this cluster exhibit diverse cellular functions, but its core members include more than half of all genes known to be involved in nucleolar-based ribosome biogenesis (40 &#215; enrichment, <it>p </it>= 5.8e-11; Additional data file 8).</p>
            <p>Genes in cluster 2B and many in cluster 3B are characterized by peak expression levels around hour 12 (stage 15) and by <it>in situ </it>hybridization appear strongest in the differentiated midgut, muscle, hindgut, and foregut (Figure <figr fid="F4">4b,c</figr>). Cluster 2B contains 33% of all genes annotated as being mitochondrial (7 &#215; enrichment, <it>p </it>= 2.7e-48; Additional data file 8). Genes in 3B often appear restricted to the midgut, but this cluster was classified as 'broad' due to its apparent relationship to cluster 2B, both in its overall expression profile and its enrichment for mitochondrial genes (3 &#215; enrichment, <it>p </it>= 1.6e-5). There is a significant correlation (<it>p </it>= 3.7e-9) between the genes in clusters 2B and 3B and genes shown in an RNA interference (RNAi) screen to be induced by the histone deacetylase SIN3, suggesting a possible regulatory mechanism <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. A substantial fraction of these SIN3-induced genes, about 25%, are classified as having diminishing maternal staining by our <it>in situ </it>clustering (<it>p </it>= 2.6e-8 correlation with cluster 10B), suggesting that this common expression pattern is often beneath the level of detection by whole mount <it>in situ </it>hybridization.</p>
            <p>Clusters 4B and 5B are characterized by peak expression levels around hours 4-5 (stage 10) and often resolve to exhibit staining in the differentiated nervous system and midgut (Figure <figr fid="F4">4d,e</figr>). The two clusters are differentiated by expression in the stage 13-16 gonad (Figure <figr fid="F4">4d</figr>). Both clusters are significantly enriched for genes with apparent functions in cell division, including genes required for DNA metabolism, 4B (4 &#215; enrichment, <it>p </it>= 6.6e-5) and 5B (4 &#215; enrichment, <it>p </it>= 5.6e-12), and the cell cycle, 4B (3 &#215; enrichment, <it>p </it>= 4.9e-3) and 5B (4 &#215; enrichment, <it>p </it>= 5.8e-16). Consistent with this overrepresentation of cell-cycle regulated genes, there is significant overlap between the genes in these clusters and a set of 65 genes identified in an RNAi screen for dE2F transcriptional targets <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. We have 41 of these genes in our dataset with 40% belonging to 5B (8 &#215; enrichment, <it>p </it>= 2.2e-12) and 20% belonging to 4B (9 &#215; enrichment, <it>p </it>= 1.4e-6).</p>
            <p>Genes in cluster 6B are almost uniformly annotated as ubiquitous at all stages of embryogenesis and this annotation is supported by relatively high average array expression levels at all time points (Figure <figr fid="F4">4f</figr>). Cluster 6B contains over 80% of the genes encoding the components of the cytosolic ribosome (8 &#215; enrichment, <it>p </it>= 1.1e-29) and other genes involved in protein metabolism. Additionally, 40% of the 100 genes identified as essential for viability based on a large RNAi screen <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> are included in this cluster (4 &#215; enrichment; <it>p </it>= 2.6e-16).</p>
            <p>The genes in clusters 1B-6B exhibit remarkably similar expression patterns during gastrulation and were most frequently annotated as endoderm and mesoderm anlagen (Figure <figr fid="F4">4</figr>, green rectangle). This early pattern later resolves into endodermal and mesodermal derivatives for genes in clusters 1B-3B or into central nervous system (CNS) and midgut for genes in clusters 4B-5B (Figure <figr fid="F4">4</figr>, red rectangle).</p>
            <p>Clusters 7B-10B are composed of genes with maternally deposited transcripts that diminish after stage 7 (Figure <figr fid="F4">4g,h</figr>). Those in 7B (75 genes; Figure <figr fid="F3">3</figr>) appear to rise steadily until hour 9 (stage 12), while those in 8B (49 genes) come on strongly at 16 hours (stage 16), at a time when formation of cuticle prevents efficient RNA <it>in situ </it>hybridization. Genes in cluster 9B (650 genes) show a spike in expression during the blastoderm stage, correlating with the onset of zygotic transcription, and differ from those in clusters 7B, 8B, and 10B by their annotation as 'ubiquitous' through gastrulation. It is likely that for genes in clusters 7B and 9B, the diminishing maternal expression is augmented by zygotic expression; however, a method that specifically distinguishes between maternal and zygotic transcripts is required to categorize these patterns conclusively.</p>
            <p>The genes and expression patterns in broad clusters have largely failed to attract the attention of developmental biologists, as indicated by the fact that the embryonic expression of only 4.3% of them has been described in the scientific literature <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Yet, they represent more than half of the genes expressed in embryogenesis. Our analysis of broad patterns provides a comprehensive and unbiased overview of these neglected genes and redefines ubiquitous gene expression during development. A major lesson learned from our <it>in situ </it>screen is that a CV annotation strategy is insufficient to describe these patterns fully.</p>
         </sec>
         <sec>
            <st>
               <p>Restricted expression patterns</p>
            </st>
            <p>While the diversity of expression patterns was considerable, our hybrid clustering approach identified a number of tissue- or domain-specific expression patterns shared among a significant number of genes. Although these clusters are more easily categorized than the broad clusters, considerable ambiguity between clusters remains (Figure <figr fid="F5">5</figr>).</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Clustered gene expression data for genes expressed in a restricted manner</p>
               </caption>
               <text>
                  <p>Clustered gene expression data for genes expressed in a restricted manner. We divided genes with restricted expression patterns into 29 clusters labeled 1R-29R, each cluster separated by a horizontal black bar. We used the same conventions as described for the broad clusters to capture and display the microarray and embryonic expression data (see legend to Figure 4).</p>
               </text>
               <graphic file="gb-2007-8-7-r145-5"/>
            </fig>
            <p>Clusters 1R-4R contain 383 genes expressed in various combinations of the yolk nuclei, fat body and blood related tissues (Figure <figr fid="F6">6a-c</figr>). Clusters 1R and 2R genes are more likely to be expressed in combinations of these different structures, while 3R genes are primarily expressed in the fat body, and 4R genes in the head mesoderm and related tissues. Interestingly, the tissues represented in these clusters derive from distinct developmental lineages, raising the question of whether a single coordinated expression program underlies expression in these seemingly unrelated developmental domains.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Overview of the restricted expression patterns</p>
               </caption>
               <text>
                  <p>Overview of the restricted expression patterns. For unique genes in each cluster, we summarized the array profiles, diversity of annotation terms (as an anatogram), and number of total and core genes and show two to four embryo images. Whenever possible, genes with previously uncharacterized expression patterns were selected. Array plots show the distribution of scaled intensity scores: the blue line indicates the median value while the gray box gives the inter-quartile range. The most relevant annotation terms in each anatogram are labeled.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-6"/>
            </fig>
            <p>Clusters 5R-7R contain 1,160 genes expressed late in embryogenesis (stage range 13-16) in a number of epithelial structures (Figure <figr fid="F6">6d-f</figr>), including the epidermis, hindgut, foregut, and trachea. The epithelial pattern (Figure <figr fid="F6">6d</figr>, CG7724, CG4702) is the most recognizable and most abundant tissue-restricted pattern in embryogenesis. The epithelial expression pattern is frequently associated with expression in the tracheal system (Figure <figr fid="F6">6e</figr>). A subset of genes (Figure <figr fid="F6">6f</figr>) also showed expression in mid-embryogenesis (stages 9-12), suggesting they play a role in development and morphogenesis. The differences between the late epithelial clusters (Figure <figr fid="F6">6d,e</figr>) and the early epithelial cluster (Figure <figr fid="F6">6f</figr>) are apparent not only in the CV annotations, but also in the average microarray profiles of these clusters.</p>
            <p>Clusters 13R-16R contain 525 genes expressed specifically in the central and peripheral nervous system (Figure <figr fid="F6">6g-j</figr>). In contrast to the genes in the broad clusters 4B and 5B that are also expressed in the nervous system, these genes lack maternally contributed transcripts and any detectable staining at or immediately after gastrulation. The CNS specific gene expression (Figure <figr fid="F6">6g</figr>) begins at stage 11 and almost always includes both the brain and the ventral nerve cord. A subset of genes (Figure <figr fid="F6">6h</figr>) is also expressed in the midline, with a small number showing transcription before stage 11. Genes expressed exclusively in the midline were extremely rare. Many genes are expressed in both the central and peripheral nervous systems (Figure <figr fid="F6">6i</figr>), while a significant number are expressed in the peripheral nervous system alone (Figure <figr fid="F6">6j</figr>).</p>
            <p>Clusters 18R and 19R contain 229 genes expressed in either differentiated somatic muscle (Figure <figr fid="F6">6k</figr>) or differentiated visceral muscle (Figure <figr fid="F6">6l</figr>). Most genes that were detected in the visceral muscle became active earlier in the mesoderm primordia. As with the head and trunk components of the nervous system, expression in trunk muscles was almost always accompanied by expression in head muscles.</p>
            <p>Clusters 23R-29R contain 422 genes expressed in a domain-specific manner beginning in the blastoderm stage embryo and typically continuing in a tissue-specific manner throughout embryogenesis (Figure <figr fid="F6">6m-p</figr>). Many genes are assigned to more than one cluster with only 148 (35%) assigned to a single cluster. Often genes patterned in the blastoderm show tissue-specific restricted late expression primarily in the CNS and epidermis. The relationship between blastoderm-stage expression and later tissue-specific expression is elusive. While continuity of expression in particular lineage-specific regulatory genes is well-documented, we fail to detect any statistically significant relationship between annotations at the blastoderm and later stages in our full, unbiased set of genes. While we cannot conclusively rule out that this is due to a limitation of our CV, it more likely indicates that expression of such genes is initiated independently at different stages of development rather than maintained through developmental lineages.</p>
            <p>An additional eight clusters contain 349 genes with late tissue-specific expression (Additional data file 9a-h). Some of these contain genes expressed throughout development in a single tissue, like the cluster of genes expressed in pole and germ cells (Additional data file 9h), while others, like the cluster of midgut-specific genes (Additional data file 9b), are primarily expressed in a particular tissue at a particular time.</p>
            <p>Despite the significant number of genes that conform well to the patterns represented by the above clusters, a large fraction is expressed in unique combinations of tissues or organs. Fuzzy clustering assigned these genes to the set of clusters that best described their expression patterns. Of the 1,947 genes expressed in a restricted manner, 795 (41%) are assigned to more than one cluster (Table <tblr tid="T1">1</tblr>). We illustrate this by showing several examples of genes assigned to multiple clusters (Figure <figr fid="F7">7</figr>). By allowing genes to be placed into more than one expression cluster, we also hope to facilitate online searches of our dataset by representing the range of each gene's expression. The 29 restricted clusters can be viewed as distinct transcriptional programs and the numerous genes that are expressed in unique combinations of tissues combine these basic programs. Such a view is consistent with our current understanding of how complex patterns of expression are generated by a set of independently acting <it>cis</it>-regulatory modules <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. An interesting direction for future research will be to uncover the <it>cis</it>-regulatory modules that are associated with the individual restricted clusters and to examine whether or how these modules are utilized to achieve the observed diversity in gene expression.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Genes classified in multiple clusters</p>
               </caption>
               <text>
                  <p>Genes classified in multiple clusters. <b>(a) </b><it>CG17052 </it>is expressed in the ring gland as well as a number of epithelial structures at stage 14. It belongs to two clusters: 17R, the ring gland (r.g.); and 6R, the late epithelial pattern with trachea (tr.). <b>(b) </b><it>CG15118 </it>is expressed specifically in Bolwig's organ (b.o.), along with broad staining in the brain, ventral nerve cord, anal pad, hindgut, and faintly throughout the embryo. It is classified as belonging to a broad cluster, 1B, as well as the Bolwig's organ cluster, 21R. <b>(c-f) </b><it>Fas3 </it>has a complex expression pattern and is annotated with 27 individual annotation terms. At stage 12, it is expressed in various epithelia, including the clypeolabrum PR (clyp.PR) (c) and dorsal epidermis primordium (dorsi.epi.PR) (d), the visceral muscle PR (e) and the brain PR (not shown). At stage 15, <it>Fas</it>-<it>3 </it>is expressed in the central nervous system, including the midline, along with visceral muscle and various epithelial structures, including the trachea, hindgut, foregut, clypeolabrum, and epidermis (epi) (f). <it>Fas</it>-<it>3 </it>belongs to three clusters: 7R, the early epithelial pattern; 19R, visceral muscle; and 14R, the midline/CNS cluster.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-7"/>
            </fig>
            <p>Can we estimate the number of distinct expression patterns in <it>Drosophila </it>embryogenesis? When we use a relatively conservative measure, requiring that genes share 75% or more of their annotation terms to be considered 'indistinguishable', we identify 173 multi-gene groups and 1,141 singletons among the genes in our restricted clusters. Thus, by removing the broad genes, which are prone to inconsistent annotation, the number of groups within our dataset based on this measure drops from 2,197 to 1,314, providing one estimate of the number of 'distinct' patterns (Additional data file 6). On the other hand, these patterns are not unrelated. We consider the 29 restricted clusters the most prominent recurring patterns in the dataset, and we can only speculate where to place the biologically significant number of patterns within these two extremes. It is clear that the clusters are not homogeneous since 41% of the genes exhibit composite patterns. If we look at all observed combinations of cluster assignments, we find 454 distinct combinations, and 287 of these cluster combinations consist of a single gene. We favor the idea that many of the composite patterns observed result from a simple additive combination of the basic patterns driven by independently acting <it>cis</it>-regulatory modules. Direct examination of the patterns that each of these <it>cis</it>-regulatory modules generates in transgenic reporter assays, rather than the patterns of entire genes, will be more powerful in revealing the underlying mechanisms and logic governing the generation and evolution of each gene's expression pattern.</p>
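            <p>The grouping criterion described above can be illustrated with a minimal Python sketch. It is not the procedure used to produce the counts reported here: the similarity measure (shared terms divided by the size of the larger term set), the single-linkage merging, the function names shared_fraction and group_genes, and the toy gene annotations are all assumptions made for illustration.</p>

```python
from itertools import combinations

def shared_fraction(terms_a, terms_b):
    """Fraction of annotation terms shared, relative to the larger term set (assumed measure)."""
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a.intersection(terms_b)) / max(len(terms_a), len(terms_b))

def group_genes(annotations, threshold=0.75):
    """Single-linkage grouping: two genes are joined when they share at least `threshold` of their terms."""
    parent = {gene: gene for gene in annotations}

    def find(gene):
        # Follow parent pointers to the group representative, shortening the path as we go.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in combinations(annotations, 2):
        if shared_fraction(annotations[a], annotations[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for gene in annotations:
        groups.setdefault(find(gene), []).append(gene)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical toy annotations (gene -> set of CV terms); not taken from the dataset.
toy = {
    "geneA": {"brain", "ventral nerve cord", "midline", "trachea"},
    "geneB": {"brain", "ventral nerve cord", "midline", "trachea"},
    "geneC": {"brain", "ventral nerve cord", "midline"},
    "geneD": {"fat body", "yolk nuclei"},
}
groups = group_genes(toy)
multi_gene = [g for g in groups if len(g) > 1]
singletons = [g for g in groups if len(g) == 1]
print(len(multi_gene), "multi-gene group(s),", len(singletons), "singleton(s)")
```

            <p>In this toy example, geneA, geneB and geneC fall into one multi-gene group and geneD remains a singleton. A stricter definition of similarity, or complete-linkage merging, would yield more and smaller groups, so the number of 'distinct' patterns depends on such choices.</p>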
         </sec>
         <sec>
            <st>
               <p>Relatedness of distinct tissues</p>
            </st>
            <p>Besides grouping genes according to the similarity of gene expression patterns, we used our annotation dataset to define relatedness among tissues based on the similarity of the set of genes expressed in them. Figure <figr fid="F8">8</figr> shows a network plot where tissues were connected by flexible links proportional to the fraction of commonly expressed genes and a force-directed layout was used to bring more similar tissues into proximity with each other. Tissues within individual organ systems, such as muscle (green), CNS (purple), and peripheral nervous system (violet), cluster tightly. The Bolwig's organ is isolated from the rest of the tissues, highlighting its distinct set of expressed genes. Similarly, tissues such as germ cells and amnioserosa, ring gland, stomatogastric nervous system, Malpighian tubule, midgut and garland cells share relatively few expressed genes with other tissues. In contrast, the genes expressed in the posterior spiracle, despite forming their own cluster (Additional data file 9e), appear to be components of many other tissues. As noted above, yolk nuclei, fat body and plasmatocytes share expression of a significant number of genes. In this representation, these structures are weakly related to lymph gland, which in turn shares expressed genes with the circulatory system. Many of the genes expressed in the oenocyte are also expressed in crystal cells, lymph gland, ring gland, midline, gonad and circulatory system.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Network representation of tissue relatedness</p>
               </caption>
               <text>
                  <p>Network representation of tissue relatedness. Nodes represent collapsed annotation terms and edges represent the correlation between expression in each pair of terms. Only tissues that share a statistically significant number of genes are linked and the strength of the links is proportional to the number of genes the two tissues have in common. Tissues that share very few or no genes repel each other. The system is allowed to reach a low energy level in two-dimensional space under a physical spring model (force directed layout). Collapsed annotation terms are color-coded according to their organ system assignments as used throughout.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-8"/>
            </fig>
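            <p>The link weights underlying such a network can be sketched in a few lines of Python. This is an illustrative sketch only: the Jaccard fraction as the measure of shared genes, the simple count cutoff min_shared standing in for the statistical significance filter, the function name tissue_links, and the toy gene sets are assumptions and do not reproduce the analysis behind Figure 8.</p>

```python
from itertools import combinations

def tissue_links(expressed, min_shared=1):
    """Return {(tissue_a, tissue_b): weight}; weight is the Jaccard fraction of shared genes."""
    links = {}
    for a, b in combinations(sorted(expressed), 2):
        shared = expressed[a].intersection(expressed[b])
        # A plain count cutoff stands in for a significance test (assumption).
        if len(shared) >= min_shared:
            links[(a, b)] = len(shared) / len(expressed[a].union(expressed[b]))
    return links

# Hypothetical toy data (tissue -> set of expressed genes); not taken from the dataset.
toy_tissues = {
    "hindgut": {"g1", "g2", "g3", "g4"},
    "foregut": {"g1", "g2", "g3", "g5"},
    "midgut": {"g6", "g7", "g3"},
    "Bolwig's organ": {"g8", "g9"},
}
links = tissue_links(toy_tissues, min_shared=2)
for pair, weight in sorted(links.items(), key=lambda kv: -kv[1]):
    print(pair, round(weight, 2))
```

            <p>The resulting weighted edges could then be passed to any force-directed layout routine, in which linked tissues attract in proportion to their weight and unlinked tissues repel.</p>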
            <p>The largest, most interconnected set of structures roughly corresponds to the epithelial pattern defined by clusters 5R, 6R and 7R. Notably, the salivary gland duct is isolated from the salivary gland body, reflecting their functional divergence and differential gene expression. The salivary gland duct and trachea are linked by their shared expression of genes required for cuticle deposition. In terms of gene expression, the anal pads are more similar to the hindgut than to other epidermal structures. The large distance between neural and other ectodermal derivatives suggests that specification of neuronal versus epidermal cell fate leads to profound genome-wide changes in transcription. Patterns within the digestive system are interesting: while hindgut and foregut expression are strongly correlated, midgut expression is markedly different despite its functional and spatial relatedness, reflecting its distinct developmental origin.</p>
         </sec>
         <sec>
            <st>
               <p>Relationship between expression and function</p>
            </st>
            <p>Determining a gene's pattern of expression is a key step towards understanding its function during development. The functions of many genes have been determined, either by direct experimental analysis or by sequence homology, and compiled by the GO consortium <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Additionally, the Uniprot database catalogs protein domains and provides phylogenetic relationships <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. For each of our 6,003 genes, we identified associated GO terms and Uniprot domains and determined the relative distribution of these terms and domains within the broad versus restricted clusters (Figure <figr fid="F9">9</figr>), highlighting categories containing less than 20% or more than 80% restricted genes. As discussed before, broad clusters are heavily enriched for genes involved in core cellular processes, such as translation, protein degradation, cell division, energy metabolism and RNA binding. The majority of transcripts for RNA binding proteins are deposited maternally into the early embryo, highlighting the necessity for mRNA processing prior to the onset of zygotic transcription. Restricted clusters are enriched in genes with sequence-specific DNA-binding domains and signaling molecules and also contain a large number of the genes involved in cuticle formation.</p>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Distribution of GO annotations and Uniprot domains within broad versus restricted clusters</p>
               </caption>
               <text>
                  <p>Distribution of GO annotations and Uniprot domains within broad versus restricted clusters. GO annotations (left) and Uniprot domains (right) are plotted on a number line according to the relative fraction of genes contained within broad versus restricted expression clusters. We label categories where at least 80% of the genes with patterns belong to either broad or restricted clusters.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-9"/>
            </fig>
            <p>To examine the enrichment of GO and Uniprot categories in individual gene expression clusters, we performed exhaustive pair-wise comparisons <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. We used the binomial test to evaluate the statistical significance of overlaps between sets of genes defined by the different data-sources. In order to correct the significance estimates for multiple testing we determined the empirical chance distribution by performing a large number of random permutations of gene functional assignments and determining the rate at which we attained particular <it>p </it>values. We interpolated these results using a log-linear regression function to fit the empirical distribution (Materials and methods). The results of this analysis are shown in Additional data file 8, which lists all GO essential (Materials and methods <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>) and Uniprot categories significantly enriched in gene expression clusters (those with an adjusted <it>p </it>value of less than 0.05 and 3-fold or greater enrichment).</p>
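            <p>The kind of test described above can be sketched in Python using only the standard library. The sketch is a simplified illustration rather than the implementation used in this study: it treats a single cluster-category pair instead of all pair-wise comparisons, it omits the log-linear interpolation, and the function names (binomial_upper_tail, enrichment, empirical_rate) and the toy gene sets are assumptions.</p>

```python
import math
import random

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k category genes in a cluster of n."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def enrichment(cluster, category, universe):
    """Return (fold enrichment, binomial p value) for a category within a cluster."""
    background = len(category.intersection(universe)) / len(universe)
    overlap = len(cluster.intersection(category))
    fold = (overlap / len(cluster)) / background
    return fold, binomial_upper_tail(overlap, len(cluster), background)

def empirical_rate(cluster, category, universe, observed_p, n_perm=1000, seed=0):
    """Fraction of random category reassignments whose p value is as small as the observed one."""
    rng = random.Random(seed)
    genes = sorted(universe)
    hits = 0
    for _ in range(n_perm):
        # Reassign the category to a random gene set of the same size.
        shuffled = set(rng.sample(genes, len(category)))
        if observed_p >= enrichment(cluster, shuffled, universe)[1]:
            hits += 1
    return hits / n_perm

# Hypothetical toy data: 200 genes, a 20-gene cluster, and a 20-gene category
# of which 10 genes fall inside the cluster.
universe = {"g%d" % i for i in range(200)}
cluster = {"g%d" % i for i in range(20)}
category = {"g%d" % i for i in range(10)}.union({"g%d" % i for i in range(100, 110)})

fold, p_value = enrichment(cluster, category, universe)
rate = empirical_rate(cluster, category, universe, p_value)
print("fold enrichment:", round(fold, 2), "p value:", p_value, "empirical rate:", rate)
```

            <p>Here half of the cluster belongs to a category that covers one tenth of all genes, giving a five-fold enrichment. The empirical rate estimates how often a p value this small arises by chance when functional assignments are shuffled.</p>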
            <p>To summarize the functional associations of gene expression clusters, we created a force-directed layout network, which brings into close proximity clusters and GO/Uniprot categories sharing a significant number of genes (Figure <figr fid="F10">10</figr>). In the force-directed layout, restricted and broad clusters separate robustly, with the notable exception of germ cell cluster 22R, which associates strongly with functions typical of broad maternal genes. This connection may be due to the fact that restriction of transcripts to the germ line lineage is often a consequence of protection of maternal message from degradation in early forming pole cells. Another cluster that violates the broad versus restricted separation is cluster 8B, which shows maternal-only expression based on <it>in situ </it>photographs but is enriched for genes involved in cuticle metabolism. Since formation of the cuticle effectively prevents RNA <it>in situ </it>hybridization, we propose that the genes in cluster 8B are likely expressed during late embryogenesis in a pattern resembling epithelial expression (similar to clusters 5R and 6R), although this pattern cannot be visualized by the standard <it>in situ </it>protocol. The late spike in the average array profile of cluster 8B genes supports this notion.</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>Network representation of the relationship between gene expression and gene function</p>
               </caption>
               <text>
                  <p>Network representation of the relationship between gene expression and gene function. Thirty-nine gene expression clusters (broad and restricted) together with the most significantly enriched GO terms and Uniprot domains (italicized) are organized in two-dimensional space by a force directed layout as in Figure 8. The strength of links between expression clusters and GO/Uniprot terms is determined by the level of enrichment of the GO/Uniprot term within the expression cluster (using z-scores in Additional data file 8). The strength of links between pairs of expression clusters and pairs of GO/Uniprot terms is determined by comparing similarity with respect to the opposite class (so that expression clusters are compared with respect to the GO/Uniprot terms they have similarity with, and vice versa; see Materials and methods). Expression cluster representative <it>in situ </it>images: Cl1B <it>CG12792</it>; Cl2B <it>CG4567</it>; Cl3B <it>CG4078</it>; Cl4B <it>CG2656</it>; Cl5B <it>CG3227</it>; Cl6B <it>CG7375</it>; Cl9B <it>CG8464</it>; Cl10B <it>CG13349</it>; Cl1R <it>CG3246</it>; Cl2R <it>CG8066</it>; Cl3R <it>CG2233</it>; Cl4R <it>CG4829</it>; Cl5R <it>Osi14</it>; Cl6R <it>CG32209</it>; Cl7R <it>CG12676</it>; Cl8R <it>CG14756</it>; Cl9R <it>CG10527</it>; Cl10R <it>CG1246</it>; Cl11R <it>CG6337</it>; Cl12R <it>CG9468</it>; Cl13R <it>CG15651</it>; Cl14R <it>CG31764</it>; Cl15R <it>CG14762</it>; Cl16R <it>CG18675</it>; Cl17R <it>CG8888</it>; Cl18R <it>CG6429</it>; Cl19R <it>CG8780</it>; Cl20R <it>CG15209</it>; Cl21R <it>CG4468</it>; Cl22R <it>CG9925</it>; Cl23R <it>rib</it>; Cl24R <it>CG8147</it>; Cl25R <it>CG8965</it>; Cl26R <it>odd</it>; Cl27R <it>CG12177</it>; Cl28R <it>CG13653</it>; Cl29R <it>CG10967</it>.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-10"/>
            </fig>
            <p>Interestingly, cluster 7R, containing genes with early (stage 12) onset epithelial expression, clearly separates from 5R and 6R, which contain genes with late epithelial expression (stages 13-16). Early epithelial expressing genes are associated with GO terms for tissue specific functions, such as membrane trafficking, morphogenesis, cell polarity, motility and adhesion, which makes them similar to genes found in the early blastoderm patterning gene cluster (cluster 26R). In contrast, late epithelial clusters (clusters 5R and 6R) associate clearly with cuticle formation in terminally differentiated tissues. This is the best example in our dataset of separation between regulatory developmental genes and effector genes <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> of the terminal cell fates.</p>
            <p>Genes in cluster 24R are expressed in yolk, mesoderm, dorsal ectoderm and anterior and posterior endoderm anlagen at the blastoderm stage. Consistent with this early expression, these genes are expressed later in differentiated midgut, yolk, fat body and plasmatocytes. The force directed layout suggests that these genes are functionally related to clusters 1R-4R, which contain genes expressed in yolk, fat body and blood and involved in metabolite transport. Cluster 24R clearly separates from other blastoderm stage clusters, suggesting that for these particular tissues, specific effector genes are required early in and throughout embryonic development.</p>
            <p>GO terms related to membrane trafficking, such as secretory pathway, vesicle transport, Golgi apparatus, and ER, assume a central position in the layout with numerous connections to diverse clusters both broad and restricted. This likely reflects the requirement of these core cellular processes in diverse cell types, but also indicates that there are tissue specific differences in the utilization of these pathways. The modulation of these pathways is mediated by GTPases <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, which exhibit similar connectivity patterns in the force directed layout (Figure <figr fid="F10">10</figr>).</p>
            <p>CNS and muscle clusters associate with the expected GO terms for nerve impulse transmission and muscle contraction, respectively. Interestingly, despite their clear functional specialization, both tissues show a common requirement for components of the extracellular matrix.</p>
            <p>Another way to uncover relationships between gene expression and gene function is to examine the representation of GO terms in individual tissues using the 'anatograms' (Figure <figr fid="F11">11</figr>). For example, transcriptional regulators are predominantly expressed in the developing and mature nervous systems (Figure <figr fid="F11">11a</figr>). Regulation of transcription initiation by sequence-specific transcription factors is the primary mechanism used to generate tissue-specific gene expression. We determined the gene expression patterns for 238 transcription factors with sequence-specific DNA binding domains; at least one transcription factor is expressed in every tissue type recognized by our annotation hierarchy. We examined the two most abundant transcription factor classes, those with C2H2 zinc finger domains (Figure <figr fid="F11">11b</figr>) and those with homeobox domains (Figure <figr fid="F11">11c</figr>), and found that these domains show similar overall distributions, suggesting that they are deployed to regulate a similar range of developmental processes.</p>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Anatogram summary for selected GO and Uniprot categories</p>
               </caption>
               <text>
                  <p>Anatogram summary for selected GO and Uniprot categories. Anatograms are used to summarize gene expression for selected <b>(a-j) </b>GO terms and <b>(k,l) </b>Uniprot protein domains. Categories related to transcriptional regulation (a-c) are boxed, as are two categories strongly enriched in clusters 5R and 6R representing epithelial patterns (k,l). Tissues discussed in the main text are labeled.</p>
               </text>
               <graphic file="gb-2007-8-7-r145-11"/>
            </fig>
            <p>Cell adhesion molecules are similar to transcription factors in that they are expressed early in development in a number of anlagen, and are later abundant in the nervous system. In addition, these molecules are moderately enriched in differentiated epidermal derivatives (Figure <figr fid="F11">11d</figr>). Cytoskeletal components are enriched in the nervous system and muscles, suggesting that the tissue relatedness observed between mesodermal and neural derivatives is dictated by shared functional requirements of these cell types (Figure <figr fid="F11">11e</figr>). Interestingly, the tissue distribution of kinases is almost indistinguishable from the genome-wide average of all genes (Figure <figr fid="F11">11f</figr>). We also find strong and specific associations between genes with particular GO functions and the tissues in which they are expressed, such as stimulus and Bolwig's organ, chitin metabolism and late epithelial patterns, and helicases and gonads (Figure <figr fid="F11">11g,h,j</figr>).</p>
            <p>Comparison of GO terms and gene expression data often leads to self-evident observations because many functional GO assignments are based on published gene expression patterns. We used the Uniprot catalog to correlate gene expression and protein domains (Figure <figr fid="F10">10</figr>). Figure <figr fid="F11">11</figr> shows several domains expressed specifically in differentiated epidermal derivatives. For example, the zona pellucida genes encode transmembrane glycoproteins that were recently shown to be critical for tracheal morphogenesis <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. These and other zona pellucida genes are expressed in the 5R/6R epithelial pattern (Figure <figr fid="F11">11k</figr>), which is consistent with a prior study of zona pellucida embryonic expression <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. A novel domain that apparently exists only in flies, DUF243, is found almost exclusively in proteins encoded by genes with the late 5R pattern (Figure <figr fid="F11">11l</figr>). These tight associations of functional sequence properties and patterns of gene expression provide useful insights into how regulatory strategies are dictated by gene function.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have described the most complete set of data on spatial and temporal patterns of gene expression during embryogenesis that has been compiled for any metazoan organism. The extent, quality, and unbiased nature of this dataset allowed us to describe and explore gene expression patterns during embryogenesis on a genome-wide basis. Below we discuss three issues: how these data can be used as a resource by biologists; the inherent challenges in analyzing such a complex set of data; and what we learned about global strategies for regulating gene expression during embryonic development of a complex multi-cellular organism.</p>
         <sec>
            <st>
               <p>Utility of the dataset</p>
            </st>
            <p>The dataset we assembled can be used in several ways. First, it provides a rich source of candidate genes for further in-depth study. Researchers interested in a particular developmental process, for example, morphogenesis of the salivary gland, can search our annotations and retrieve a list of genes that are expressed in that structure. Such a gene set can be further subdivided by manual curation, using our primary image data. Second, the clustering classification allows one to address more abstract questions, such as: which genes are expressed in a regulated manner at cellular blastoderm? And which genes are involved in organogenesis in the late embryo? Finally, the dataset represents a starting point for an analysis of the sequence determinants of gene expression patterns. Clustering provides gene groupings based on spatio-temporal gene expression, ranging from unique patterns, through small tightly co-regulated gene sets, to large gene expression classes. These classes can be tested against <it>cis</it>-regulatory prediction pipelines to identify significant associations between gene expression specificity and genomic sequence features.</p>
            <p>Determining expression patterns is only a first step towards further understanding gene function and, therefore, it is important to intersect our spatial expression data with other genomic datasets. Our tools allow anyone with a list of genes, for example, derived from a targeted microarray analysis, to obtain the spatio-temporal expression patterns of these genes in the <it>Drosophila </it>embryo. To address the difficulty of summarizing the gene expression patterns of a group of genes, we developed a new visual aid, the anatogram. Anatograms show the 'position' of a given gene set in the complex space of spatio-temporal gene expression patterns and represent a convenient way to summarize such data for groups of genes. Anatograms also provide an intuitive comparison of differences among groups of genes, which can supplement more rigorous statistical comparisons. Any list of genes can serve to generate an anatogram; for example, the list of <it>Drosophila </it>genes homologous to a gene group in another organism, or the genes that contain a particular sequence motif. In this way, anatograms can be used to compare results from gene expression studies among different species. The color code is based on organ systems shared by metazoan organisms and can be adapted to spatio-temporal gene expression data from other animals, providing an organism-independent way to present spatial gene expression data.</p>
         </sec>
         <sec>
            <st>
               <p>Analysis of annotated gene expression patterns</p>
            </st>
            <p>Our results suggest that parallel microarray analysis should be an integral part of any <it>in situ </it>hybridization survey of developmental processes. Microarrays provide independent measurements that help control the artifacts of <it>in situ </it>hybridization methods, and also provide a quantitative measure of gene expression that is especially important for the interpretation of broadly expressed genes. The combined analysis of these two datasets is synergistic. <it>In situ </it>hybridization reveals the spatial diversity in tight temporal clusters and microarray clustering reduces the artificial diversity introduced by assigning annotations based on the qualitative <it>in situ </it>assay.</p>
            <p>In the context of an anatomically well-described system such as <it>Drosophila </it>embryogenesis, it is possible to achieve great precision in expression pattern description. However, making distinctions based on the fine details of patterns, such as different subsets of the CNS, can be problematic when examining genes one by one. We found that it was useful to reduce the granularity of the CV to the level where the annotation assignments are most reliable. This approach necessarily underestimates the true diversity of expression patterns; for example, the expression of <it>GstS1 </it>in a distinct subset of cells in the midgut was annotated simply as midgut. On the other hand, this approach enables description of undefined subsets of cells and their grouping with the correct higher order structures. The fine details of differences among expression patterns on a cellular level can be addressed by comparing images of the individual members of the broader groups defined by CV annotation, or by double labeling <it>in situ </it>experiments <abbrgrp><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr></abbrgrp>. A complementary approach to studying gene expression of transcription factors at the blastoderm stage uses high-resolution three-dimensional confocal imaging of fluorescently labeled, fixed specimens, followed by computational segmentation analysis <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>.</p>
            <p>Many genes are expressed ubiquitously but non-uniformly, giving the appearance of a restricted expression pattern. Identifying and correctly categorizing such ubiquitous patterns is important because their description with the standard vocabulary makes it difficult to separate them from genes with true restricted expression patterns. We identified two major classes of ubiquitous patterns, a midgut CNS pattern and an endoderm mesoderm pattern. Late in embryogenesis these differentially stained structures become apparent, whereas immediately after gastrulation there are no apparent differences among the ubiquitous patterns.</p>
         </sec>
         <sec>
            <st>
               <p>Gene expression patterns in development</p>
            </st>
            <p>Embryonic development encompasses the complete spectrum of developmental and cell biological processes and, thus, it is not surprising that we detect the expression of 80% of the 6,003 genes we studied. Even this high number underestimates the number of genes expressed during embryogenesis. Our microarray data indicate that the late embryonic genes escaped detection in our <it>in situ </it>assay presumably because deposition of the cuticle prevents entry of the probe. In contrast, genes that are expressed in a very small subset of embryonic cells are more likely to be detected by <it>in situ </it>hybridization than by microarray analysis (data not shown).</p>
            <p>Of genes in our unbiased set, 45% are expressed in broad patterns. Broad genes tend to encode proteins that mediate core cellular processes and their apparent patterns reflect quantitative differences in requirements for basic cellular machineries in different tissues, especially late in embryogenesis.</p>
            <p>Of the genes in our dataset, 35% show spatially and/or temporally restricted gene expression. Our data reveal a tremendous diversity of gene expression patterns. Sets of genes that exhibit exactly the same tissue specific gene expression are rare and usually limited to mature organs. Genes with identical restricted expression patterns spanning multiple stages of embryogenesis were not found, even at the limited resolution level offered by our imaging technique. Genes that are expressed during mid-embryogenesis in a specific tissue very frequently show unrelated patterns earlier and later in development. Consequently, genes that serve as lineage markers by being expressed in a given organ system from anlagen, through primordia to final differentiated organs are rare and, for the most part, had already been discovered by genetic analysis.</p>
            <p>In order to classify the complex expression patterns, we used a fuzzy clustering approach that allows a gene to participate in multiple clusters. We found that nearly all genes with restricted patterns fell into one of six clearly distinguishable restricted pattern types: yolk, blood and fat; epithelia; nervous system; muscle; blastoderm; or other, less frequent, organ specific patterns. Within each of these basic types, several subtypes were distinguished by their preferential expression in particular combinations of tissues.</p>
            <p>Remarkably, 41% of the genes belong to more than one cluster, underscoring the diversity of gene expression. It is perhaps expected that the majority of gene expression patterns will be unique when one considers all developmental stages. The diversity of patterns suggests that many genes are turned on independently multiple times in development. It is less intuitive that, in terminally differentiated tissues, many genes are expressed in multiple organ systems. The existence of expression clusters indicates that the restriction of gene activities within organ systems and developmental lineages frequently occurs, whereas the fuzziness of the clusters suggests that expression in atypical combinations of tissues can be achieved. It will be interesting to investigate the <it>cis</it>-regulatory code that is responsible for initiating common patterns of gene expression and the potential for diversity in the control of gene expression. Since the control of gene expression is thought to be modular, it is possible that combinations of significantly smaller numbers of regulatory modules achieve the overall diversity of patterns.</p>
            <p>What is the functional significance of the observed pattern diversity? Are all the minute features of the vast number of unique patterns necessary to carry out development? Or is the complexity of patterns largely a consequence of position effects in the proximity of regulatory modules that have little deleterious effect? Careful comparisons of gene expression patterns across multiple closely related species should reveal the patterns that are under evolutionary constraint. Our genome-wide dataset of patterns in <it>D. melanogaster </it>serves as a starting point for further investigation of genomic regulatory networks in development and their evolution.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <sec>
            <st>
               <p>Data collection</p>
            </st>
            <p>Large-scale production of gene expression patterns by RNA <it>in situ </it>hybridization to <it>Drosophila </it>embryos was performed as described <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Briefly, we used digoxigenin-labeled RNA probes derived primarily from sequenced cDNAs to visualize gene expression patterns in <it>Drosophila </it>embryos by <it>in situ </it>hybridization and documented the expression patterns by digital microscopy. The histochemical color reaction was stopped in all wells of the 96-well plate at the same time, once a staining pattern had appeared for the three included control probes and in most wells of the plate (1-1.5 hours at 37&#176;C). Individual embryo images were assigned to one of six stage ranges that coincide with major developmental transitions in embryogenesis, and the development of the pattern across time was confirmed with independently derived Affymetrix microarray time course data covering the first 12.5 hours of embryogenesis <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. The images were annotated using a CV for embryonic anatomy. In the course of the primary screen, we performed 8,469 successful <it>in situ </it>experiments representing 6,580 genes. We assembled data from multiple independent experiments for 1,514 (23%) genes (labeled RNA probes were generated separately for each experiment). The same EST clone was used as the source for the probe in 1,241 of the multiple experiments, while different ESTs were used as the source in the remaining 273. Low-resolution production images were captured for all 6,003 genes. No high-resolution images were captured for genes labeled as maternal, ubiquitous or no expression (2,638). At least one high-resolution image was captured for the remaining 3,365. Of these, 2,202 have high-resolution images at all six stage ranges, and 1,163 are missing at least one stage range. We captured high-resolution images only at stage ranges when a gene was expressed.</p>
         </sec>
         <sec>
            <st>
               <p>Annotation</p>
            </st>
            <p>The primary curator (AB) assigned anatomical terms from the CV concurrently with image acquisition for each gene, providing a first pass annotation of its expression pattern. When the dataset was finalized, a second curator (VH) reviewed and edited the initial annotation assignments. In this second round of annotation, genes with similar or related patterns of expression were examined side by side; these comparisons allowed us to significantly improve the internal consistency of our annotations.</p>
            <p>We used two approaches to review annotations (Additional data file 1). Each approach generated lists of genes with related expression patterns that were then used by the second-round curator to review the annotations of individual genes for consistency. The first approach was purely image-based and did not make use of the first round annotations. For each of the first four stage ranges, we examined images of embryos from all 6,580 genes. We developed a software tool for displaying these images in batches, and a subset of images sharing a particular feature (for example, showing regulated expression at cellular blastoderm) was manually selected. These gene lists were further subdivided until meaningful subsets could no longer be identified.</p>
            <p>The second approach used the first round annotations to generate lists of genes ordered by similarity to particular sets of CV terms. We developed a software tool that generated lists for any arbitrary set of CV terms, but we found it most productive to define 12 relatively independent sets, each focused on a single organ system, that together covered the entire annotation hierarchy.</p>
            <p>The lists generated by the two approaches were used to re-annotate similar genes <it>en masse </it>to make the resulting annotations as uniform as possible. CV terms were added or deleted as necessary, and genes with satisfactory and complete annotations were deemed finished and removed from all re-annotation lists. As part of this process the curator removed 577 (9%) of the genes from the dataset when the quality of the primary data was judged to be insufficient to support high-quality annotation.</p>
         </sec>
         <sec>
            <st>
               <p>Annotation hierarchy</p>
            </st>
            <p>To describe the spatial and temporal gene expression patterns in embryogenesis, we used only two types of relationships, 'part of' to cover spatial relations and 'develops from' to cover temporal relationships among structures. Importantly, we linked terms to six developmental stage ranges and used 'develops from' relationships exclusively to link terms that belong to consecutive stage ranges. Our anatomical terms were organized into a hierarchical tree, starting with stage range 1-3, which used only two CV terms (maternal, pole plasm), and progressively branching through the six stage ranges until stage range 13-16 with 126 anatomical terms (Additional data file 3). Anatomical structures that are contained within a larger structure are linked to the larger structure by the 'part of' relationship (for example, 13-16 midline glia is 'part of' 13-16 midline). Anatomical structures that develop from one another across time are linked by the 'develops from' relationship (for example, 13-16 'midline' develops from 11-12 'midline primordium' (PR)). CV terms can simultaneously have both the 'part of' and the 'develops from' relationships (for example, 'midline glioblast' is part of 'midline primordium', and 'midline glia' develops from the 'midline glioblast'). Every term occurs in the hierarchy only once. In a few cases where two terms develop into a single later structure (for example, 'anterior and posterior midgut primordium' forming 'midgut'), the strictly hierarchical nature of the tree is broken, and both were linked to the child term ('midgut') by the 'develops from' relationship. This fits the directed acyclic graph (DAG) format that is used to capture many biological ontologies.</p>
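            <p>The structure of such a hierarchy can be illustrated with a minimal Python sketch. It is not the software used in this study. The edge list contains only the examples named above; the stage-range prefixes on the midgut and midline glioblast terms, and the names EDGES, build_graph, ancestors and is_acyclic, are assumptions made for illustration.</p>

```python
from collections import defaultdict

# (term, relation, target) triples; only the two relations named in the text are used.
EDGES = [
    ("13-16 midline glia", "part_of", "13-16 midline"),
    ("13-16 midline", "develops_from", "11-12 midline primordium"),
    ("11-12 midline glioblast", "part_of", "11-12 midline primordium"),
    ("13-16 midline glia", "develops_from", "11-12 midline glioblast"),
    # Two earlier terms converging on one later structure (stage labels assumed).
    ("13-16 midgut", "develops_from", "11-12 anterior midgut primordium"),
    ("13-16 midgut", "develops_from", "11-12 posterior midgut primordium"),
]


def build_graph(edges):
    """Index the edge list by source term."""
    graph = defaultdict(list)
    for term, relation, target in edges:
        graph[term].append((relation, target))
    return graph


def ancestors(graph, term):
    """Return every term reachable from `term` by following either relation."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, target in graph.get(current, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def is_acyclic(graph):
    """True when no term can reach itself, as a DAG requires."""
    return all(term not in ancestors(graph, term) for term in list(graph))


GRAPH = build_graph(EDGES)
```

            <p>In this sketch, a term with two 'develops from' targets (the assumed midgut entries) is what makes the structure a DAG rather than a strict tree.</p>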
            <p>Many specific structures representing small subsets of tissues had very few or none of the 6,003 interrogated genes expressed within them (50 structures had 8 genes or fewer). We summarized the data by focusing on a subset of 145 structures that make up the most common and readily distinguishable structures in our dataset. Genes annotated with more specific structures were collapsed into more general parent structures. For example, the terms 'dorsal epidermis', 'dorsal apodeme', 'dorsal histoblast nest abdominal', 'dorsal ridge' and 'leading edge cell' were collapsed into 'dorsal ectoderm'. We distinguished two levels of collapsing. Within a stage range, we collapsed the 'part of' relationships to the parent term. The resulting 'blocks' of terms represent the most relevant units of embryo anatomy for describing the RNA <it>in situ </it>results. Several such blocks may be defined within a single organ system, for example, 'trunk and head somatic and visceral musculature' (TrunkSomMusc, HeadSomMusc, TrunkViscMusc, HeadViscMusc) in the muscle system. We also collapsed terms referring to the same organ system across a range of stages: structures from stage range 4-6 were collapsed into early organ systems anlagen; structures from stage ranges 7-8 and 9-10 into mid organ systems (Additional data file 4); and structures from stage ranges 11-12 and 13-16 into late organ systems (Additional data file 5). For example, Endocrine_heart refers to the combined anatomical terms for all components of the circulatory and endocrine related structures at stages 11-16 (combining blocks 11-12 CardioVAsc, 11-12 RingGland, 13-16 CardMeso, 13-16 RingGland).</p>
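            <p>The first level of collapsing can be sketched as a simple term-to-parent mapping, shown here in Python for the 'dorsal ectoderm' example above. The gene identifiers in the usage note and the names COLLAPSE_TO and collapse are hypothetical, and the sketch assumes that terms without a mapping are kept unchanged.</p>

```python
# Specific terms and the parent term they collapse into (example taken from the text).
COLLAPSE_TO = {
    "dorsal epidermis": "dorsal ectoderm",
    "dorsal apodeme": "dorsal ectoderm",
    "dorsal histoblast nest abdominal": "dorsal ectoderm",
    "dorsal ridge": "dorsal ectoderm",
    "leading edge cell": "dorsal ectoderm",
}


def collapse(annotations, mapping=COLLAPSE_TO):
    """Replace each annotated term by its parent term; unmapped terms are kept."""
    return {
        gene: {mapping.get(term, term) for term in terms}
        for gene, terms in annotations.items()
    }
```

            <p>For example, a hypothetical gene annotated with both 'dorsal ridge' and 'leading edge cell' would carry the single term 'dorsal ectoderm' after collapsing. Collapsing across stage ranges could be expressed as a second mapping of the same form.</p>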
         </sec>
         <sec>
            <st>
               <p>Linear annotation profiles</p>
            </st>
            <p>Enrichment in the linear annotation profiles was displayed as the statistical significance of the over- or under-representation in the number of genes annotated with the given structure. The expected number of genes was modeled as a binomial distribution with parameters <it>n </it>(the number of genes in the list under study) and <it>p </it>(the frequency of the given structure in the dataset as a whole, <inline-formula><m:math name="gb-2007-8-7-r145-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:msub><m:mi>N</m:mi><m:mi>s</m:mi></m:msub></m:mrow><m:mrow><m:mn>4</m:mn><m:mo>,</m:mo><m:mn>759</m:mn></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaadaWcaaqaaiaad6eadaWgaaWcbaGaam4CaaqabaaakeaacaaI0aGaaiilaiaaiEdacaaI1aGaaGyoaaaaaaa@38EF@</m:annotation></m:semantics></m:math></inline-formula>, where <it>N</it><sub><it>s </it></sub>is the number of genes annotated with structure <it>s</it>, and 4,759 is the total number of genes expressed in the embryo). Under this model, the expected number of genes would be <it>np</it>, so gene counts greater than <it>np </it>received positive enrichment scores and counts less than <it>np </it>received negative enrichment scores. The enrichment score was simply the number of standard deviations above or below <it>np </it>in the distribution binomial (<it>n</it>, <it>p</it>), or the standard score (z-score).</p>
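            <p>As an illustration of the enrichment score described above, the following Python sketch computes the z-score of an observed gene count under a binomial (n, p) model, using the binomial mean np and standard deviation sqrt(np(1 - p)). The function and argument names are assumptions, and returning 0 when the standard deviation is 0 is a convention chosen for this sketch only.</p>

```python
import math

TOTAL_EXPRESSED = 4759  # total number of genes expressed in the embryo (from the text)


def enrichment_score(observed, n, n_s, total=TOTAL_EXPRESSED):
    """Standard score of `observed` under a binomial(n, p) model with p = n_s / total.

    observed: genes in the list under study that are annotated with the structure
    n:        number of genes in the list under study
    n_s:      genes in the whole dataset that are annotated with the structure
    """
    p = n_s / total
    expected = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    if sd == 0.0:
        # Degenerate case (p is 0 or 1, or the list is empty): no spread to score against.
        return 0.0
    return (observed - expected) / sd
```

            <p>Counts above np give positive scores and counts below np give negative scores, matching the sign convention of the linear annotation profiles.</p>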
         </sec>
         <sec>
            <st>
               <p>Fuzzy clustering</p>
            </st>
            <p>There were 4,496 genes detected in at least one tissue by <it>in situ </it>hybridization and present on the Affymetrix <it>Drosophila </it>1.0 gene chip. Microarray data were extracted and normalized as described in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. An additional time point of wild-type flies at 16 hours post egg-laying was obtained from Tiago Magalh&#227;es and normalized with the previous 12 time points using Bioconductor's RMA <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp> package.</p>
            <p>Input to the fuzzy clustering algorithm was the <it>g </it>&#215; <it>s </it>binary matrix <it>S </it>of CV annotations (S for 'spatial') where <it>S</it><sub><it>i</it>, <it>j </it></sub>is 1 when gene <it>i </it>is annotated with term <it>j</it>, and 0 otherwise; <it>g </it>is the number of genes in the dataset and <it>s </it>is the number of annotation terms. We used the 145-term collapsed version of the annotations for all clustering. An additional input matrix <it>L </it>(for 'levels') was the real-valued <it>g </it>&#215; <it>t </it>matrix containing the normalized microarray values <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>, where <it>t </it>is the number of microarray time points.</p>
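            <p>A minimal Python sketch of how a binary annotation matrix of this form could be assembled from per-gene term sets is given below. The function name, the input format (a dictionary mapping gene identifiers to sets of CV terms) and the sorted row order are assumptions made for illustration.</p>

```python
def annotation_matrix(annotations, terms):
    """Return (genes, S), where S[i][j] is 1 if gene i is annotated with term j, else 0.

    annotations: dict mapping a gene identifier to the set of CV terms assigned to it
    terms:       ordered list of CV terms defining the columns of S
    """
    genes = sorted(annotations)  # fixed row order so that rows can be matched to genes
    matrix = [
        [1 if term in annotations[gene] else 0 for term in terms]
        for gene in genes
    ]
    return genes, matrix
```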
            <p>The basic procedure was similar to that outlined in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, where [0,1] membership levels for each gene in each of <it>k </it>clusters were represented by a <it>g </it>&#215; <it>k </it>matrix <it>M </it>and iteratively estimated. This matrix was randomly initialized with:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>M</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mn>1</m:mn>
                                 <m:mo>+</m:mo>
                                 <m:mi>r</m:mi>
                              </m:mrow>
                              <m:mi>k</m:mi>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWGnbWaaSbaaSqaaiaadMgacaWGQbaabeaakiabg2da9maalaaabaGaaGymaiabgUcaRiaadkhaaeaacaWGRbaaaaaa@3AAC@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>r </it>is sampled from the uniform (0,1) distribution. The matrix was then re-normalized so that <inline-formula><m:math name="gb-2007-8-7-r145-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>j</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>k</m:mi></m:msubsup><m:mrow><m:msub><m:mi>M</m:mi><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>j</m:mi></m:mrow></m:msub></m:mrow></m:mstyle><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaadaaeWaqaaiaad2eadaWgaaWcbaGaamyAaiaacYcacaWGQbaabeaaaeaacaWGQbGaeyypa0JaaGymaaqaaiaadUgaa0GaeyyeIuoakiabg2da9iaaigdaaaa@3E1A@</m:annotation></m:semantics></m:math></inline-formula> for every gene <it>i</it>. A distance function <it>d</it><sub><it>i</it>, <it>j </it></sub>was calculated at each iteration and used to update membership values:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>M</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>d</m:mi>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>2</m:mn>
                                       <m:mo>/</m:mo>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>&#966;</m:mi>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>l</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>k</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msubsup>
                                          <m:mi>d</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>l</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>2</m:mn>
                                             <m:mo>/</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>&#966;</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:msubsup>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWGnbWaaSbaaSqaaiaadMgacaWGQbaabeaakiabg2da9maalaaabaGaamizamaaDaaaleaacaWGPbGaamOAaaqaaiabgkHiTiaaikdacaGGVaGaaiikaiabeA8aMjabgkHiTiaaigdacaGGPaaaaaGcbaWaaabmaeaacaWGKbWaa0baaSqaaiaadMgacaWGSbaabaGaeyOeI0IaaGOmaiaac+cacaGGOaGaeqOXdyMaeyOeI0IaaGymaiaacMcaaaaabaGaamiBaiabg2da9iaaigdaaeaacaWGRbaaniabggHiLdaaaaaa@50DF@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Here, <it>&#966; </it>is a 'fuzziness' parameter that determines the level of competition between clusters for the same gene (as <it>&#966; </it>approaches 1, it becomes a hard partition with each gene being assigned to the single best cluster, while higher values of <it>&#966; </it>cause gene memberships to be more fuzzy). While fuzzy clustering is generally thought to be most useful at <it>&#966; </it>values of 2 through 10 <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, we found that if <it>&#966; </it>was set above 1.5, the dataset would converge to one or two very fuzzy clusters composed of diffuse sets of terms. On the other extreme, if we used completely hard partitions (as in k-means, <it>&#966; </it>= 1), the majority of clusters were empty. If we used a partitioning close to 1, we found that each resulting cluster was distinct. We tried a range of values between 1 and 1.5 and used a <it>&#966; </it>of 1.05, which yielded the best results (data not shown).</p>
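            <p>The initialization and membership update described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this study: the distance values are taken as given, strictly positive inputs, and the function names are hypothetical. Each row is rescaled by its smallest distance before exponentiation, which leaves the memberships unchanged but avoids numerical overflow when &#966; is close to 1.</p>

```python
import random


def init_membership(g, k, rng):
    """Set M[i][j] = (1 + r) / k with r drawn from uniform(0, 1), then rescale rows to sum to 1."""
    raw = [[(1.0 + rng.random()) / k for _ in range(k)] for _ in range(g)]
    return [[value / sum(row) for value in row] for row in raw]


def update_membership(d, phi):
    """Membership update: M[i][j] = d[i][j]**e / sum over l of d[i][l]**e, with e = -2 / (phi - 1).

    d:   g x k list of strictly positive gene-to-cluster distances (assumed given)
    phi: fuzziness parameter, which must be greater than 1
    """
    exponent = -2.0 / (phi - 1.0)
    updated = []
    for row in d:
        smallest = min(row)
        # Ratios are at least 1, so a negative exponent keeps every weight in (0, 1].
        weights = [(dist / smallest) ** exponent for dist in row]
        total = sum(weights)
        updated.append([weight / total for weight in weights])
    return updated
```

            <p>With &#966; = 1.05 the exponent is -40, so a cluster that is only slightly closer to a gene receives nearly all of its membership, which is consistent with the near-hard partitions described above.</p>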
            <p>Iterations were stopped when the average difference in the membership matrix, &#916;<it>M</it>, dropped below 5 &#215; 10<sup>-5</sup>, where:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>&#916;</m:mi>
                           <m:mi>M</m:mi>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>n</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>k</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>abs</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msubsup>
                                                <m:mi>M</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mi>j</m:mi>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>t</m:mi>
                                                   <m:mo>+</m:mo>
                                                   <m:mn>1</m:mn>
                                                </m:mrow>
                                             </m:msubsup>
                                             <m:mo>&#8722;</m:mo>
                                             <m:msubsup>
                                                <m:mi>M</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mi>j</m:mi>
                                                </m:mrow>
                                                <m:mi>t</m:mi>
                                             </m:msubsup>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>n</m:mi>
                                             <m:mi>k</m:mi>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacqqHuoarcaWGnbGaeyypa0ZaaabCaeaadaaeWbqaamaalaaabaGaamyyaiaadkgacaWGZbGaaiikaiaad2eadaqhaaWcbaGaamyAaiaadQgaaeaacaWG0bGaey4kaSIaaGymaaaakiabgkHiTiaad2eadaqhaaWcbaGaamyAaiaadQgaaeaacaWG0baaaOGaaiykaaqaaiaad6gacaWGRbaaaaWcbaGaamOAaiabg2da9iaaigdaaeaacaWGRbaaniabggHiLdaaleaacaWGPbGaeyypa0JaaGymaaqaaiaad6gaa0GaeyyeIuoaaaa@528A@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
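            <p>The stopping rule can be sketched as follows, assuming the membership matrices from two successive iterations are stored as lists of rows (genes) by columns (clusters). The function names and the list-of-lists layout are illustrative assumptions.</p>

```python
def mean_membership_change(m_new, m_old):
    """Average absolute change per entry between two membership matrices."""
    n = len(m_new)        # number of genes
    k = len(m_new[0])     # number of clusters
    total = 0.0
    for row_new, row_old in zip(m_new, m_old):
        for a, b in zip(row_new, row_old):
            total += abs(a - b)
    return total / (n * k)


def has_converged(m_new, m_old, tolerance=5e-5):
    """True once the average change drops below the tolerance."""
    return tolerance > mean_membership_change(m_new, m_old)
```
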
            <p>We tried using classical mean <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and medoid <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> representations for cluster centroids, but these performed poorly when attempting to combine a real-valued <it>L </it>and a binary-valued <it>S</it>. Rather than maintaining a discrete centroid model to obtain distance values <it>d</it>, we defined <it>d</it><sub><it>i</it>, <it>j </it></sub>as the average distance of gene <it>i </it>to all other genes <it>k</it>, where each <it>k </it>is given the weight of its membership in cluster <it>j</it>:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>d</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>k</m:mi>
                                    <m:mo>&#8800;</m:mo>
                                    <m:mi>i</m:mi>
                                 </m:mrow>
                              </m:munder>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>M</m:mi>
                                    <m:mrow>
                                       <m:mi>k</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                                 <m:msub>
                                    <m:mi>&#948;</m:mi>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mi>k</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWGKbWaaSbaaSqaaiaadMgacaWGQbaabeaakiabg2da9maaqafabaGaamytamaaBaaaleaacaWGRbGaamOAaaqabaGccqaH0oazdaWgaaWcbaGaamyAaiaadUgaaeqaaaqaaiaadUgacqGHGjsUcaWGPbaabeqdcqGHris5aaaa@4382@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Here, <it>&#948;</it><sub><it>ik </it></sub>is the distance between gene <it>i </it>and gene <it>k</it>. In this way, distances between all pairs of genes are transformed into distances between genes and clusters.</p>
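            <p>A minimal sketch of this transformation is given below. It follows the formula as written (a membership-weighted sum over all other genes), and assumes a symmetric gene-by-gene distance matrix and a gene-by-cluster membership matrix stored as nested lists; the function and variable names are hypothetical.</p>

```python
def gene_cluster_distances(delta, membership):
    """Turn pairwise gene distances into gene-to-cluster distances.

    delta[i][g]: distance between gene i and gene g.
    membership[g][j]: membership of gene g in cluster j.
    Returns d with d[i][j] = sum over g != i of membership[g][j] * delta[i][g].
    """
    n = len(delta)
    k = len(membership[0])
    d = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            d[i][j] = sum(
                membership[g][j] * delta[i][g] for g in range(n) if g != i
            )
    return d
```
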
         </sec>
         <sec>
            <st>
               <p>Hybrid distance function</p>
            </st>
            <p>In order to use annotation similarity to determine the contribution of array similarity (as described in Results and discussion), we defined an asymmetric mixture function in which microarray similarity has a significant effect when the annotation similarity is medium to high, but very little effect when the annotation similarity is low (Additional data file 7). The mixture function defines the combined similarity <it>s</it><sub><it>i</it>, <it>k </it></sub>in terms of the spatial similarity <it>s</it>&#8242;<sub><it>i</it>, <it>k </it></sub>and the array similarity <it>s</it>&#8243;<sub><it>i</it>, <it>k </it></sub>(the combined distance <it>&#948;</it><sub><it>i</it>, <it>k </it></sub>is simply 1 - <it>s</it><sub><it>i</it>, <it>k</it></sub>):</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:msup>
                                 <m:mi>s</m:mi>
                                 <m:mo>&#8242;</m:mo>
                              </m:msup>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>+</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:msub>
                              <m:msup>
                                 <m:mi>s</m:mi>
                                 <m:mo>&#8242;</m:mo>
                              </m:msup>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                           <m:msub>
                              <m:msup>
                                 <m:mi>s</m:mi>
                                 <m:mo>&#8242;</m:mo>
                              </m:msup>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:msub>
                              <m:msup>
                                 <m:mi>s</m:mi>
                                 <m:mo>&#8243;</m:mo>
                              </m:msup>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWGZbWaaSbaaSqaaiaadMgacaGGSaGaam4AaaqabaGccqGH9aqpceWGZbGbauaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaakiabgUcaRiaacIcacaaIXaGaeyOeI0Iabm4CayaafaWaaSbaaSqaaiaadMgacaGGSaGaam4AaaqabaGccaGGPaGabm4CayaafaWaaSbaaSqaaiaadMgacaGGSaGaam4AaaqabaGcceWGZbGbayaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaaaaa@4AE9@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
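            <p>The asymmetry of the mixture is easiest to see numerically. The sketch below evaluates the formula above for similarities that are assumed to be already normalized to [0,1]; the function names are illustrative only.</p>

```python
def combined_similarity(s_spatial, s_array):
    """Mixture s = s' + (1 - s') * s' * s'' for normalized similarities."""
    return s_spatial + (1.0 - s_spatial) * s_spatial * s_array


def combined_distance(s_spatial, s_array):
    """Combined distance, defined as 1 minus the combined similarity."""
    return 1.0 - combined_similarity(s_spatial, s_array)
```

            <p>For example, a perfect array similarity raises a spatial similarity of 0.5 to a combined similarity of 0.75, but cannot raise a spatial similarity of 0 at all, because the array term is multiplied by the spatial similarity.</p>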
            <p>The spatial similarity <inline-formula><m:math name="gb-2007-8-7-r145-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>k</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbauaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaaaaa@36EB@</m:annotation></m:semantics></m:math></inline-formula> and the array similarity <inline-formula><m:math name="gb-2007-8-7-r145-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8243;</m:mo></m:msup><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>k</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbayaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaaaaa@36EC@</m:annotation></m:semantics></m:math></inline-formula> were calculated separately, and were normalized to the interval [0,1] before mixing. We normalized raw similarity scores using the absolute median normalization, which is defined as:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>x</m:mi>
                              <m:mi>n</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>x</m:mi>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mi>&#957;</m:mi>
                              </m:mrow>
                              <m:mi>&#963;</m:mi>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWG4bWaaSbaaSqaaiaad6gaaeqaaOGaeyypa0ZaaSaaaeaacaWG4bGaeyOeI0IaamODaaqaaiabeo8aZbaaaaa@3B11@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>&#957; </it>is the median and <it>&#963; </it>is the absolute deviation:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>&#963;</m:mi>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:msubsup>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>N</m:mi>
                              </m:msubsup>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mi>abs</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>&#957;</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mi>N</m:mi>
                                 </m:mfrac>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo>.</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacqaHdpWCcqGH9aqpdaaeWaqaamaalaaabaGaamyyaiaadkgacaWGZbGaaiikaiaadIhadaWgaaWcbaGaamyAaaqabaGccqGHsislcqaH9oGBcaGGPaaabaGaamOtaaaaaSqaaiaadMgacqGH9aqpcaaIXaaabaGaamOBaaqdcqGHris5aOGaaiOlaaaa@461D@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
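            <p>The two formulas above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch covers only the centering and scaling step shown in the formulas; any further rescaling onto the interval [0,1] is not shown because it is not spelled out here.</p>

```python
import statistics


def absolute_median_normalize(values):
    """Center scores on their median and scale by the mean absolute deviation."""
    nu = statistics.median(values)
    # Absolute deviation: mean absolute difference from the median.
    sigma = sum(abs(x - nu) for x in values) / len(values)
    if sigma == 0:
        raise ValueError("all values equal the median; cannot normalize")
    return [(x - nu) / sigma for x in values]


# Hypothetical raw similarity scores.
normalized = absolute_median_normalize([1.0, 2.0, 3.0, 4.0, 10.0])
```
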
            <p>The microarray similarity <inline-formula><m:math name="gb-2007-8-7-r145-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8243;</m:mo></m:msup><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>k</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbayaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaaaaa@36EC@</m:annotation></m:semantics></m:math></inline-formula> was simply the Pearson correlation coefficient <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. As a measure of annotation similarity (<inline-formula><m:math name="gb-2007-8-7-r145-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>k</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbauaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaaaaa@36EB@</m:annotation></m:semantics></m:math></inline-formula>), we have previously used the Jaccard metric <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. This metric implicitly assumes that more terms in common equates to greater similarity of expression. This is not the case for our data, where the terms are related to each other in non-uniform ways, so we designed a new metric that takes into account several important aspects of our annotation data.</p>
            <p>Some stage ranges have only a single associated annotation term (stage range 1-3: maternal), while others have many more (stage range 13-16 has 37 collapsed terms, of which up to 17 are used in a single annotation record). The relative abundance of stage range 13-16 terms would dominate any metric in which each term is given the same weight, so we calculated a stage range-specific similarity independently for each stage range and then produced a weighted sum. Stage ranges 4-6 and 13-16 received higher weights because they coincide with two periods in embryogenesis when <it>de novo </it>transcriptional initiation most frequently occurs: cellularization and organogenesis. Expression in stage ranges 7-8 and 9-10 in most cases represented carry-over from stage range 4-6 and was difficult to score and, therefore, less reliable. The weights used for each stage range were: stage range 1-3 (7%), stage range 4-6 (36%), stage range 7-8 (7%), stage range 9-10 (7%), stage range 11-12 (7%) and stage range 13-16 (36%).</p>
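            <p>The weighting scheme can be written down directly. In the sketch below, the dictionary keys and the function name are illustrative assumptions, and a stage range without a score is assumed to contribute zero.</p>

```python
# Stage-range weights as listed above, expressed as fractions of 1.
STAGE_WEIGHTS = {
    "1-3": 0.07,
    "4-6": 0.36,
    "7-8": 0.07,
    "9-10": 0.07,
    "11-12": 0.07,
    "13-16": 0.36,
}


def weighted_stage_sum(per_stage_scores):
    """Weighted sum of per-stage-range similarity scores.

    per_stage_scores maps a stage-range label to its similarity score;
    missing stage ranges contribute zero (an assumption of this sketch).
    """
    return sum(
        weight * per_stage_scores.get(stage, 0.0)
        for stage, weight in STAGE_WEIGHTS.items()
    )
```
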
            <p>The similarity score for each stage range consisted of two components: a positive 'match bonus' score for the extent to which the two genes had terms in common, and a negative 'mismatch penalty' score for the extent to which the two genes had mismatched terms. The match bonus contributed twice as much as the mismatch penalty to the overall score:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:msup>
                                 <m:mi>s</m:mi>
                                 <m:mo>&#8242;</m:mo>
                              </m:msup>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>r</m:mi>
                                    <m:mo>&#8712;</m:mo>
                                    <m:mi>stages</m:mi>
                                 </m:mrow>
                              </m:munder>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mi>r</m:mi>
                                 </m:msub>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>2</m:mn>
                                       <m:msubsup>
                                          <m:msup>
                                             <m:mi>s</m:mi>
                                             <m:mo>&#8242;</m:mo>
                                          </m:msup>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mi>k</m:mi>
                                          </m:mrow>
                                          <m:mo>+</m:mo>
                                       </m:msubsup>
                                       <m:mo>&#8722;</m:mo>
                                       <m:msubsup>
                                          <m:msup>
                                             <m:mi>s</m:mi>
                                             <m:mo>&#8242;</m:mo>
                                          </m:msup>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mi>k</m:mi>
                                          </m:mrow>
                                          <m:mo>&#8722;</m:mo>
                                       </m:msubsup>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbauaadaWgaaWcbaGaamyAaiaacYcacaWGRbaabeaakiabg2da9maaqafabaGaeq4UdW2aaSbaaSqaaiaadkhaaeqaaOWaaeWaaeaacaaIYaGabm4CayaafaWaa0baaSqaaiaadMgacaGGSaGaam4AaaqaaiabgUcaRaaakiabgkHiTiqadohagaqbamaaDaaaleaacaWGPbGaaiilaiaadUgaaeaacqGHsislaaaakiaawIcacaGLPaaaaSqaaiaadkhacqGHiiIZcaWGZbGaamiDaiaadggacaWGNbGaamyzaiaadohaaeqaniabggHiLdaaaa@51B2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Where <inline-formula><m:math name="gb-2007-8-7-r145-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>k</m:mi></m:mrow><m:mo>+</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbauaadaqhaaWcbaGaamyAaiaacYcacaWGRbaabaGaey4kaScaaaaa@37CE@</m:annotation></m:semantics></m:math></inline-formula> is the match and <inline-formula><m:math name="gb-2007-8-7-r145-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>k</m:mi></m:mrow><m:mo>&#8722;</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaaceWGZbGbauaadaqhaaWcbaGaamyAaiaacYcacaWGRbaabaGaeyOeI0caaaaa@37D9@</m:annotation></m:semantics></m:math></inline-formula> is the mismatch. Match and mismatch scores are defined as follows. Genes not sharing annotation terms at a given stage range receive a match score of 0. Genes sharing any annotation term receive a match bonus equal to 1 plus a rarity factor from 0 to 0.5. The rarity factor is inversely proportional to the abundance (amongst all annotations) of the most rare term that is shared between the two genes. The mismatch penalty is equal to 1 minus a rarity factor from 0 to 0.5. Genes sharing rare terms receive the highest overall similarity scores, and genes mismatched for rare terms receive the lowest. This has the desirable effect of keeping those genes with rare sets of terms in common more tightly clustered.</p>
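            <p>A rough sketch of the per-stage-range score is shown below. Several details are assumptions made only for illustration: rarity factors (between 0 and 0.5) are supplied by the caller as a term-to-factor mapping rather than derived from term abundance, a pair of genes with no mismatched terms is given a mismatch penalty of 0, and the rarity factor for the mismatch penalty is taken from the rarest mismatched term. The function name is hypothetical.</p>

```python
def stage_score(terms_i, terms_k, rarity):
    """Score 2 * match - mismatch for two genes at one stage range.

    terms_i, terms_k: sets of annotation terms for the two genes.
    rarity: maps each term to a rarity factor in [0, 0.5]; larger is rarer.
    """
    shared = terms_i.intersection(terms_k)
    mismatched = terms_i.symmetric_difference(terms_k)

    match = 0.0
    if shared:
        # Match bonus: 1 plus the rarity factor of the rarest shared term.
        match = 1.0 + max(rarity[t] for t in shared)

    mismatch = 0.0
    if mismatched:
        # Assumption: use the rarity factor of the rarest mismatched term.
        mismatch = 1.0 - max(rarity[t] for t in mismatched)

    return 2.0 * match - mismatch


# Hypothetical rarity factors for three annotation terms.
example_rarity = {"amnioserosa": 0.5, "midgut": 0.2, "ubiquitous": 0.0}
```

            <p>Under these assumptions, two genes sharing only the rare term score 3.0, while two genes with no terms in common receive a negative score.</p>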
         </sec>
         <sec>
            <st>
               <p>Broad versus restricted cluster designation</p>
            </st>
            <p>The classification of clusters as broad or restricted was made automatically based on the average annotation terms of the genes assigned to the cluster. We classified a cluster as broad if any of the following criteria were met: over 30% of the genes in the cluster were annotated as ubiquitous at some stage from 7 to 16; the cluster had 75% maternal genes and no restricted annotation terms in two-thirds of the genes; or the cluster contained genes with only maternal and late midgut staining (see Results and discussion). All other clusters were classified as restricted.</p>
         </sec>
         <sec>
            <st>
               <p>Assignment of genes to multiple clusters</p>
            </st>
            <p>As described in the text, we did not attain a fuzzy c-means clustering result where clusters had sharp boundaries. Therefore, the raw membership matrix <it>M </it>often assigned a gene high membership scores in neighboring, highly related clusters. To mitigate this and assign multiple memberships in a meaningful way, we performed an exhaustive analysis of cluster similarities to assign each gene to a set of significantly unrelated clusters. First, the raw membership matrix was transformed to a binary matrix <it>M</it>* using a threshold of <it>m</it><sub>min</sub>. Next, a distance score &#916;<sub><it>i</it>, <it>j </it></sub>between each pair of clusters <it>i </it>and <it>j </it>was determined from <it>M</it>* as 1 minus the number of genes in common divided by the number of genes in the smaller of the two clusters:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-7-r145-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>&#916;</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>g</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>N</m:mi>
                                             <m:mrow>
                                                <m:mi>g</m:mi>
                                                <m:mi>e</m:mi>
                                                <m:mi>n</m:mi>
                                                <m:mi>e</m:mi>
                                                <m:mi>s</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                       </m:mrow>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:msubsup>
                                                <m:mi>M</m:mi>
                                                <m:mrow>
                                                   <m:mi>g</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>i</m:mi>
                                                </m:mrow>
                                                <m:mo>*</m:mo>
                                             </m:msubsup>
                                             <m:msubsup>
                                                <m:mi>M</m:mi>
                                                <m:mrow>
                                                   <m:mi>g</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>j</m:mi>
                                                </m:mrow>
                                                <m:mo>*</m:mo>
                                             </m:msubsup>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>min</m:mi>
                                 <m:mo>&#8289;</m:mo>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mstyle displaystyle="true">
                                          <m:msubsup>
                                             <m:mo>&#8721;</m:mo>
                                             <m:mrow>
                                                <m:mi>g</m:mi>
                                                <m:mo>=</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:msub>
                                                   <m:mi>N</m:mi>
                                                   <m:mrow>
                                                      <m:mi>genes</m:mi>
                                       