<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2008-9-2-r36</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Method</dochead>
      <bibl>
         <title>
            <p>Genome wide prediction of HNF4&#945; functional binding sites by the use of local and global sequence context</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Kel</snm>
               <mi>E</mi>
               <fnm>Alexander</fnm>
               <insr iid="I1"/>
               <email>ake@biobase.de</email>
            </au>
            <au id="A2">
               <snm>Niehof</snm>
               <fnm>Monika</fnm>
               <insr iid="I2"/>
               <email>niehof@item.fraunhoferd.de</email>
            </au>
            <au id="A3">
               <snm>Matys</snm>
               <fnm>Volker</fnm>
               <insr iid="I1"/>
               <email>vma@biobase.de</email>
            </au>
            <au id="A4">
               <snm>Zemlin</snm>
               <fnm>R&#252;diger</fnm>
               <insr iid="I2"/>
               <email>zemlin@item.fraunhofer.de</email>
            </au>
            <au id="A5" ca="yes">
               <snm>Borlak</snm>
               <fnm>J&#252;rgen</fnm>
               <insr iid="I2"/>
               <email>borlak@item.fraunhofer.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>BIOBASE GmbH, Halchtersche Str., 38304 Wolfenb&#252;ttel, Germany</p>
            </ins>
            <ins id="I2">
               <p>Fraunhofer Institute of Toxicology and Experimental Medicine, Center for Drug Research and Medical Biotechnology, Nikolai-Fuchs-Str., 30625 Hannover, Germany</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>2</issue>
         <fpage>R36</fpage>
         <url>http://genomebiology.com/2008/9/2/R36</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18291023</pubid>
               <pubid idtype="doi">10.1186/gb-2008-9-2-r36</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>19</day>
               <month>7</month>
               <year>2007</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>9</day>
               <month>11</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>21</day>
               <month>2</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>21</day>
               <month>02</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Kel et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>Prediction of transcription factor binding sites</p>
      </shorttitle>
      <shortabs>
         <p>An application of machine learning algorithms enables prediction of the functional context of transcription factor binding sites in the human genome.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>We report an application of machine learning algorithms that enables prediction of the functional context of transcription factor binding sites in the human genome. We demonstrate that our method allowed <it>de novo </it>identification of hepatocyte nuclear factor (HNF)4&#945; binding sites and significantly improved the overall recognition of faithful HNF4&#945; targets. When the method was applied to published findings, an unprecedentedly high number of false positives was identified. The technique can be applied to any transcription factor.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010016">Molecular biology</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010013">Methods</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Regulation of gene expression is accomplished through binding of transcription factors (TFs) to distinct regions of DNA (TF binding sites (TFBSs)), and, after anchoring at these sites, transmission of the regulatory signal to the basal transcription complex. Indeed, regions around TFBSs can be interrogated with regard to binding and interaction with other TFs (so-called composite modules) as well as local sequence properties that favor recruitment of TFs, bending and looping of DNA and nucleosome positioning. Some of these TFs are specific for a particular tissue, a definite stage of development, or a given extracellular signal, but most TFs are involved in gene regulation under a rather wide spectrum of cellular conditions. It is clear by now that combinations of TFs rather than single factors drive gene transcription and define its specificity. Dynamic and function-specific complexes of many different TFs, so-called enhanceosomes <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, are formed at gene promoters and enhancers to drive gene expression in a specific manner. At the DNA level, the blueprints for assembling such variable TF complexes on promoter regions may be seen as specific combinations of TFBSs located in close proximity to each other. They are termed 'composite modules' (CMs) or 'composite regulatory modules' <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> or <it>cis</it>-regulatory modules <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. There may be several different types of CMs located in the regulatory region of one gene, which may be distant from each other (for example, liver- and muscle-specific enhancers of one gene) or overlapping. Taking this into account, it becomes more and more evident that the 'local sequence context' in the vicinity of the TFBS, as well as the 'global context' of the whole promoter/enhancer where the TF site is located, influences binding and functioning of the corresponding TF. 
Numerous examples of so-called composite regulatory elements have been reported (see the TRANSCompel database <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>), in which TF binding and proper functioning of a site is strongly dependent on other sites located in close vicinity (adjacent or even overlapping sites) or quite distant from each other (up to 100 or more nucleotides). For instance, for the TF family of nuclear receptors (to which hepatocyte nuclear factor (HNF)4 factors belong) there is experimental evidence showing clear dependence between functioning of HNF4 factors at their cognate sites and binding of other factors to the neighboring sites, both synergistically and antagonistically <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. There is a need to develop computational models to predict TFBSs that are functional and are involved in the control of gene transcription. Recent developments in the field of machine learning techniques allow us to apply them to build highly sensitive and specific methods for predicting functional TFBSs in human and other genomes.</p>
         <p>Because of our continued interest in the regulation of liver-enriched TFs <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, we were particularly interested in identifying novel genes regulated by the hepatocyte nuclear factor (HNF)4&#945;. Indeed, HNF4&#945; is a versatile TF, and several investigators have reported the identification of genes targeted by HNF4&#945;. These studies used various experimental approaches, including transient transfection of HNF4&#945; into a human hepatoma cell line, a rat insulinoma cell line, and a human kidney cell line <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Additionally, findings with conditional knock-outs of HNF4&#945; <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> were recently reported. Notably, in the study of Odom <it>et al</it>. the genome-wide identification of binding sites for TFs HNF4&#945;, HNF1&#945;, and HNF6 was reported by use of the ChIP-chip assay with a microarray containing 13,000 human promoter sequences <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Strikingly, in the case of HNF4&#945;, the number of contacted promoters was unexpectedly high; 1,575 potential HNF4&#945; target genes were identified. In addition, 42% of the genes occupied by RNA polymerase II were also occupied by HNF4&#945;, suggesting that nearly 50% of all liver-expressed genes are regulated by HNF4&#945; alone. Similarly, in another recent ChIP-chip experiment of ENCODE (Encyclopedia of DNA Elements) genomic regions (about 1% of the human genome), 663 novel HNF4&#945; binding sites were identified in 100 genes <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, which suggests there are a large number of HNF4&#945; targets (over 60,000 sites in the vicinity of about 10,000 genes) if extrapolated to the entire genome. 
This unprecedentedly high number of HNF4&#945; binding sites revealed by the ChIP-chip method raises the question of the functional role of all these sites in the regulation of gene transcription.</p>
         <p>Indeed, the ChIP-chip assay is a much-wanted and highly advanced method for the genome-wide search and identification of TFBSs. Nonetheless, it suffers from an unacceptably high rate of false positive findings. In the study of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, 252 (16%) false positive binding sites were predicted by the authors. Another problem with this method is that a surprisingly small fraction of identified ChIP fragments possesses the canonical binding motif for the corresponding TF <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. This limitation needs to be overcome, and it is highly desirable to identify functional binding sites relevant for the regulation of gene transcription. Furthermore, in the current studies there is often no rationale for the selection of promoters spotted on the array; for example, no bioinformatic approach is applied to identify relevant sequences for the design of the ChIP-chip assay.</p>
         <p>Here, we report a computational approach based on a novel machine learning technique, which enabled the genome-wide identification of TFBSs. This method was applied to search for HNF4&#945; gene targets. A genetic algorithm and an exhaustive feature selection algorithm were trained on 73 known and well characterized HNF4&#945; target sequences in promoters and enhancers of different mammalian genes (Additional data file 1). By genome-wide scanning of all human gene promoters we identified novel genes targeted by HNF4&#945;. Then, a subset of predicted binding sites was confirmed by electromobility shift assay. We further interrogated promoter sequences for HNF4&#945; binding sites identified by the ChIP-chip assay. We also analyzed expression of genes targeted by HNF4&#945; and observed a good correlation between computationally annotated HNF4&#945; binding sites and expression of targeted genes. Notably, ChIP-chip experiments tend to report a rather high number of TFBSs in promoters of genes whose regulation by HNF4&#945; is not observed, whereas our computational method for the prediction of HNF4 regulatory sites, which encompasses rules for the regulation of gene expression, enabled improved specificity.</p>
         <p>Overall, we demonstrate the power of our computational approach in identifying novel genes targeted by HNF4&#945;. Our machine learning technique significantly improved the overall recognition and, therefore, the identification of faithful HNF4&#945; targets. This method enabled refinement of TF site predictions based on the ChIP-chip assay and identification of potentially functional sites among them, as reported here. Furthermore, our method can easily be applied to the genome-wide identification of genes targeted by any mammalian TF and is not limited to promoter sequences alone, with an overall success of approximately 80% based on experimental confirmation.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Repeats in HNF4 binding sites</p>
            </st>
            <p>It is generally accepted that HNF4 regulates gene expression by binding to direct repeat motifs of the RG(G/T)TCA sequence separated by one nucleotide (direct repeat (DR)1) <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. We used two 'half-site' positional weight matrices (PWMs) taken from the TRANSFAC<sup>&#174; </sup>database (see Materials and methods) in order to identify such repeats in the sequences containing known binding sites of HNF4 (based on TRANSFAC<sup>&#174; </sup>annotation; Additional data file 1). We found that although the DR1 repeat structure is clearly seen in the general consensus and in the full positional weight matrix, actual genomic sites can often be characterized by more complicated structures. The results are presented in Figure <figr fid="F1">1</figr>. As can be seen, the current common point of view that DR1 is the only characteristic repeat for HNF4 binding sites is not accurate. We can identify repeats at various distances and with various orientations different from the canonical DR1 structure in the sequences experimentally known as true HNF4 binding sites. This fairly unbiased analysis of the internal repeat structure of known HNF4&#945; binding sites confirms earlier observations that HNF4&#945; sometimes binds to elements other than DR1.</p>
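            <p>The repeat nomenclature used above can be illustrated with a minimal sketch. It is not the procedure used in this study: the analysis relied on half-site PWMs from TRANSFAC, whereas the sketch assumes exact matching of the consensus RG(G/T)TCA (written as RGKTCA in IUPAC code) on both strands. The function names and the example sequence are hypothetical. Two hits on the same strand are labelled as a direct repeat (DR), a forward hit followed by a reverse hit as an inverted repeat (IR), and a reverse hit followed by a forward hit as an everted repeat (ER); the number gives the spacer length.</p>

```python
import re

# IUPAC codes needed for the half-site consensus RG(G/T)TCA, i.e. RGKTCA.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "K": "[GT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
HALF_SITE = "RGKTCA"  # assumption: exact consensus match instead of a PWM


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_half_sites(seq):
    """Return sorted (start, strand) hits of the consensus on both strands."""
    body = "".join(IUPAC[c] for c in HALF_SITE)
    # A lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + body + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    for m in pattern.finditer(rc):
        # Convert the position on the reverse strand to forward coordinates.
        hits.append((len(seq) - m.start() - len(HALF_SITE), "-"))
    return sorted(hits)


def classify_repeats(seq, max_spacer=4):
    """List (type, start1, start2) for half-site pairs with spacer 0..max_spacer."""
    hits = find_half_sites(seq)
    k = len(HALF_SITE)
    repeats = []
    for i, (start1, strand1) in enumerate(hits):
        for start2, strand2 in hits[i + 1:]:
            spacer = start2 - (start1 + k)
            if spacer in range(max_spacer + 1):
                if strand1 == strand2:
                    kind = "DR"   # both half-sites point the same way
                elif strand1 == "+":
                    kind = "IR"   # forward then reverse (palindromic)
                else:
                    kind = "ER"   # reverse then forward
                repeats.append((kind + str(spacer), start1, start2))
    return repeats


# Hypothetical example: two AGGTCA half-sites separated by one nucleotide.
print(classify_repeats("AGGTCAAAGGTCA"))
```

            <p>Under these assumptions the example sequence is reported as a single DR1 element. A PWM-based scan, as used in the study, would additionally tolerate mismatches to the consensus and assign a score to each repeat.</p>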
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Repeats in the structure of HNF4 binding sites (from TRANSFAC<sup>&#174;</sup>)</p>
               </caption>
               <text>
                  <p>Repeats in the structure of HNF4 binding sites (from TRANSFAC<sup>&#174;</sup>). <b>(a) </b>Examples of multiple repeats forming canonical DR1 as well as DR2, inverted (IR) and 'everted' (ER) repeats. The centrally located black arrows, marked as DR1 or DR2, indicate the repeat with the maximal score (sum of the scores of single elements) as compared to the gray arrows representing multiple repeats. <b>(b) </b>Statistics of repeats of different types (direct repeats, DR0-4; everted repeats, ER0-4; inverted repeats, IR0-4) in the structure of HNF4 sites. Black bars show the observed number of repeats found in the structure of 73 sequences of known HNF4 binding sites (listed in Additional data file 1) considering one repeat with the maximal score per sequence. Gray bars show the total number of repeats found in this set of HNF4 sites.</p>
               </text>
               <graphic file="gb-2008-9-2-r36-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Molecular organization of the local context of genomic HNF4&#945; binding sites</p>
            </st>
            <p>We applied the 'local context' machine learning technique to the set of known HNF4&#945; binding sites in order to reveal properties of the DNA context in close proximity to the functional HNF4&#945; binding sites. We analyzed frequencies of short oligonucleotides of length 4, as well as the frequency of short repeating motifs of lengths 2 and 4. The binding sites for HNF4&#945; are characterized by various repeat structures (Figure <figr fid="F1">1a</figr>). From the analysis of the distribution of half-site motifs above, we can see that short additional degenerate motifs resembling parts of the consensus repeat occur in the vicinity of the core of the site. Based on these results we decided to perform a thorough contextual analysis of DNA sequences containing HNF4&#945; binding sites. The analysis was done by applying the algorithm in an exhaustive search through the space of all possible short oligonucleotides and repeats in various regions of the sites and their flanks. In addition, we searched for non-redundant sets of contextual features as reported previously <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. Table <tblr tid="T1">1</tblr> presents the results of this analysis. We selected a combination of four oligonucleotides, six dinucleotide pairs and six four-nucleotide repeats that are overrepresented or underrepresented in the sequences of genomic HNF4&#945; binding sites and compared the results to background sequences. A linear combination of these local contextual features gives rise to the score of context (<it>d </it>in equation 1; see Materials and methods). Figure <figr fid="F2">2</figr> depicts two distributions of the score of context that we obtained on a test set of HNF4&#945; recognition sites (the test and training sets are defined in Additional data file 1) and the test background set. Splitting the set of sites into the training and test subsets was done by random selection. 
Note that the sites from the test set were not used in the training phase of the algorithm. As shown in Figure <figr fid="F2">2</figr>, we clearly discriminate real HNF4&#945; sites from false positives in the background. In our further analysis, we used the score of context with a cut-off value of 0.55, which minimizes the sum of false negative error (the proportion of unrecognized real sites to the total number of HNF4&#945; sites in the test set) and false positive error (the proportion of false recognition of the background sequences as true sites to the total number of tested sequences in the test background set).</p>
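            <p>The cut-off criterion described above can be sketched as follows. This is an illustration only, not the code used in this study: the function names and the toy scores are hypothetical, and a sequence is assumed to be called a site when its score of context is greater than or equal to the cut-off. The sketch evaluates every observed score as a candidate cut-off and returns the one that minimizes the sum of the false negative error (fraction of real sites scoring below the cut-off) and the false positive error (fraction of background sequences scoring at or above it).</p>

```python
def error_sum(cutoff, site_scores, background_scores):
    """Sum of false negative and false positive error for one cut-off."""
    # False negatives: real sites whose score falls below the cut-off.
    fn = sum(1 for s in site_scores if cutoff > s) / len(site_scores)
    # False positives: background sequences accepted as sites.
    fp = sum(1 for s in background_scores if s >= cutoff) / len(background_scores)
    return fn + fp


def best_cutoff(site_scores, background_scores):
    """Return the observed score that minimizes the summed error."""
    candidates = sorted(set(site_scores) | set(background_scores))
    return min(candidates,
               key=lambda c: error_sum(c, site_scores, background_scores))


# Hypothetical toy scores; they are NOT taken from the study.
test_sites = [0.90, 0.80, 0.70, 0.60, 0.58]
test_background = [0.10, 0.20, 0.30, 0.50, 0.65]

cutoff = best_cutoff(test_sites, test_background)
print(cutoff, error_sum(cutoff, test_sites, test_background))
```

            <p>With these toy scores the selected cut-off is 0.58, at which no site is missed and one of five background sequences is accepted. The value of 0.55 used in the study results from the same kind of minimization on the real test sets.</p>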
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Oligonucleotides and short repeats found in the local context of genomic HNF4&#945; sites</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>Oligonucleotide/repeat</p>
                     </c>
                     <c ca="center">
                        <p>Mode</p>
                     </c>
                     <c ca="center">
                        <p>Wind_from</p>
                     </c>
                     <c ca="center">
                        <p>Wind_to</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>r</it>
                           <sub>
                              <it>min</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>r</it>
                           <sub>
                              <it>max</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>Alpha</p>
                     </c>
                     <c ca="center">
                        <p>Avrfreq_Y</p>
                     </c>
                     <c ca="center">
                        <p>Avrfreq_N</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MDDR</p>
                     </c>
                     <c ca="center">
                        <p>(I)</p>
                     </c>
                     <c ca="center">
                        <p>22</p>
                     </c>
                     <c ca="center">
                        <p>66</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0.003082</p>
                     </c>
                     <c ca="center">
                        <p>13.433 (3.665)</p>
                     </c>
                     <c ca="center">
                        <p>11.051 (4.782)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ANGB</p>
                     </c>
                     <c ca="center">
                        <p>(I)</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0.016132</p>
                     </c>
                     <c ca="center">
                        <p>5.358 (2.529)</p>
                     </c>
                     <c ca="center">
                        <p>3.582 (2.845)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CDDM</p>
                     </c>
                     <c ca="center">
                        <p>(I)</p>
                     </c>
                     <c ca="center">
                        <p>36</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0.020372</p>
                     </c>
                     <c ca="center">
                        <p>4.346 (2.332)</p>
                     </c>
                     <c ca="center">
                        <p>3.212 (1.732)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AV-VS</p>
                     </c>
                     <c ca="center">
                        <p>(II)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="center">
                        <p>33</p>
                     </c>
                     <c ca="center">
                        <p>33</p>
                     </c>
                     <c ca="center">
                        <p>0.008246</p>
                     </c>
                     <c ca="center">
                        <p>6.694 (3.663)</p>
                     </c>
                     <c ca="center">
                        <p>4.893 (3.088)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MD-DB</p>
                     </c>
                     <c ca="center">
                        <p>(II)</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>70</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>-0.003212</p>
                     </c>
                     <c ca="center">
                        <p>16.941 (3.078)</p>
                     </c>
                     <c ca="center">
                        <p>14.711 (3.344)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>BR-NT</p>
                     </c>
                     <c ca="center">
                        <p>(II)</p>
                     </c>
                     <c ca="center">
                        <p>33</p>
                     </c>
                     <c ca="center">
                        <p>37</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>18</p>
                     </c>
                     <c ca="center">
                        <p>-0.003942</p>
                     </c>
                     <c ca="center">
                        <p>4.802 (5.460)</p>
                     </c>
                     <c ca="center">
                        <p>7.702 (5.144)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>VS-YA</p>
                     </c>
                     <c ca="center">
                        <p>(II)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>0.0103</p>
                     </c>
                     <c ca="center">
                        <p>4.237 (1.742)</p>
                     </c>
                     <c ca="center">
                        <p>3.121 (2.640)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>VB-HA</p>
                     </c>
                     <c ca="center">
                        <p>(II)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0.008647</p>
                     </c>
                     <c ca="center">
                        <p>9.028 (2.985)</p>
                     </c>
                     <c ca="center">
                        <p>6.517 (3.741)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>HM-GN</p>
                     </c>
                     <c ca="center">
                        <p>(II)</p>
                     </c>
                     <c ca="center">
                        <p>40</p>
                     </c>
                     <c ca="center">
                        <p>50</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.006468</p>
                     </c>
                     <c ca="center">
                        <p>7.783 (4.335)</p>
                     </c>
                     <c ca="center">
                        <p>4.672 (3.764)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>(RBNH)<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>(III)</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>0.030376</p>
                     </c>
                     <c ca="center">
                        <p>7.259 (1.778)</p>
                     </c>
                     <c ca="center">
                        <p>5.961 (2.168)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>(MVKN)<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>(III)</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>0.015979</p>
                     </c>
                     <c ca="center">
                        <p>3.123 (1.413)</p>
                     </c>
                     <c ca="center">
                        <p>2.388 (1.155)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>(BNDK)<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>(III)</p>
                     </c>
                     <c ca="center">
                        <p>32</p>
                     </c>
                     <c ca="center">
                        <p>32</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>-0.002214</p>
                     </c>
                     <c ca="center">
                        <p>0.000 (0.000)</p>
                     </c>
                     <c ca="center">
                        <p>14.343 (28.652)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>(DNCD)<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>(III)</p>
                     </c>
                     <c ca="center">
                        <p>28</p>
                     </c>
                     <c ca="center">
                        <p>42</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0.068635</p>
                     </c>
                     <c ca="center">
                        <p>4.176 (2.797)</p>
                     </c>
                     <c ca="center">
                        <p>1.051 (2.196)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>(NBHV)<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>(III)</p>
                     </c>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>-0.001045</p>
                     </c>
                     <c ca="center">
                        <p>0.000 (0.000)</p>
                     </c>
                     <c ca="center">
                        <p>13.626 (28.102)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>(NVYB)<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>(III)</p>
                     </c>
                     <c ca="center">
                        <p>29</p>
                     </c>
                     <c ca="center">
                        <p>29</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>-0.001696</p>
                     </c>
                     <c ca="center">
                        <p>0.000 (0.100000)</p>
                     </c>
                     <c ca="center">
                        <p>12.909 (27.523)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Mode: (I), search for oligonucleotides; (II), dinucleotide pairs; (III), four-nucleotide repeats. Wind_from and Wind_to, boundaries of the sequence window in which the motif was found (the HNF4&#945; site is located between positions 29 and 42; the left and right flanks are 28 bp and 33 bp long, respectively). <it>r</it><sub><it>min </it></sub>and <it>r</it><sub><it>max</it></sub>, distances between dinucleotide pairs and repeats. Alpha, coefficients in the linear function. Avrfreq_Y, the average frequency of the motif in the corresponding window among sequences of the real sites. Avrfreq_N, the corresponding average frequency among the background sequences. The standard error is given in parentheses.</p>
               </tblfn>
            </tbl>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Two histograms showing the distributions of the score of context in the -28 bp/+33 bp flanks of real HNF4&#945; sites (gray bars, see also test set; Additional data file 1) versus the -28 bp/+33 bp flanks of PWM matches (PWM score >0.8) in random genomic positions (white bars)</p>
               </caption>
               <text>
                  <p>Two histograms showing the distributions of the score of context in the -28 bp/+33 bp flanks of real HNF4&#945; sites (gray bars, see also test set; Additional data file 1) versus the -28 bp/+33 bp flanks of PWM matches (PWM score >0.8) in random genomic positions (white bars). Mean values of the two distributions are 0.5849 and 0.348, respectively. The x-axis gives the score of context; the left y-axis gives the number of PWM matches in random genomic positions with the corresponding score of context; and the right y-axis gives the number of real HNF4&#945; sites with the corresponding score of context.</p>
               </text>
               <graphic file="gb-2008-9-2-r36-2"/>
            </fig>
            <p>Among the selected contextual features, some, such as the motifs ANGB and MDDR, match different parts of the HNF4 consensus sequence and appear to be overrepresented in a rather wide area around the center of the binding sites (Table <tblr tid="T1">1</tblr>). The motif CDDM is overrepresented in a small area corresponding to the central positions of the sites. Of particular interest are the 'negative' features, such as repeats of the motif BNDK, which are positioned at the beginning and the end of the HNF4 consensus, and the repeats NBHV and NVYB, which have one part of the repeat at the left edge of the consensus and the second part at its center. Such 'negative' features represent nucleotide combinations that are rarely or never observed at functional binding sites but do occur in the background sequences. Importantly, the background sequences were generated so as to match the HNF4 PWM, yet they still differ from the real sites in contextual features that the local context approach can detect. The local context approach can therefore capture contextual rules that conventional PWMs cannot identify, since these rules distinguish real sites from false positive hits of the matrix.</p>
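            <p>The degenerate motifs above are written in the standard IUPAC nucleotide code (for example, N is any base, B is C, G or T, and D is A, G or T). The following minimal Python sketch shows how occurrences of such a motif could be counted inside a sequence window. It is an illustration only and not the authors' implementation: the function names, the 1-based inclusive window convention and the counting of overlapping occurrences are assumptions made here.</p>

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a degenerate motif such as 'CDDM' into a compiled regex."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead lets overlapping occurrences be counted separately.
    return re.compile("(?=" + body + ")")


def count_in_window(sequence, motif, wind_from, wind_to):
    """Count motif occurrences lying entirely inside the window.

    Assumption: wind_from and wind_to are 1-based and inclusive.
    """
    window = sequence.upper()[wind_from - 1:wind_to]
    return sum(1 for _ in motif_to_regex(motif).finditer(window))


# Toy sequence, not taken from the data set.
print(count_in_window("ACGTACGT", "ANGB", 1, 8))
```

            <p>Averaging such counts over the real sites and over the background sequences would give quantities analogous to Avrfreq_Y and Avrfreq_N in Table <tblr tid="T1">1</tblr>, although the exact normalization used there is not reproduced in this sketch.</p>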
            <p>To validate the contextual features found in our analysis, we ran the algorithm three times using different samples of 100 background sequences generated in the same way as the first sample. As expected (see Materials and methods), the resulting set of identified contextual features was different each time (data not shown), whereas the oligonucleotides ANGB and CDDM, as well as the repeat (RBNH)<sup>2</sup>, were identified in all tested cases (although with slightly different 'from' and 'to' parameters of the sequence window). The overall discrimination of the test distributions using the obtained sets of contextual features was practically the same as that obtained in the first run, shown in Figure <figr fid="F2">2</figr>. Therefore, in all further analyses we used the set of features obtained in the first run.</p>
         </sec>
         <sec>
            <st>
               <p>Molecular organization of the global context of HNF4&#945; binding sites</p>
            </st>
            <p>To study the global context, we retrieved flanking sequences of length &#177;500 bp around known HNF4 binding sites (Additional data file 1) and placed them in the Y<sub>Global </sub>set. The background set (N<sub>Global</sub>) was constructed from randomly chosen intergenic fragments of DNA from various human chromosomes, applying the same strategy as for N<sub>Local</sub> (described in Materials and methods) but with 500 bp flanks around the assumed false positive match of the HNF4 matrix (we chose at random 642 sequence fragments scattered through intergenic regions on all human chromosomes).</p>
            <p>We analyzed these sets using the Composite Module Analyst (CMA) program (see Materials and methods), which allowed us to study combinations of TFBSs in the interrogated sequences. The input for CMA is a set of DNA sequences under study (the foreground set; here, the sequences around functional HNF4 sites) and a set of background sequences. By comparing the two sequence sets, CMA uses an iterative genetic algorithm to identify a specific combination of TF matrices (PWMs) that is common to the foreground sequences and distinguishes them from the background sequences <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. The results are given in Table <tblr tid="T2">2</tblr>. The CMA algorithm identified six single TF matrices and eight pairs of TF matrices characterized by variable distances between the sites in each pair (the maximal distance <it>d</it><sub><it>max </it></sub>is 100, 200 or 500 bp). Figure <figr fid="F3">3</figr> compares the distributions of the CM score in the two sets: the Y<sub>Global </sub>set (HNF4&#945; sites &#177;500 bp; gray bars) and the N<sub>Global </sub>set (genome PWM matches &#177;500 bp in random genomic positions; white bars). The two sets are clearly discriminated. The average CM score for real HNF4 sites equals 0.499, whereas for random genome PWM matches it equals 0.050 (ratio = 9.98, <it>t</it>-test <it>p</it>-value = 1.4896 &#215; 10<sup>-26</sup>).</p>
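            <p>To make the role of the cut-off and distance parameters of a matrix pair concrete, the following Python sketch checks whether two sets of PWM hits contain at least one pair of sites whose separation lies between <it>d</it><sub><it>min </it></sub>and <it>d</it><sub><it>max</it></sub>. This is a hypothetical illustration, not the CMA implementation: the function names are invented here, the distance is assumed to be measured between hit start positions, and the cut-off is assumed to be inclusive. The hit positions and scores in the example are made up; only the cut-offs and distances are those listed for the V$HNF4_Q6_01/V$EFC_Q6 pair.</p>

```python
def filter_hits(scored_hits, cutoff):
    """Keep the positions of hits whose matrix score reaches the cut-off.

    scored_hits is a list of (position_bp, score) tuples.
    """
    return [pos for pos, score in scored_hits if score >= cutoff]


def pair_present(hits_1, hits_2, d_min, d_max):
    """Return True if some hit of matrix 1 lies d_min..d_max bp from a hit of matrix 2.

    Assumption: distance is the absolute difference of hit start positions.
    """
    return any(
        d_max >= abs(p1 - p2) >= d_min
        for p1 in hits_1
        for p2 in hits_2
    )


# Invented hit positions and scores; cut-offs 0.8325 and 0.6825 with
# d_min = 8 bp and d_max = 100 bp as for the V$HNF4_Q6_01 / V$EFC_Q6 pair.
hnf4_hits = filter_hits([(120, 0.91), (400, 0.79)], 0.8325)
efc_hits = filter_hits([(180, 0.70), (350, 0.60)], 0.6825)
print(pair_present(hnf4_hits, efc_hits, 8, 100))
```

            <p>How such pair occurrences are weighted and combined into the CM score is defined by equation 2 in Materials and methods and is not reproduced in this sketch.</p>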
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Matrices and matrix pairs of the global context selected by the CMA program</p>
               </caption>
               <tblbdy cols="10">
                  <r>
                     <c ca="left">
                        <p>Matrix_ID(1)</p>
                     </c>
                     <c ca="left">
                        <p>TFs(1)</p>
                     </c>
                     <c ca="left">
                        <p>Cut-off(1)</p>
                     </c>
                     <c ca="left">
                        <p>Matrix_ID(2)</p>
                     </c>
                     <c ca="left">
                        <p>TFs(2)</p>
                     </c>
                     <c ca="left">
                        <p>Cut-off(2)</p>
                     </c>
                     <c ca="left">
                        <p><it>d</it><sub><it>min </it></sub>(bp)</p>
                     </c>
                     <c ca="left">
                        <p><it>d</it><sub><it>max </it></sub>(bp)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>&#954;</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>&#966;</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$MAZ_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>MAZ</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.89</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.020763</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$ER_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>ER-&#945;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.913</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.047177</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$HEB_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HTF4</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.969</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.078905</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$HNF4_Q6_01*</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HNF4&#945;, HNF4&#945;2, HNF4&#947;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.976</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.210340</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$HEN1_02</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HEN1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.854</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.099368</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$CREB_Q2*</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>CRE-BP2, CREM, ATF-1,2,3,4,6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.888</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.086618</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$HNF4_Q6_01*</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HNF4&#945;, HNF4&#945;2, HNF4&#947;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8325</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$EFC_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>RFX1 (EF-C)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.6825</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.043344</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$COUP_01</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>COUP-TF1, COUP-TF2</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8005</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$KROX_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Egr-1,2,3,4</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8315</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.053285</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$PEBP_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>PEBP2&#945;/AML1,3; PEBP2&#946;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.84</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$TEL2_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Tel-2a,b,c</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.878</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.214469</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$ELK1_01</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Elk-1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.785</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$WHN_B</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>FOXN1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.948</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.111909</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$CMYB_01</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>c-Myb, B-Myb</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.86</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$KROX_Q6</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Egr-1,2,3,4</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.841</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.100922</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$FOXO1_02*</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>FOXO1,2,4, FOXJ3</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8715</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$FXR_Q3</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>FXR&#945;/RXR</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8135</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.100184</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$HNF4_Q6_01*</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HNF4&#945;, HNF4&#945;2, HNF4&#947;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8065</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$HNF4_01</b>
                           <sup>(*)</sup>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HNF4&#945;, HNF4&#945;2, HNF4&#947;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8705</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>200</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.080381</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>V$XBP1_01</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>XBP-1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8845</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>V$FOXO1_02</b>
                           <sup>(*)</sup>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>FOXO1,2,4, FOXJ3</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.8715</p>
                     </c>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>200</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>0.112402</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Matrix_ID(1) and Matrix_ID(2) are the TRANSFAC<sup>&#174; </sup>identifiers of the selected single matrix (or the first matrix in a pair and the second matrix in the pair, respectively). The other headings of the table correspond to the parameters of the composite module score (see equation 2 in Materials and methods). The first six lines of the table represent the single matrices selected by the algorithm to represent the global context; the other lines represent the pairs of matrices selected by the CMA program. The corresponding values of the parameters (in the Cut-off, <it>&#954; </it>and <it>&#966; </it>columns) are optimized by the CMA algorithm. *Matrices corresponding to TFs whose binding site combinatorial co-occupancy was found in Odom <it>et al</it>. [45] for promoters of liver-expressed genes. Bold text indicates TFs identified by the CMA.</p>
               </tblfn>
            </tbl>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Two histograms showing the distributions of the CM score in the &#177;500 bp flanks of HNF4&#945; sites (gray bars) versus &#177;500 bp flanks of PWM matches (PWM score >0.8) in random genomic positions (white bars)</p>
               </caption>
               <text>
                  <p>Two histograms showing the distributions of the CM score in the &#177;500 bp flanks of HNF4&#945; sites (gray bars) versus &#177;500 bp flanks of PWM matches (PWM score >0.8) in random genomic positions (white bars). The average CM score for real HNF4&#945; sites is 0.499, whereas for PWM matches in random genomic positions (in the set N<sub>Global</sub>) it is 0.050 (ratio = 9.98, <it>t</it>-test <it>p</it>-value = 1.4896 &#215; 10<sup>-26</sup>).</p>
               </text>
               <graphic file="gb-2008-9-2-r36-3"/>
            </fig>
            <p>The obtained significant combination of matrices determines the global context that is characteristic for the regulatory regions around functional HNF4&#945; binding sites in the genome. The biological interpretation of located composite modules is based on the concept of the 'enhanceosome', postulating that, for a proper performance of regulatory function, a TF, while binding to the DNA target sites, should participate in many protein-protein interactions with other TFs binding in the neighborhood of the sites. As can be demonstrated, the algorithm selected HNF4 matrices three times, for example, as a single element, as well as parts of matrix pairs with another HNF4 matrix and with the V$EFC matrix. Note that the algorithm additionally selected TF matrices corresponding to recognition motifs of, for instance, MAZ, ER, FOX, CREB, Elk1 (Ets domain factor), COUP-TF, RFX1 and some others. Strikingly, it is known that HNF4&#945; TFs cooperate with ER <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and build synergistic composite elements with CREB <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp> and antagonistic composite elements with COUP-TF <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> (see also the TRANSCompel<sup>&#174; </sup>entries C00369, C00129, and C00124). Interaction and cooperation between some other factors listed in the composite module is also known, for example, COUP-TF with ER <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and CREB with Ets <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Thus, the found composition around known HNF4&#945; binding sites represents potential interaction partners of HNF4 factors, therefore providing functionality in the regulation of HNF4&#945; target genes. Note that in the case of computing the global context there was no test set available, that is, all known sites were used to train the algorithm. 
In order to validate the computed composite module, we performed a series of ten data shuffling experiments. Each time, the assignments of sequences to the positive and negative sets were randomly shuffled and CMA was applied in order to find a matrix combination that would best discriminate between these sets of sequences. No good discrimination was obtained in these shuffling iterations. The maximum ratio achieved between the mean values was 1.6, with <it>t</it>-test <it>p</it>-values of 10<sup>-5</sup>; these <it>p</it>-values are much higher (that is, less significant) than in the unshuffled case (Figure <figr fid="F3">3</figr>).</p>
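            <p>The logic of such a shuffling control can be illustrated with a minimal sketch. It is not the CMA procedure itself: CMA re-selects a matrix combination on every shuffled data set, whereas the sketch below keeps one fixed score per sequence and only permutes the positive/negative labels. The function names and the toy scores are hypothetical and serve only to show how a ratio of mean scores is compared between the true and the shuffled assignments.</p>

```python
import random
from statistics import mean


def mean_ratio(scores, labels):
    """Ratio of the mean score in the positive set to the mean in the negative set."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    return mean(pos) / mean(neg)


def shuffled_ratios(scores, labels, rounds=10, seed=0):
    """Recompute the mean ratio after randomly permuting the set labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ratios = []
    for _ in range(rounds):
        permuted = list(labels)
        rng.shuffle(permuted)  # reassign sequences to the two sets at random
        ratios.append(mean_ratio(scores, permuted))
    return ratios


# Toy data (illustrative only): five 'positive' and five 'negative' sequences.
scores = [0.9, 0.8, 0.85, 0.95, 0.7, 0.1, 0.2, 0.15, 0.05, 0.3]
labels = [True] * 5 + [False] * 5

observed = mean_ratio(scores, labels)          # ratio for the true assignment
null_ratios = shuffled_ratios(scores, labels)  # ratios for ten shuffled assignments
print(observed, max(null_ratios))
```

            <p>If the discrimination is genuine, the ratio for the true assignment should clearly exceed the ratios seen after shuffling.</p>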
         </sec>
         <sec>
            <st>
               <p>Complex criteria for determining functional HNF4&#945; binding sites</p>
            </st>
            <p>We determined the following complex recognition criteria for a sequence of length 1,000 bp to be a potential target for HNF4&#945; TFs: the maximal matrix score of an HNF4 site in the sequence (<it>q</it><sub><it>max</it></sub>) should be >0.8; the maximal local context score (<it>d</it><sub><it>max</it></sub>) should be >0.28; the maximal global context score (CM;<it> v</it><sub><it>max</it></sub>) should be >0.18; the sum of matrix scores of all HNF4 sites found in the sequence (<it>q</it><sub><it>Sum</it></sub>) should be >10.0; and the TFBS with the maximal score should be considered as the binding site for HNF4&#945;, whereas the 1,000 bp regions provide the functional context for this site.</p>
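            <p>The four-part test described above can be written down as a short sketch. The class and function names are hypothetical, the threshold values are passed in explicitly as quoted in the text, and it is assumed that the per-site matrix, local context and global context scores of one 1,000 bp sequence have already been computed by other means.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    q_max: float  # cut-off for the maximal HNF4 matrix score
    d_max: float  # cut-off for the maximal local context score
    v_max: float  # cut-off for the maximal global context score
    q_sum: float  # cut-off for the sum of matrix scores of all HNF4 sites


def passes_criteria(site_scores, local_scores, global_scores, t):
    """Return True only if all four requirements hold for one sequence."""
    if not (site_scores and local_scores and global_scores):
        return False  # no candidate site means no prediction
    return (
        max(site_scores) > t.q_max
        and max(local_scores) > t.d_max
        and max(global_scores) > t.v_max
        and sum(site_scores) > t.q_sum
    )


# Threshold values as quoted in the text above.
quoted = Thresholds(q_max=0.8, d_max=0.28, v_max=0.18, q_sum=10.0)

# Toy sequence (illustrative numbers only): twelve candidate sites.
sites = [0.95] + [0.85] * 11
accepted = passes_criteria(sites, [0.31, 0.12], [0.22], quoted)
rejected = passes_criteria(sites, [0.31, 0.12], [0.10], quoted)
print(accepted, rejected)
```

            <p>Because the requirements are combined with a logical AND, failing any single one is enough to reject a sequence.</p>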
            <p>These rather complex criteria were derived through an iterative computation of different combinations of each individual threshold with the goal of achieving a method that would have approximately 90% sensitivity and would efficiently use individual criteria of the local and global contexts. Finally, we obtained criteria that yield 87% sensitivity on the Y<sub>Global </sub>set (known functional sites for HNF4 factors with 500 bp flanks) and thresholds of the local and global context scores were set at the minimum of the sum of errors of these two criteria (Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr>, respectively). As can be seen from these two figures, the relative contribution of the global context to prediction power is larger than that of the local context. The sum of the errors for the local context is approximately two times higher than the sum of the errors for the global context. This means that in applying these complex criteria, in approximately 13% of cases we may miss an identification of functional HNF4&#945; binding sites (the false negative rate of the method is 13%).</p>
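            <p>Placing a threshold at the minimum of the sum of errors can be sketched as follows. This is an assumed, simplified reading of the procedure: the error sum is taken as the false negative rate plus the false positive rate for a single score, and the candidate thresholds are the observed score values. The function names and toy scores are hypothetical.</p>

```python
def error_sum(threshold, positives, negatives):
    """False negative rate plus false positive rate at a given cut-off."""
    fn = sum(1 for s in positives if threshold >= s) / len(positives)
    fp = sum(1 for s in negatives if s > threshold) / len(negatives)
    return fn + fp


def best_threshold(positives, negatives):
    """Pick the observed score value that minimizes the error sum."""
    candidates = sorted(set(positives) | set(negatives))
    return min(candidates, key=lambda t: error_sum(t, positives, negatives))


# Toy context scores (illustrative only).
positives = [0.30, 0.35, 0.40, 0.25, 0.50]
negatives = [0.05, 0.10, 0.20, 0.12, 0.33]

chosen = best_threshold(positives, negatives)
print(chosen, error_sum(chosen, positives, negatives))
```

            <p>A lower minimal error sum for one score than for another indicates that the first score separates the two sets better.</p>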
         </sec>
         <sec>
            <st>
               <p>Analyzing ChIP-chip data for HNF4&#945; sites</p>
            </st>
            <p>Using the HNF4&#945; PWM, which was built on a representative set of 73 known functional HNF4&#945; binding sites in mammalian genes, and two new methods (local and global context) for estimating the DNA context around functional HNF4&#945; binding sites, as discussed above, we analyzed the ChIP-chip data for HNF4&#945; reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. We interrogated two sets of sequences: 'positives', a set of 1,605 sequences that were reported as HNF4&#945;-targeted genes in hepatocytes; and 'negatives', a set of 10,852 sequences that were reported not to be contacted by HNF4&#945; in hepatocytes and pancreatic islets. The average length of the sequences reported by Odom <it>et al</it>. was approximately 1 kb <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. In each sequence of both sets, we computed the number of potential HNF4&#945; binding sites (matrix score >0.8), the sum of the scores for all sites, and the maximal score of the sites found in the sequence. Thereafter, we calculated the local context score (<it>d</it>) and the global context score (<it>v</it>) for each potential HNF4&#945; binding site in these sequences and reported the maximal scores obtained in each sequence. We applied the complex recognition criteria (see above) to the sequences in these two sets. As a result, only 21% of the 'positive' set (that is, 375 sequences out of 1,605) passed the criteria. Indeed, 79% of the sequences were rejected, since they did not pass one or several requirements as defined above. In order to estimate the rate of false positives of our method, we applied it to the set of 'negative' sequences. Our complex criteria rejected 97.4% of these sequences, giving us an overall estimate of 2.6% for the false positive rate.
Figure <figr fid="F4">4</figr> depicts a plot of the global and local context scores, comparing the distribution of the 375 sequences selected from the 'positive' set with the distribution of all sequences in the 'negative' set. Obviously, the selected sequences are characterized by the highest global and local context scores, whereas the majority of the 'negative' sequences are characterized by low values for these two scores. The list of the 375 sequences that passed our criteria is given in Additional data file 2. Furthermore, Figure <figr fid="F5">5</figr> summarizes the data obtained in the analysis of known HNF4&#945; binding sites, as well as 'positive' and 'negative' sets of sequences derived from ChIP-chip experiments reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. These data clearly show that the majority of the sequences revealed in the ChIP-chip experiments of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> differ quite significantly in their local and global context from the sequences of known and experimentally confirmed HNF4&#945; binding sites. We estimate that only 20% of these sequences fulfill our requirements to be considered as faithful functional HNF4&#945; binding sites. Note that Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> assume a 16% false discovery rate in the identification of binding sites in their ChIP-chip experiments. Application of our analysis to the Odom <it>et al</it>. data suggests that about 80% of the ChIP-chip identified targets do not meet the contextual requirements that characterize biologically functional sites and, therefore, may not be involved in HNF4&#945;-dependent regulation of gene transcription.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Plot of the distribution of global and local contexts in the 375 sequences (red squares) selected from the 'positive' set of ChIP-chip results reported by Odom <it>et al</it>. [11] versus all 10,852 sequences from the 'negative' (not binding; H13K_noHNF4) set (green dots) reported for the same experiment</p>
               </caption>
               <text>
                  <p>Plot of the distribution of global and local contexts in the 375 sequences (red squares) selected from the 'positive' set of ChIP-chip results reported by Odom <it>et al</it>. [11] versus all 10,852 sequences from the 'negative' (not binding; H13K_noHNF4) set (green dots) reported for the same experiment. The selected sequences are characterized by the highest global and local context scores whereas the majority of the 'negative' sequences are characterized by low values for these two scores. The vertical and horizontal lines show two thresholds chosen for the global context score (0.18) and the local context score (0.28).</p>
               </text>
               <graphic file="gb-2008-9-2-r36-4"/>
            </fig>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Percentages of sequences passing the complex recognition criteria in the set of known HNF4 binding sites (TRANSFAC<sup>&#174; </sup>HNF4 sites), in the set of 'positive' sequences based on ChIP-chip experiments reported by Odom <it>et al</it>. [11] for hepatocytes and in the set of 'negative' sequences described by Odom <it>et al</it>. [11]</p>
               </caption>
               <text>
                  <p>Percentages of sequences passing the complex recognition criteria in the set of known HNF4 binding sites (TRANSFAC<sup>&#174; </sup>HNF4 sites), in the set of 'positive' sequences based on ChIP-chip experiments reported by Odom <it>et al</it>. [11] for hepatocytes and in the set of 'negative' sequences described by Odom <it>et al</it>. [11]. From the last set we estimate that the percentage of false results from our method is about 2.6%.</p>
               </text>
               <graphic file="gb-2008-9-2-r36-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Linking HNF4&#945; binding sites to gene expression</p>
            </st>
            <p>We also applied our computational method to data reported by Naiki <it>et al</it>. <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and Lucas <it>et al</it>. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Notably, these investigators carried out microarray experiments to identify genes whose expression differed upon targeted overexpression of HNF4&#945;. From these studies a list of differentially expressed genes was obtained. Additionally, we compared the differentially expressed genes with findings reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, who performed ChIP-chip experiments with HNF4&#945;. We thus compared data from two different approaches for the identification of novel HNF4&#945; target genes, that is, targeted overexpression of HNF4&#945; and ChIP-chip. We then applied our computational approach (by use of the complex recognition criteria described above) to interrogate the data sets. The results are presented in Table <tblr tid="T3">3</tblr>. Only a small fraction of identified genes could be compared directly: 75 and 70 differentially expressed genes (Up + Dn) reported by Naiki <it>et al</it>. <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and Lucas <it>et al</it>. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, respectively, and 150 genes whose expression did not change (NC). As can be seen from the data given in Table <tblr tid="T3">3</tblr>, our computational method and the ChIP-chip data are similar when correlated with the gene expression data of HNF4&#945;-targeted genes (see the Table <tblr tid="T2">2</tblr> legend); approximately 18-20% of differentially expressed genes were similarly identified by the ChIP-chip and our computational method based on the data of 145 differentially expressed genes.
Indeed, several genes targeted by HNF4&#945; were identified by both methods (for example, 5 genes (<it>ACADVL</it>, <it>RBKS</it>, <it>SLC35D1</it>, <it>ATP7B</it>, <it>MGST2</it>) out of 70 from the Lucas <it>et al</it>. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> data set).</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Comparison of gene lists between HNF4&#945; expression data, ChIP-chip data, and computational prediction of target promoters</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Gene sets in ChIP-chip experiment</p>
                     </c>
                     <c ca="center">
                        <p>HNF4 targets identified by PWM V$HNF4_Q6_1 (cut-off = 0.9)</p>
                     </c>
                     <c ca="center">
                        <p>HNF4 targets identified by local + global context</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Positive*</p>
                     </c>
                     <c ca="center">
                        <p>Negative<sup>&#8224;</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Gupta <it>et al</it>. [44]</p>
                        <p>Up + Dn (133)</p>
                     </c>
                     <c ca="center">
                        <p>13 (9.8%)</p>
                     </c>
                     <c ca="center">
                        <p>ND</p>
                     </c>
                     <c ca="center">
                        <p>66 (49.6%)</p>
                     </c>
                     <c ca="center">
                        <p>41 (30.8%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Naiki <it>et al</it>. [7]</p>
                        <p>Up + Dn (75)<sup>&#8225;</sup></p>
                     </c>
                     <c ca="center">
                        <p>17 (22.7%)</p>
                     </c>
                     <c ca="center">
                        <p>32 (42.7%)</p>
                     </c>
                     <c ca="center">
                        <p>15 (20%)</p>
                     </c>
                     <c ca="center">
                        <p>14 (18.7%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lucas <it>et al</it>. [9]</p>
                        <p>Up + Dn (70)<sup>&#8225;</sup></p>
                     </c>
                     <c ca="center">
                        <p>13 (18.6%)</p>
                     </c>
                     <c ca="center">
                        <p>39 (55.7%)</p>
                     </c>
                     <c ca="center">
                        <p>17 (24.3%)</p>
                     </c>
                     <c ca="center">
                        <p>13 (18.6%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lucas <it>et al</it>. [9]</p>
                        <p>NC (150)<sup>&#8225;</sup></p>
                     </c>
                     <c ca="center">
                        <p>20 (13.3%)</p>
                     </c>
                     <c ca="center">
                        <p>99 (66%)</p>
                     </c>
                     <c ca="center">
                        <p>29 (19.3%)</p>
                     </c>
                     <c ca="center">
                        <p>4 (2.7%)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*The number of differentially expressed genes with HNF4&#945; binding sites as identified by ChIP-chip experiments. <sup>&#8224;</sup>The number of differentially expressed genes with no HNF4&#945; binding as determined by ChIP-chip experiments. <sup>&#8225;</sup>The number of genes whose expression was upregulated or downregulated by more than two-fold. NC, genes with no change of expression; ND, not determined.</p>
               </tblfn>
            </tbl>
            <p>At the same time, our computational method for identifying HNF4&#945; gene targets is less error-prone; there were 2.7% false results based on the computational method compared to 13.3% false results determined using the ChIP-chip method (based solely on gene expression data from Lucas <it>et al</it>. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>) (Table <tblr tid="T3">3</tblr>, last row). It is of considerable importance that when using just a single HNF4 PWM and ignoring local and global sequence context the prediction of HNF4&#945; target genes becomes prone to generating false positive errors (19% as shown in Table <tblr tid="T3">3</tblr>).</p>
         </sec>
         <sec>
            <st>
               <p>Search for HNF4&#945; functional sites amongst all known human gene promoters</p>
            </st>
            <p>We applied the method developed for identifying putative HNF4&#945; gene targets to the full set of promoters of human genes annotated in TRANSPro&#8482; database release 2.1 (containing 15,455 promoters). First, we scanned promoters in the region from -500 to +100 around the transcription start site (TSS) for matches of the HNF4 weight matrix with matrix score <it>q </it>> 0.8 accompanied by local context score <it>d </it>> 0.48. We identified 3,009 promoters that had at least one site passing both these criteria. Next, we chose the highest scoring match of the HNF4 matrix in each of the promoters and retrieved 500 bp flanking regions around the match. We applied the complex criteria (see above) to obtain a set of sequences, which led to the prediction of 375 target promoters; among them 121 promoters of genes encoding TFs and other components of the cell signaling system. These genes attracted our attention for experimental verification by electrophoretic mobility shift assay (EMSA) as reported here. The full list of the predicted target promoters is given in Additional data file 2.</p>
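            <p>The first two steps of this scan (pre-filtering candidate sites and extracting the flanking window around the best match) can be sketched as follows. The data structure and function names are hypothetical, and it is an assumption of the sketch that the highest scoring match is chosen among the sites that passed both cut-offs. The matrix and local context scores are assumed to be precomputed.</p>

```python
from typing import List, NamedTuple, Optional, Tuple


class Site(NamedTuple):
    position: int  # position relative to the TSS
    q: float       # HNF4 matrix score
    d: float       # local context score


def best_candidate(sites: List[Site], q_min: float = 0.8,
                   d_min: float = 0.48) -> Optional[Site]:
    """Keep sites passing both cut-offs and return the one with the highest q."""
    passing = [s for s in sites if s.q > q_min and s.d > d_min]
    return max(passing, key=lambda s: s.q, default=None)


def flank_window(site: Site, flank: int = 500) -> Tuple[int, int]:
    """Coordinates of the region spanning `flank` bp on each side of the site."""
    return (site.position - flank, site.position + flank)


# Toy promoter (illustrative numbers only) with three matrix matches.
promoter_sites = [Site(-120, 0.91, 0.52), Site(-40, 0.95, 0.30), Site(30, 0.85, 0.60)]

best = best_candidate(promoter_sites)
window = flank_window(best)
print(best, window)
```

            <p>The resulting window is the sequence to which the complex criteria would then be applied.</p>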
         </sec>
         <sec>
            <st>
               <p>Electrophoretic mobility shift assay confirmation</p>
            </st>
            <p>Supershift experiments with probes for established recognition sites for HNF4&#945;, that is, promoter regions derived from HNF1&#945;, AAT, APOB, AGT, APOC3, CYP2D6, TF, ALDH2, APOC2 and PCK1, resulted in binding of HNF4&#945; (Figure <figr fid="F6">6a</figr>). This exemplifies the selectivity and sensitivity of EMSA for validating HNF4&#945; binding sites for ten arbitrarily chosen but known targets of HNF4&#945;. From the list of 375 predicted HNF4&#945; target genes (see above) we selected a further 10 novel HNF4&#945; binding sites for experimental confirmation that are characterized by high PWM and local and global context scores and that were not reported in the study of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Note that EMSA revealed binding of HNF4&#945; to NCOA2, TFF2, CHEK1, CD63, SH3GL2, RND2, ESRRBL1 and DDB1, whereas supershift experiments did not confirm HNF4&#945; binding to NEUROG3 and IL6 (Figure <figr fid="F6">6b</figr>), thus providing an estimate of 80% for the sensitivity of our computational method for <it>de novo </it>prediction of HNF4&#945; binding sites. A summary of the biological function of these newly identified HNF4&#945; target genes is given in Table <tblr tid="T4">4</tblr>.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>EMSA confirmation experiments</p>
               </caption>
               <text>
                  <p>EMSA confirmation experiments. <b>(a) </b>EMSA with established HNF4&#945; recognition sites. Electrophoretic mobility shift experiment with 2.5 &#956;g Caco-2 cell nuclear extracts and oligonucleotides corresponding to promoter regions derived from <it>HNF1</it>, <it>APOB</it>, <it>AAT</it>, <it>AGT</it>, <it>APOC3</it>, <it>CYP2D6</it>, <it>TF</it>, <it>ALDH2</it>, <it>APOC2 </it>and <it>PCK1 </it>as <sup>32</sup>P labeled probes. For supershift analysis an antibody directed against HNF4&#945; was added (+). <b>(b) </b>EMSA with predicted novel HNF4&#945; recognition sites. Electrophoretic mobility shift experiment with 2.5 &#956;g Caco-2 cell nuclear extracts and oligonucleotides corresponding to promoter regions derived from <it>NCOA2</it>, <it>TFF2</it>, <it>CHEK1</it>, <it>CD63</it>, <it>SH3GL2</it>, <it>RND2</it>, <it>ESRRBL1</it>, <it>DDB1</it>, <it>NEUROG3 </it>and <it>IL6 </it>as <sup>32</sup>P labeled probes. For supershift analysis an antibody directed against HNF4&#945; was added (+). <b>(c) </b>EMSA with potential recognition sites from putative HNF4&#945; targets reported by Odom <it>et al</it>. [11]. Electrophoretic mobility shift experiment with 2.5 &#956;g Caco-2 cell nuclear extracts and oligonucleotides corresponding to promoter regions derived from <it>AZI2</it>, <it>CFL2</it>, <it>GPHN</it>, <it>C14orf119</it>, <it>PPP1R3C</it>, <it>AKR1C3</it>, <it>NPAS2</it>, <it>MDM2</it>, <it>CLCN3 </it>and <it>CBX3 </it>as <sup>32</sup>P labeled probes. For supershift analysis an antibody directed against HNF4&#945; was added (+).</p>
               </text>
               <graphic file="gb-2008-9-2-r36-6"/>
            </fig>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Biological functions of novel predicted HNF4&#945; gene targets</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Gene symbol</p>
                     </c>
                     <c ca="left">
                        <p>Gene name</p>
                     </c>
                     <c ca="left">
                        <p>Biological function</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>CD63</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>CD63 antigen (melanoma 1 antigen)</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Localization plasma membrane</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Endocytosis</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>CHEK1</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>CHK1 checkpoint homolog</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Cell cycle</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Negative regulation of cell proliferation</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>DNA damage response</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>ESRRBL1</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>estrogen-related receptor beta like 1</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Induction of neuronal apoptosis</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>DDB1</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>damage-specific DNA binding protein 1, 127 kDa</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>DNA repair</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>NCOA2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>nuclear receptor coactivator 2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Regulation of transcription</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Signal transduction</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>RND2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Rho family GTPase 2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Signal transduction</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Protein transport</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Dendrite development</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>SH3GL2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>SH3-domain GRB2-like 2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Central nervous system development</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Signal transduction</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Endocytosis</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>TFF2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>trefoil factor 2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Defense response</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Digestion</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>In addition, we wished to verify HNF4&#945; binding sites predicted by ChIP-chip experiments <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Note that nearly 80% of the proposed HNF4&#945; binding sites were rejected by our computational method, which combines analysis of HNF4 matrices with the local and global contexts of the sequences. For this purpose, we selected ten genes that were reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> to be targeted by HNF4&#945; in hepatocytes but were characterized by our computational method with extremely low scores for the HNF4 weight matrix and local and global contexts (all four tests comprising the complex criteria set by us failed to identify these genes as HNF4&#945; targets). These ten potential sites (in promoters of genes <it>NPAS2</it>, <it>GPHN</it>, <it>PPP1R3C</it>, <it>AKR1C3</it>, <it>CFL2</it>, <it>MDM2</it>, <it>CLCN3</it>, <it>CBX3</it>, <it>AZI2 </it>and <it>C14orf119</it>) were then analyzed for HNF4&#945; binding. Strikingly, none of these sites was bound by HNF4&#945;, as shown by the supershift experiments (Figure <figr fid="F6">6c</figr>).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p><it>De novo </it>computational identification of genes targeted by various TFs is a challenging task, especially in genomes of higher eukaryotic organisms, which are characterized by extremely large gene regulatory regions. Indeed, binding of TFs to their cognate sites on DNA is a complex process that requires the presence of a specific short sequence pattern in DNA, commonly described by a PWM. Furthermore, the specific local sequence context in the vicinity of the binding site is required to provide favorable conditions for DNA conformation and DNA flexibility <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. In addition, local structures such as short repeats and palindromes are often observed and, as discussed before, are needed to enable an optimal environment for homo- and heterodimerization of TFs <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. The particularly important role of the global context of TFBSs in determining cooperative binding of factors with other TFs to their neighboring DNA sites is broadly recognized <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. A broad collection of experimentally proven facts on cooperative binding of two or more TFs to so-called composite regulatory elements with synergistic effects on the regulation of gene expression is provided by the TRANSCompel<sup>&#174; </sup>database <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Among these are several known examples of nuclear receptors that are involved in such composite elements (for example, glucocorticoid receptor, androgen receptor and others). However, no bioinformatics tools are available so far that enable a systematic analysis of the combinatorial sequence context of genomic binding sites.</p>
         <p>In general, there is a definitive need to develop novel computational approaches to improve the description of the DNA patterns required for TF binding. Ellrott and co-workers <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> applied a Markov chain model to identify HNF4&#945; binding sites in order to improve recognition accuracy of the DNA binding pattern. They have demonstrated that the approach performs better than PWMs alone, but this approach does not consider any local context on the flanks of sites that indeed play a crucial role in promoter activation and DNA binding <it>in vivo</it>.</p>
         <p>Recently, local context in the form of short repeats has been successfully implemented to improve recognition of binding sites for nuclear receptors <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. Extending the previously published approach <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> to the application of hidden Markov models, Sandelin and Wasserman <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> modeled various known constellations of direct, inverted and everted repeats for different sites of nuclear receptors and were able to improve the general precision of the recognition. This approach looks very promising, although it lacks any capability to classify predicted sites in order to identify which particular TF from the large family of nuclear receptors is able to bind to the predicted sites. In addition, we show here that binding sites for such nuclear receptors as HNF4&#945; are highly enriched in various repeat structures, which does not completely fit with the existing paradigm that the DR1 repeat comprises the canonical structure of HNF4 sites. This makes it extremely difficult to judge factor recognition using an oversimplified model based solely on the repeat structures of sites.</p>
         <p>We therefore developed a novel approach for the recognition of functional HNF4&#945; binding sites by analyzing the local and global contexts of targeted genes. The method is based on the assumption that the sequence context surrounding TFBSs in DNA is very important for the process of TF binding to the site and, most importantly, for providing specificity of the TF in the regulation of gene expression - by either activation or repression of the gene in particular cellular situations. The sequence contexts of the TFBSs actually make them functional: in the absence of the proper context, the possible binding of a TF to a particular site on DNA can be impaired or made functionally neutral (which means that the factors are bound to the DNA, but do not influence expression of the gene; such sites are, therefore, non-functional).</p>
         <p>In the current work, we performed a thorough analysis of the local nucleotide context on the flanks of known functional HNF4&#945; sites, as well as in the whole local region occupied by the sites. We improved our earlier published approach to analyzing local context <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, which is based on a SiteVideo method <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, and introduced new types of contextual features that modeled various repeated structures in the sequences on the flanks of the sites. Interestingly, the revealed short oligonucleotide features and repeats can be classified into three categories. The first category includes oligonucleotides like ANGD and MDDR and repeats like AV-VS, VS-YA and (RBNH)<sup>2 </sup>that fit different parts of the HNF4 consensus sequence and appear to be overrepresented in a rather wide area around the center of the binding sites. We can interpret such features as a signature of overrepresentation of HNF4 site-like patterns in the local area surrounding the functional HNF4 site, which may play a role in increasing the probability of HNF4&#945; binding to this site. The second category includes oligonucleotides like CDDM and repeats like (DNCD)<sup>2</sup>, which are overrepresented in a quite small area corresponding to the central positions of the sites. Such features correspond to the central HNF4 site pattern, but they reveal some contextual features of the functional HNF4 sites that cannot be described by the PWM model, for example, correlation between neighboring nucleotides that cannot be captured in full by the mononucleotide weight matrix. The third category includes 'negative' features that reveal oligonucleotides to be underrepresented at functional binding sites when compared with background sequences. 
Such negative features can be local, as in the case of the repeats (BNDK)<sup>2</sup>, (NBHV)<sup>2 </sup>and (NVYB)<sup>2</sup>, which again describe some mutual nucleotide correlations that cannot be captured by PWMs, or distributed, such as BR-NT, which can be interpreted as an 'echo' of some physical-chemical properties of DNA that may interfere with the binding or functioning of the TFs. Notably, the length of the oligonucleotides tested by our method was restricted to four letters of the extended code, mainly because of the high computational complexity of the calculations; however, this oligonucleotide length seems quite optimal for revealing statistically significant features of DNA sequences.</p>
         <p>We assume that, in addition to the local context, the global context of the TFBSs in the regulatory regions of genes dictates whether these sites are functional. The global context, which we model by specific combinations of binding sites of various TFs, provides some sort of 'scaffold' on DNA to enable cooperative or antagonistic interactions between TFs. These multiple and complex interactions, if correctly organized in space and time, give rise to the regulatory function of the TFBSs under investigation. It is clear by now that binding of a single TF to its cognate site on DNA alone does not guarantee the proper functional activity of the targeted gene. Further interactions with other TFs in the transcription complex and in the enhanceosome are necessary to acquire the full regulatory functionality.</p>
         <p>Specifically, known functional combinations of TFBSs were used before in a number of promoter analysis approaches, for example, for identifying muscle-specific promoters <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>, the promoters of liver-enriched genes <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, yeast genes <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, immune-specific genes <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>, and the promoters of genes regulated during the cell cycle <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> or genes involved in antibacterial defense responses <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>. A number of approaches identifying composite motifs were created, including BioProspector <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>, Co-Bind <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>, MITRA <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, and dyad search <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. These programs help to discover <it>ab initio </it>new regulatory sites for yet unknown TFs. Another set of methods has been developed to discover composite modules by utilizing information on potential binding sites for known TFs (stochastic methods such as ClusterScan <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> and TOUCAN system <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>, and probabilistic methods such as reported in <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>). We combined these two approaches by computing local context as an exhaustive <it>ab initio </it>composite motif discovery method with the global context - the powerful composite module discovery method based on application of a genetic algorithm.</p>
         <p>Furthermore, we wish to point out that our method considers several alternative PWMs for the calculation of the global context. The use of such alternative PWMs for constructing the composite module (see Materials and methods) enables more reliable predictions. In cases with small training sets but with data derived from multiple rounds of computations, the use of different matrices is particularly meaningful. These computations are far from statistical saturation, but new sites may eventually add a certain bias and potentially drive the new PWM away from the functional binding site sequence context. In the case of HNF4&#945; we included both half-site and full-site matrices. Still, the full-length PWMs are able to capture some subtle differences in the spacing sequences between the 'repeats'. Recent reports confirm our earlier observation that such 'gap' sequences between two repeats of the nuclear receptor sites are often actually more conserved between species than the repeats themselves. Therefore, the combined usage of fixed-length PWMs together with the distributed oligonucleotides on the flanks provides a more robust method for site detection than each method separately.</p>
         <p>To compare the findings from our algorithm with the best existing methods, we performed an independent run of the TOUCAN software <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> on the set of HNF4&#945; sites. Notably, TOUCAN is similar to our method and is based on a genetic algorithm. It identified a combination of 14 PWMs for different TFs, including two matrices for HNF4&#945; factors (V$HNF4_01 and V$HNF4_01_B) and matrices for other factors (V$USF_01, V$OCT1_02, V$SP1_Q6, V$PPARG_03 and some others). Interestingly, except for the HNF4 matrices, there were no further correlations with matrices selected by our method (Table <tblr tid="T2">2</tblr>). This difference can be explained by the ability of our method to identify pairs of matrices and also by the ability to optimize the cut-off values, which is not possible in TOUCAN. Another advantage of our method is the possibility to include information on the tissue specificity of the factors, through extensive use of factor expression annotation in the TRANSFAC<sup>&#174; </sup>database. We further compared our method with the NUBIscan algorithm <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. Our approach combines many different features of the most conserved part of the sites as well as various features of the local and global contexts, whereas the NUBIscan algorithm relies solely on the 'repeated' structure of the nuclear receptor sites, which is indeed a very profound property of these sites but not the only one. And, similar to the later published Wasserman approach <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, the NUBIscan algorithm lacks the capability to classify predicted sites in order to identify which particular TF from the large family of nuclear receptors is able to bind to a predicted site. Notably, our method was designed specifically for the recognition of HNF4 sites. 
Nonetheless, our strategy to define the local and global sequence contexts is a highly generalized method and can be applied to any TF.</p>
         <p>An additional point of consideration is the 'regulatory potential score', as introduced in the work of Elnitski and co-workers <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, which refers to the five-way multi-species alignment introduced in the UCSC Genome Browser. In our study we did not restrict ourselves to conserved sites based only on multispecies homology of regulatory regions (phylogenetic footprinting). Although this concept is quite popular for the selection of evolutionarily conserved regulatory sites, the method suffers from low sensitivity: functional TFBSs are frequently missed, mainly because the very complicated evolutionary history of mammalian regulatory sequences can hardly be modeled by the simple concept of divergence that underlies phylogenetic footprinting.</p>
         <p>Furthermore, promoter regions are characterized by very specific average base composition as well as composition of di-nucleotides. It is known that whereas the overall genomic sequences are highly depleted of CG dinucleotides, promoters are often located near high concentrations of them, in or near so-called CpG islands. We therefore compared the nucleotide and dinucleotide composition of the promoter sequences of the known HNF4&#945; sites and sequences that were used as probes in the ChIP-chip experiments of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, since these were the sequences to which our method was applied. We did not find any significant difference in the nucleotide and dinucleotide compositions of these two sets of sequences, which is not a surprise since the probes of the chip were designed using known genomic promoters. Consequently, since training and test sequences are very similar in their context, the CpG bias, if any, of the HNF4-containing promoters of the training set cannot bias the results of the analysis of the sequences from the ChIP-chip experiment.</p>
         <p>Additionally, many authors attribute a certain functional role to the CpG islands in promoters. These islands can be some sort of centers of regulated DNA methylation, which can effectively contribute to hepatocyte-specific gene regulation by providing HNF4&#945; binding sites with necessary functional context. The absence of such a CpG context in the vicinity of HNF4&#945; binding sites may potentially render them functionally neutral. Therefore, some CG dinucleotide-like features that were included in the local context (for example, elevated frequency of the oligonucleotide CDDM, where D is (A/T/G) and M is (A/C)) may reflect this 'CpG' bias of functionally active promoter sequences and, therefore, help to identify the functionally active HNF4&#945; binding sites.</p>
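         <p>As a rough illustration of how a degenerate four-letter feature such as CDDM can be located in a sequence, the following Python sketch expands the standard IUPAC extended code into a regular expression and counts overlapping matches. The function names and the example sequence are hypothetical and are not part of the method described here; only the code letters themselves follow the IUPAC nomenclature.</p>
         <p>
```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'CDDM' into a regular expression."""
    return "".join("[" + IUPAC[letter] + "]" for letter in pattern.upper())


def count_matches(pattern, sequence):
    """Count overlapping occurrences of an IUPAC pattern in a sequence."""
    # A lookahead group lets the regex engine report overlapping matches.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return sum(1 for _ in regex.finditer(sequence.upper()))


# Hypothetical example sequence, not taken from the paper.
example = "CAGACTTTGCAAA"
print(iupac_to_regex("CDDM"))
print(count_matches("CDDM", example))
```
         </p>
         <p>Comparing such counts between functional sites and background sequences, position by position, would be one simple way to look for over- or underrepresented features; the statistical criteria actually used are those described in Materials and methods.</p>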
         <p>We further studied the influence of the distance from HNF4&#945; sites to TSSs. As shown in Additional data file 1, the location of known HNF4 sites is variable in promoter sequences and may range from a position close to the TSS to up to +10 kb and -11 kb. Under such high variability, it is improbable that the method can 'memorize' the position of TSSs during training since sequences in the training set are not aligned in relation to TSSs.</p>
         <p>We then applied the developed methods to analysis of the data derived from ChIP-chip experiments reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> for HNF4&#945;. This study is based on chromatin immunoprecipitation combined with DNA-DNA hybridization on a microarray containing 13,000 human promoter sequences. In the study of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, the number of HNF4&#945;-targeted promoters was unexpectedly high; 1,575 potential HNF4&#945; target genes in hepatocytes were identified, corresponding to 42% of the genes occupied by RNA polymerase II. Only 48 (3%) of the 1,575 putative HNF4&#945; targets were verified, however, in separate gene-specific ChIP-chip experiments. Additionally, HNF4&#945; DNA binding was not distinguished from protein-protein interactions, as <it>in vitro </it>binding was not analyzed. We applied our algorithm to the proposed HNF4&#945; gene targets and found merely 20% of them to obey the complex computational criteria (the presence of appropriate local and global contexts) that can predict the functional activity of these binding sites. We further stratified our approach by comparing HNF4&#945; functional sites identified by us with independent gene expression data. This comparison shows that our computational approach is versatile and predicts expressed genes directly targeted by the HNF4&#945; TF with similar sensitivity to chromatin immunoprecipitation (ChIP)-chip experiments. In strong contrast, the false discovery rate of the computational method is almost five times lower than that of the ChIP-chip method. This confirms our suspicion that many of the HNF4&#945; binding sites predicted by Odom <it>et al</it>. 
<abbrgrp><abbr bid="B11">11</abbr></abbrgrp> are functionally neutral, whereas the developed computational method is able to recognize the functionally active HNF4&#945; binding sites based on verification of the local and global contexts of these sites.</p>
         <p>Furthermore, in a recent paper by Gupta <it>et al</it>. <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, regulation of gene expression was studied in pancreatic cells of an HNF4&#945; conditional knockout model. Expression analysis identified 133 genes as HNF4&#945; regulated. Regulated genes could be compared with the promoter array data of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Surprisingly, the overlap between differentially expressed genes and those bound by HNF4&#945; is rather small. In other words, of 133 genes whose expression was dependent on HNF4&#945;, only 13 have been identified by the location analysis reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Likewise, of 587 promoters occupied by HNF4&#945; in the study of Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, 574 showed no significant change in gene expression <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>. Therefore, 86% of HNF4&#945;-targeted genes proposed by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> did not differ in gene expression in the absence of HNF4&#945;. These estimates agree well with our computational approach where only 20% of the target genes proposed by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> could be computationally confirmed.</p>
         <p>In the most recent study <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> by the same investigators and through the application of an improved ChIP-chip assay, more than 4,000 HNF4&#945; target genes were identified. By comparing the list of genes identified in the ChIP-chip assay with the list of genes expressed in liver, the authors determined the combinatorial co-occupancy of binding sites of different factors in promoters of HNF4&#945; target genes. Furthermore, this feature correlated well with expression of these genes in hepatocytes. This agrees well with our findings and confirms the utility of our method in defining the local and global contexts for specific combinations of different TFBSs in the vicinity of functionally active HNF4&#945; binding sites of promoters of genes whose expression is regulated by HNF4&#945;. Notably, the combination of PWMs identified by the genetic algorithm (Table <tblr tid="T2">2</tblr>) captured two TFs, FOX and CREB. Strikingly, these factors were identified independently by Odom <it>et al</it>. <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> in an analysis of TFBSs that co-occurred with HNF4&#945; sites (Table <tblr tid="T2">2</tblr>, matrices indicated by asterisks).</p>
         <p>In a further study of Odom <it>et al</it>. <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>, the authors showed that two-thirds of the binding sites identified by ChIP-chip experiments are not conserved between human and mouse. Taking into account the well-conserved liver expression patterns of genes between these two species, we can conclude that far from all HNF4&#945; binding sites identified by the ChIP-chip method directly contribute to the regulation of gene expression.</p>
         <p>To experimentally validate our predictions, we selected two sets of promoters. The first set contained ten <it>ab initio</it>, and therefore novel, HNF4&#945; recognition sites predicted by the computational complex recognition criteria described above. Strikingly, eight of the ten binding sites (<it>NCOA2</it>, <it>TFF2</it>, <it>CHEK1</it>, <it>CD63</it>, <it>SH3GL2</it>, <it>RND2</it>, <it>ESRRBL2 </it>and <it>DDB1</it>) could be confirmed as HNF4&#945; binding sites in electromobility supershift experiments (Figure <figr fid="F6">6b</figr>). In addition, we studied another set of ten promoters that were reported by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> as targets for HNF4&#945;, but our computational method rejected them because of extremely low scores for the HNF4 weight matrix as well as low scores for local and global contexts. None of these sites (<it>NPAS2</it>, <it>GPHN</it>, <it>PPP1R3C</it>, <it>AKR1C3</it>, <it>CFL2</it>, <it>MDM2</it>, <it>CLCN3</it>, <it>CBX3</it>, <it>AZI2 </it>and <it>C14orf119</it>) was in fact bound by HNF4&#945;, as shown by electromobility supershift assays (Figure <figr fid="F6">6c</figr>). These findings suggest a high error rate among the targets proposed by Odom <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>.</p>
         <p>Finally, another computational approach has been applied to analyze the same set of HNF4&#945; (as well as HNF1 and HNF6) ChIP-chip data that is the focus of our current study. Indeed, Smith and colleagues <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> demonstrated that an application of combinations of motifs allowed for improvements in the prediction of the genomic location of TFBSs. In contrast to our approach, however, they performed a blind motif discovery instead of using the existing TF weight matrices. To the best of our knowledge, this makes the algorithm very complicated and increases the risk of missing important TF combinations that are characteristic of functionally active regulatory sites.</p>
         <p>Several further improvements to our algorithm can be considered in the future. Among the most important, we should consider the possibility of taking into account sequence conservation in the non-coding regulatory regions of genes between different species, for example, human and mouse. It was demonstrated in recent studies <abbrgrp><abbr bid="B48">48</abbr><abbr bid="B49">49</abbr><abbr bid="B50">50</abbr><abbr bid="B51">51</abbr></abbrgrp> that sequence conservation can be a good indication of the functional importance of a region. Indeed, such regions can bear functional TFBSs. Despite being quite useful, such considerations should be taken with care since regulatory regions are characterized by a high level of convergent evolution, which can provide non-divergent means of forming the functional context of a TFBS.</p>
         <p>Another direction for further improvements is considering known protein-protein interaction data between different TFs. Such data are partially available in databases such as TRANSFAC<sup>&#174;</sup>, TRANSPATH<sup>&#174; </sup>and BIND. Known interactions between TFs can help to find proper combinations of neighboring binding sites for these factors.</p>
         <p>A further step in improving our method will be the use of PWMs for factors whose expression is tissue specific, as indicated in TRANSFAC<sup>&#174;</sup>. This will greatly improve the predictive power of the method. To achieve this, more extensive annotation of expression information of TFs is needed and will be a task of the future. One possibility for obtaining this information resides in ChIP-chip experiments in conjunction with gene expression data. This will help to identify TF activity in a given cellular environment.</p>
         <p>Taking gene expression data into account will significantly help to determine the global and local sequence contexts and, therefore, functional TFBSs. Recently, we applied the algorithm described in this paper to analysis of promoters of differentially expressed genes <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B15">15</abbr><abbr bid="B51">51</abbr></abbrgrp>. Such an integrative approach is now available in the software system ExPlain&#8482; for a mechanistic interpretation of gene expression changes in eukaryotic cells under various physiological and pathological conditions <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We report a new approach based on machine learning techniques for the <it>de novo </it>identification of novel HNF4&#945; binding sites. The genetic algorithms developed by us significantly improved data analysis of various experimental sources. The method described here can be applied to any TF and enables computational prediction of genome-wide functional TFBSs. By applying our method, interactions between different TFs can be taken into account. This provides clues to the mechanisms responsible for promoter activation and even for antagonistic binding of TFs, for example, HNF4&#945; and Coup-TF, which successfully compete for the same binding site but differ in activity under various biological conditions. Indeed, while both factors can bind to the same sequence, the individual local and global sequence contexts determine the actual binding activity and may, therefore, provide an estimate of TF activity in particular cellular or physiological conditions.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <sec>
            <st>
               <p>Databases</p>
            </st>
            <p>Databases provided by BIOBASE GmbH were used, for example, TRANSFAC<sup>&#174;</sup>, which is a database on gene regulation <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>. It collects data on TFs and their binding sites in promoters and enhancers of eukaryotic genes as well as a library of PWMs. This work was done with TRANSFAC<sup>&#174; </sup>release 9.4. Additionally, to retrieve promoters of human genes, we used TRANSPro&#8482; release 2.1 <abbrgrp><abbr bid="B54">54</abbr></abbrgrp>, which is based on genomic sequence from Ensembl release v35, November 2005. Final verification of the composite modules was done with the help of the TRANSCompel&#8482; database <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>HNF4&#945; binding sites in the human genome</p>
            </st>
            <p>In this work, we significantly updated the collection of known genomic HNF4&#945; sites in TRANSFAC<sup>&#174;</sup>. Additional data file 1 lists the collected sites with information about the target genes, positions in the promoters of the genes, and the site sequence.</p>
            <p>First of all, we compiled all known HNF4 binding sites from the literature and extended them upstream (28 bp) and downstream (34 bp); this is set as Y<sub>Local</sub>. Next, we prepared the background sequences; this is set as N<sub>Local</sub>. After that, we split the Y<sub>Local </sub>set into two parts: the training set and the test set (sites included in the training set are indicated in Additional data file 1 by asterisks). We also split the N<sub>Local </sub>set into two parts: the training background set and the test background set. The training of the method was done by comparison of the training set versus the training background set. The testing of the method and building of the histogram in Figure <figr fid="F2">2</figr> was done on the test set versus the test background set - on two sets that were not used in the training. This procedure of preparing four sets is the best possible statistical procedure for training and testing of the recognition methods.</p>
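            <p>The four-set design can be sketched in a few lines of Python. This is a minimal illustration only: the random shuffling, the 50% training fraction, the seeds and the placeholder identifiers are assumptions made for the example, not the procedure by which the sites marked in Additional data file 1 were selected.</p>
            <p>
```python
import random


def split_set(items, train_fraction=0.5, seed=0):
    """Shuffle a list reproducibly and split it into (training, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder identifiers standing in for real sequences.
y_local = ["site_%d" % i for i in range(10)]
n_local = ["background_%d" % i for i in range(20)]

y_train, y_test = split_set(y_local, seed=1)
n_train, n_test = split_set(n_local, seed=2)

# Training uses (y_train, n_train); evaluation uses only (y_test, n_test).
print(len(y_train), len(y_test), len(n_train), len(n_test))
```
            </p>
            <p>Keeping both test sets completely out of training is what makes the reported recognition performance an estimate on unseen data.</p>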
         </sec>
         <sec>
            <st>
               <p>Positional weight matrix for HNF4&#945; binding sites</p>
            </st>
            <p>Based on the collection of HNF4&#945; sites we constructed a PWM (accession number M01031) and two 'half-site' matrices (accession numbers M01032 and M01033) and deposited them in the TRANSFAC<sup>&#174; </sup>database (Table <tblr tid="T5">5</tblr>). The construction of PWMs was done according to the general outline described in <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and as detailed in the protocol of TRANSFAC<sup>&#174; </sup>matrix construction (see the TRANSFAC<sup>&#174; </sup>documentation). The half-site matrices were created by manual splitting of each site into two parts and were used independently for the alignment. Together with pre-existing HNF4&#945; matrices in TRANSFAC<sup>&#174; </sup>(accession numbers M00762, M00764, and M00967), the new matrices were used to search for HNF4&#945; binding sites in genomic sequences. For this basic search we employed the MATCH&#8482; algorithm, calculating scores for the matches by applying the so-called information vector <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. This algorithm is implemented in the ExPlain&#8482; software system. This software was also used for analysis of the flanking regions of HNF4&#945; sites to search for other TFBSs from the most up-to-date library of matrices derived from the TRANSFAC<sup>&#174; </sup>Professional database. The cut-offs for the matrices were set to minFN to maximize the sensitivity of the site prediction (false negative rate of 10%).</p>
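            <p>To make the idea of an information-weighted matrix score concrete, the following Python sketch assumes that each matrix position is weighted by its information content (the sum over bases of f ln(4f)) and that the weighted sum is rescaled between the minimum and maximum values attainable for the matrix. These details, the toy count matrix and the function names are assumptions made for illustration; the sketch is not the MATCH&#8482; or ExPlain&#8482; implementation and does not use the values of Table 5.</p>
            <p>
```python
import math

# Hypothetical toy count matrix (one dict per position); not Table 5.
COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 3, "C": 2, "G": 2, "T": 3},
]


def information(column):
    """Information content of one column: sum of f * ln(4 * f) over bases."""
    total = sum(column.values())
    info = 0.0
    for count in column.values():
        f = count / total
        if f > 0:
            info += f * math.log(4 * f)
    return info


def matrix_score(site, counts=COUNTS):
    """Information-weighted match score, rescaled to the range 0..1."""
    current = lowest = highest = 0.0
    for base, column in zip(site.upper(), counts):
        weight = information(column)
        current += weight * column[base]
        lowest += weight * min(column.values())
        highest += weight * max(column.values())
    return (current - lowest) / (highest - lowest)


print(round(matrix_score("AGA"), 3))  # best-matching site
print(round(matrix_score("TGC"), 3))
```
            </p>
            <p>In such a scheme, a cut-off chosen to minimize false negatives simply corresponds to a low threshold on the rescaled score, so that most known sites pass at the cost of more background matches.</p>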
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Positional weight matrix for HNF4&#945; sites (TRANSFAC<sup>&#174; </sup>accession number M01031, identifier V$HNF4_Q6_01)</p>
               </caption>
               <tblbdy cols="15">
                  <r>
                     <c ca="left">
                        <p>A</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>19</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>5</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>11</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>5</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>52</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>46</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>49</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>46</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>C</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>8</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>16</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>48</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>19</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>47</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>15</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>G</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>21</b>
          