<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2003-4-4-r27</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>MatchMiner: a tool for batch navigation among gene and gene product identifiers</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Bussey</snm>
               <mi>J</mi>
               <fnm>Kimberly</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A2">
               <snm>Kane</snm>
               <fnm>David</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A3">
               <snm>Sunshine</snm>
               <fnm>Margot</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A4">
               <snm>Narasimhan</snm>
               <fnm>Sudar</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A5">
               <snm>Nishizuka</snm>
               <fnm>Satoshi</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A6">
               <snm>Reinhold</snm>
               <mi>C</mi>
               <fnm>William</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A7">
               <snm>Zeeberg</snm>
               <fnm>Barry</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A8">
               <snm>Ajay</snm>
               <fnm/>
               <insr iid="I3"/>
            </au>
            <au id="A9" ca="yes">
               <snm>Weinstein</snm>
               <mi>N</mi>
               <fnm>John</fnm>
               <insr iid="I1"/>
               <email>weinstein@dtpax2.ncifcrf.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, NIH Building 37, Bethesda, MD 20892-4255, USA</p>
            </ins>
            <ins id="I2">
               <p>SRA International Inc., 4300 Fair Lakes CT, Fairfax, VA 22033, USA</p>
            </ins>
            <ins id="I3">
               <p>Current address: Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>4</issue>
         <fpage>R27</fpage>
         <url>http://genomebiology.com/2003/4/4/R27</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2003-4-4-r27</pubid>
               <pubid idtype="pmpid">12702208</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>10</day>
               <month>10</month>
               <year>2002</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>20</day>
               <month>12</month>
               <year>2002</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>28</day>
               <month>2</month>
               <year>2003</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>25</day>
               <month>3</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>Bussey et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <shorttitle>
         <p>MatchMiner: a tool for batch navigation among gene and gene product identifiers</p>
      </shorttitle>
      <shortabs>
         <p>MatchMiner is a freely available program package for batch navigation among gene and gene product identifier types commonly encountered in microarray studies and other forms of 'omic' research.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>MatchMiner is a freely available program package for batch navigation among gene and gene product identifier types commonly encountered in microarray studies and other forms of 'omic' research. The user inputs a list of gene identifiers and then uses the Merge function to find the overlap with a second list of identifiers of either the same or a different type or uses the LookUp function to find corresponding identifiers.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Rationale</p>
         </st>
         <p>One of the more painful tasks in 'omic' research <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp> is navigating among different gene or gene product identifiers. After a cDNA microarray experiment, for example, one usually must translate from IMAGE clone ids to GenBank accession numbers, HUGO names, common names, or chromosome locations for a list of genes. As we generate more and more data from diverse platforms and species, such translations will become increasingly complex but also more important to the synthesis of a coherent biological picture. Beyond simply looking up additional information about a list of genes, such synthesis will require the ability to find the intersection between two lists of genes that are designated by the same or a different identifier type.</p>
         <p>Currently, the basic translations can be done on a gene-by-gene basis using public databases such as UniGene, LocusLink, OMIM (Online Mendelian Inheritance in Man), and the working draft of the human genome (from the University of California Santa Cruz (UCSC) or the National Center for Biotechnology Information) <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp> or else in batch through Source <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> or GeneLynx <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. However, no single data source contains all the necessary information about every gene and, to complicate matters further, the relationships among identifiers are often not one-to-one. For example, there may be several GenBank accession numbers and multiple IMAGE clone ids for the same gene, and a single gene symbol may be an alias for multiple different genes. Therefore, any high-throughput solution to the problem must take these challenges into account and respond with an approach that minimizes the need for human intervention. At the same time, those instances when human intervention is necessary must be flagged and enough metadata must be provided for accurate decision-making without extensive further research.</p>
         <p>Motivated by many days spent at the computer doing these tedious, time-consuming translations for our own experimental data, we developed MatchMiner <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> as a freely available public resource that automates the process for collections of genes. MatchMiner provides two primary functions. The first, LookUp, translates an input list of gene identifiers into a matching output list of identifiers of a different type; the second, Merge, combines two separate lists of either the same or different types of identifiers into one list that details all one-to-one, one-to-many, and many-to-many relationships between corresponding gene identifiers in the two lists.</p>
      </sec>
      <sec>
         <st>
            <p>Identifier navigation with MatchMiner</p>
         </st>
         <p>As shown schematically in Figures <figr fid="F1">1</figr> and <figr fid="F2">2</figr>, MatchMiner leverages information from the four public databases listed above, and from Affymetrix, by parsing them into relational tables for use in translations. The LookUp function can operate interactively on single identifiers or in batch mode on a list of identifiers in a file. When used interactively for one or a few genes, it saves the user the trouble of querying five different databases and collating the data. More important, however, is batch querying of a list file, for instance a list of the dozens, hundreds, or thousands of genes that show interesting differences between samples in a microarray experiment. In this mode, the user specifies the input and output identifier types, as well as the search algorithms to be used in traversing the various data sources (Table <tblr tid="T1">1</tblr>). The program is context-sensitive in that it will search only the pertinent data sources (for example, only UniGene to identify IMAGE clone ids, which are not found in the other sources). An important feature is the optional output of diagnostic metadata that tell the user in which source(s) the identifier was found and whether an input identifier corresponds to more than one gene. This feature enables the user to judge the reliability of matches. The results can be displayed in HTML format or downloaded as tab-delimited text suitable for direct entry into a spreadsheet program. A summary indicates the number of successful and unsuccessful translations.</p>
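         <p>The batch LookUp behaviour described above can be illustrated with a minimal sketch. It is not MatchMiner's code: the table names, function names and identifiers are hypothetical, and small in-memory tables stand in for the relational database. It shows how each input identifier might be translated through an internal gene index while recording the source(s) in which it was found and whether it corresponds to more than one gene.</p>
         <p><![CDATA[
```python
from collections import defaultdict

# Hypothetical toy tables (not MatchMiner's schema): each row links an input
# identifier, found in a named source, to an internal gene index.
INPUT_TABLE = [
    ("UniGene", "IMAGE:0001", 1),
    ("UniGene", "IMAGE:0002", 2),
    ("UniGene", "IMAGE:0002", 3),  # one clone id assigned to two genes
]
# Gene index -> output identifiers (made-up symbols).
OUTPUT_TABLE = {1: ["GENE_A"], 2: ["GENE_B"], 3: ["GENE_C"]}


def look_up(identifiers):
    """Translate each identifier and attach diagnostic metadata."""
    index = defaultdict(list)
    for source, identifier, gene_idx in INPUT_TABLE:
        index[identifier].append((source, gene_idx))
    rows = []
    for identifier in identifiers:
        hits = index.get(identifier, [])
        genes = sorted({gene_idx for _, gene_idx in hits})
        rows.append({
            "input": identifier,
            "output": [name for g in genes for name in OUTPUT_TABLE.get(g, [])],
            "sources": sorted({source for source, _ in hits}),
            "ambiguous": len(genes) > 1,  # more than one gene for this input
        })
    return rows


def summarize(rows):
    """Count successful and unsuccessful translations."""
    found = sum(1 for row in rows if row["output"])
    return {"successful": found, "unsuccessful": len(rows) - found}
```
]]></p>
         <p>In this sketch, an identifier that maps to two gene indices is returned with both outputs and an ambiguity flag, which the user can then inspect rather than having one assignment chosen silently.</p>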
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Information Flow in MatchMiner</p>
            </caption>
            <text>
               <p>Information Flow in MatchMiner. Input identifier lists are first translated into unique internal gene indices to form a translation table. The translation table is then either converted into another set of identifiers using the LookUp function or compared with another such table using the Merge function to generate a report showing the intersection of two separate identifier lists. The resulting output can be displayed as HTML or else saved as text for import into other programs.</p>
            </text>
            <graphic file="gb-2003-4-4-r27-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Database relational table schema for MatchMiner</p>
            </caption>
            <text>
               <p>Database relational table schema for MatchMiner. <b>(a) </b>Logical database representation. Data are incorporated from the UCSC Human Genome Build, LocusLink, UniGene, OMIM, and the Affymetrix annotation sets for HU95 and HU133 chips. Each candidate gene is assigned a gene index in the GeneIdx table. These gene indexes are used as keys for all of the MatchMiner operations. The number of many-to-many relationships in the model illustrates the complexity of the data. <b>(b) </b>Physical representation of the database. The implementation currently includes 14 tables with about 12 million rows.</p>
            </text>
            <graphic file="gb-2003-4-4-r27-2"/>
         </fig>
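         <p>The role of the gene index as the key for all operations can be sketched as follows. This is an illustration only: the class name and the matching rule (an imported entry matches an existing candidate gene when it shares at least one identifier with it) are assumptions, not a description of the GeneIdx table's actual construction.</p>
         <p><![CDATA[
```python
class GeneIndex:
    """Toy stand-in for a gene index table: one integer index per candidate gene."""

    def __init__(self):
        self._by_identifier = {}  # identifier -> gene index
        self._next = 1

    def add_entry(self, identifiers):
        """Reuse the index of the first already-known identifier, else create one.

        Assumption: sharing any identifier counts as matching a candidate gene.
        """
        gene_idx = None
        for identifier in identifiers:
            if identifier in self._by_identifier:
                gene_idx = self._by_identifier[identifier]
                break
        if gene_idx is None:
            gene_idx = self._next
            self._next += 1
        for identifier in identifiers:
            # setdefault keeps any earlier assignment, so sources imported
            # first take precedence over later ones.
            self._by_identifier.setdefault(identifier, gene_idx)
        return gene_idx
```
]]></p>
         <p>Feeding entries in a fixed source order means that the first source imported defines the candidate genes and later sources either attach to them or create new indices.</p>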
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>MatchMiner LookUp search options</p>
            </caption>
            <tblbdy cols="3">
               <r>
                  <c ca="left">
                     <p>Identifier type</p>
                  </c>
                  <c ca="left">
                     <p>Input algorithm</p>
                  </c>
                  <c ca="left">
                     <p>Output algorithm</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Name (Gene symbol, alias or descriptive name)</p>
                  </c>
                  <c ca="left">
                     <p>HUGO then Alias</p>
                  </c>
                  <c ca="left">
                     <p>HUGO then Alias</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Starts with Official. If not found, proceeds through all other sources.</p>
                  </c>
                  <c ca="left">
                     <p>Returns the HUGO name. If no name is flagged as HUGO, returns all aliases.</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>ALL (HUGO and Alias)</p>
                  </c>
                  <c ca="left">
                     <p>ALL (HUGO and Alias)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches all data sources for all matches to the symbol and flags those that are HUGO.</p>
                  </c>
                  <c ca="left">
                     <p>Returns all gene symbols and flags the HUGO symbol.</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Official</p>
                  </c>
                  <c ca="left">
                     <p>Official</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches all data sources for match as the HUGO name.</p>
                  </c>
                  <c ca="left">
                     <p>Returns the HUGO name. If not found, nothing is returned.</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Long</p>
                  </c>
                  <c ca="left">
                     <p>Long</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches all data sources for descriptive name.</p>
                  </c>
                  <c ca="left">
                     <p>Returns all descriptive names.</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GenBank accession number</p>
                  </c>
                  <c ca="left">
                     <p>ALL</p>
                  </c>
                  <c ca="left">
                     <p>ALL</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches all data sources starting with UCSC known genes, then LocusLink, UniGene and UCSC ESTs until match found.</p>
                  </c>
                  <c ca="left">
                     <p>Returns accession number from UCSC known genes. If not found, proceeds through UniGene then UCSC EST.</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Data-source specific</p>
                  </c>
                  <c ca="left">
                     <p>Data-source specific</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Look up input in a specific data source.</p>
                  </c>
                  <c ca="left">
                     <p>Returns accession numbers found in a particular data source.</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>IMAGE clone</p>
                  </c>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Only data source with IMAGE clone ids.</p>
                  </c>
                  <c ca="left">
                     <p>Returns all IMAGE clone ids associated with the UniGene cluster.</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Cytogenetic location</p>
                  </c>
                  <c ca="left">
                     <p>ALL</p>
                  </c>
                  <c ca="left">
                     <p>ALL</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches all gene indexes for matching chromosome band.</p>
                  </c>
                  <c ca="left">
                     <p>Returns chromosome band from UCSC sequence to band translation. If not found, proceeds through all other sources with multiple bands listed separately.</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>UCSC</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Returns chromosome band from UCSC sequence to band translation.</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Database id</p>
                  </c>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches gene index for matching UniGene id.</p>
                  </c>
                  <c ca="left">
                     <p>Returns UniGene id.</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Affymetrix</p>
                  </c>
                  <c ca="left">
                     <p>Affymetrix</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches gene index for matching Affymetrix probe set identifier.</p>
                  </c>
                  <c ca="left">
                     <p>Returns Affymetrix probe set identifier.</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Sequence location number (bp)</p>
                  </c>
                  <c ca="left">
                     <p>Not implemented</p>
                  </c>
                  <c ca="left">
                     <p>Transcription Start</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Returns transcription start from UCSC Known Genes. If not found, proceeds to UCSC EST.</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FISH clone</p>
                  </c>
                  <c ca="left">
                     <p>UCSC</p>
                  </c>
                  <c ca="left">
                     <p>UCSC</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Searches UCSC FISH clones for match to gene index based on sequence position overlap with UCSC known genes.</p>
                  </c>
                  <c ca="left">
                     <p>Returns FISH clone id.</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>The Merge function, the most powerful function of MatchMiner, identifies which genes are common to two input lists of identifiers and gives detailed output of the one-to-one, one-to-many, or many-to-many relationships between corresponding identifiers in the two lists. This function is used, for example, to compare datasets of different experiment types (for example, transcript expression, protein expression, array-based comparative genomic hybridization (CGH)) by identifying the genes in common between them. The output includes summary tallies as well as a gene-by-gene listing of items matched, unmatched and not found. As with the LookUp function, diagnostic resource information is provided. Any identifier with an ambiguous gene assignment (for example, an IMAGE clone id that belongs to two different UniGene clusters) is flagged for user intervention, with all possible assignments returned.</p>
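         <p>A minimal sketch of the Merge idea follows, assuming that both lists have already been resolved to internal gene indices. The function names and the dictionary that stands in for the database lookups are hypothetical. Identifiers from the two lists are grouped by shared gene index, each matched gene is classified by relationship type, and identifiers that resolve to more than one gene are reported for user review.</p>
         <p><![CDATA[
```python
from collections import defaultdict


def merge(list_a, list_b, to_gene):
    """Group identifiers from two lists by shared gene index.

    to_gene maps an identifier to the set of gene indices it resolves to;
    it is a plain dict standing in for the database lookups (an assumption).
    """
    by_gene = defaultdict(lambda: ([], []))
    not_found = []
    for side, identifiers in enumerate((list_a, list_b)):
        for identifier in identifiers:
            genes = to_gene.get(identifier, set())
            if not genes:
                not_found.append(identifier)
            for gene_idx in genes:
                by_gene[gene_idx][side].append(identifier)
    # A gene is matched only if both lists contribute at least one identifier.
    matched = {g: p for g, p in by_gene.items() if p[0] and p[1]}
    unmatched = {g: p for g, p in by_gene.items() if not (p[0] and p[1])}
    ambiguous = sorted(
        i for i in set(list_a) | set(list_b) if len(to_gene.get(i, ())) > 1
    )
    return matched, unmatched, not_found, ambiguous


def relationship(pair):
    """Classify a matched gene as one-to-one, one-to-many or many-to-many."""
    left, right = len(pair[0]), len(pair[1])
    if left == 1 and right == 1:
        return "one-to-one"
    if left == 1 or right == 1:
        return "one-to-many"
    return "many-to-many"
```
]]></p>
         <p>Keeping ambiguous identifiers in every group they belong to, and listing them separately, mirrors the behaviour described above in which all possible assignments are returned and flagged.</p>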
      </sec>
      <sec>
         <st>
            <p>Performance</p>
         </st>
         <p>In one illustrative case that motivated development of MatchMiner, we (X. Lee, K.J.B., F.G. Gwadry, W.C.R., G. Riddick, S.L. Pelletier, S.N., and J.N.W., unpublished data) had to match up as many as possible of 9,706 cDNA microarray clones <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp> with HU6800 Affymetrix chip oligonucleotide sets <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, having run both platforms on the same 60 human cancer cell samples (the NCI-60). To do so, we developed an early form of MatchMiner. The particular task was to identify all relationships between the 9,706 IMAGE clone ids and 7,129 GenBank accession numbers based on UniGene cluster membership. To complete the task manually, one gene at a time at maximum speed (about 30 seconds per gene), would take over 140 hours, even if one could keep accurate track of the results. In contrast, the current version of MatchMiner took 10 minutes on a 750 MHz Pentium III PC with 320 MB RAM to generate the merged list, specifying all possible matches between IMAGE clone ids and GenBank accession numbers. When we compared MatchMiner Merge results with those obtained using the LookUp function for a random sample of the genes, there were no discrepancies. The same task with Source required translating both lists into UniGene clusters and then further processing the data. After identification and reformatting of entries with multiple UniGene cluster associations, the resulting lists were imported into Microsoft Access and queried to create the appropriate matches. The entire procedure gave results similar to those of MatchMiner but took approximately one hour, most of that user time.</p>
         <p>With the exception of MatchMiner, tools that can do some kind of translation are geared toward research dealing with expressed sequence, either at the RNA or protein level. However, many interesting questions can be asked from the perspective of genomic sequence. One example relates to the identification of genes represented in an array CGH experiment in which the targets on the chip are fluorescence <it>in situ </it>hybridization (FISH)- and sequence-tagged site (STS)-mapped bacterial artificial chromosome (BAC) clones. The challenge is to begin to interpret array CGH results in the context of the biological literature and of other classes of data. BAC clones are not generally annotated by the genes they span, but rather by their position in the cytogenetic and sequence-based maps. Therefore, an association between the BAC clones and genes must be made. MatchMiner provides this function with the ability to search on the FISH clone ids. Mapping of the FISH clones to genes is done by sequence alignment of the BAC ends during off-line construction of the overall MatchMiner database (Figure <figr fid="F3">3</figr>). MatchMiner takes 5 minutes to return the gene symbol for a list of 100 FISH-mapped BACs <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Such a search is not possible using other tools. A summary of commonly used analogous tools and their capabilities can be found in Table <tblr tid="T2">2</tblr>.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Associating FISH-mapped BACs with genes</p>
            </caption>
            <text>
               <p>Associating FISH-mapped BACs with genes. Schematic view of FISH-mapped BACs from 1p36.33 near the PITSLRE kinase genes (UCSC Genome Browser, June 2002 freeze). Note that a single BAC can encompass one or more genes. In MatchMiner, the FISH-mapped BAC table from UCSC is imported, and chromosomal positions are read from the table for comparison with the transcriptional start positions of UCSC 'Known Genes'. If a transcriptional start is contained within the bounds of a BAC, that BAC is associated with the corresponding gene index. Thus, a BAC containing several genes will be associated with each of those genes.</p>
            </text>
            <graphic file="gb-2003-4-4-r27-3"/>
         </fig>
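         <p>The containment rule described in the legend of Figure 3 can be sketched in a few lines. The function name, the tuple layouts and the inclusive treatment of BAC boundaries are assumptions made for illustration; the coordinates in any usage would come from the UCSC tables rather than from the code.</p>
         <p><![CDATA[
```python
def associate_bacs_with_genes(bacs, genes):
    """Map each BAC to the gene indices whose transcription start it contains.

    bacs:  list of (bac_id, chromosome, start, end)
    genes: list of (gene_idx, chromosome, transcription_start)
    Boundaries are treated as inclusive (an assumption of this sketch).
    """
    associations = {}
    for bac_id, bac_chrom, start, end in bacs:
        associations[bac_id] = [
            gene_idx
            for gene_idx, gene_chrom, tx_start in genes
            if gene_chrom == bac_chrom and start <= tx_start <= end
        ]
    return associations
```
]]></p>
         <p>A BAC that spans several transcription starts is associated with each of the corresponding gene indices, and a BAC that spans none is kept with an empty list so that it can be reported as unmatched.</p>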
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Comparison of the capabilities of gene identifier translation tools</p>
            </caption>
            <tblbdy cols="8">
               <r>
                  <c ca="left">
                     <p>Program</p>
                  </c>
                  <c ca="left">
                     <p>Implementation</p>
                  </c>
                  <c ca="left">
                     <p>Search Types</p>
                  </c>
                  <c ca="left">
                     <p>Batch</p>
                  </c>
                  <c ca="left">
                     <p>Translation path traceable in interactive (single-gene) mode?</p>
                  </c>
                  <c ca="left">
                     <p>Translation path traceable in batch (gene-list) mode?</p>
                  </c>
                  <c ca="left">
                     <p>Multiple input associations flagged?</p>
                  </c>
                  <c ca="left">
                     <p>Output in form suitable for automated processing?</p>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>MatchMiner</p>
                  </c>
                  <c ca="left">
                     <p>Command line, Web application</p>
                  </c>
                  <c ca="left">
                     <p>LookUp, Merge</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Source</p>
                  </c>
                  <c ca="left">
                     <p>Web application</p>
                  </c>
                  <c ca="left">
                     <p>LookUp</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
                  <c ca="left">
                     <p>Yes, if "Show all Cluster Ids if Multiple Clusters" option selected</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GeneLynx</p>
                  </c>
                  <c ca="left">
                     <p>Web application</p>
                  </c>
                  <c ca="left">
                     <p>LookUp</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>As noted previously, identifiers are not always unique or uniquely assigned. For example, GenBank accession numbers are specific to a sequence, but the assignment of that sequence to a gene may change over time. Even more disconcerting, common gene names or aliases are often used by different investigators for different genes. Therefore, it is important to look in detail at the results of searches to check for correspondences other than one-to-one and to examine the data source tags to get a sense of the strength of the association between identifiers.</p>
         <p>One non-obvious advantage of MatchMiner is that it can combine information from more than one of the data sources to show matches that could not be made on the basis of any single source. The gene <it>ACVR2B</it>, which has aliases <it>ACTR-IIB </it>and <it>ACTRIIB</it>, provides an example. LocusLink and OMIM both reference the HUGO symbol <it>ACVR2B</it>, but LocusLink does not reference <it>ACTRIIB</it>, and OMIM does not reference <it>ACTR-IIB</it>. Therefore, if one of the aliases were used as input, the success of any search outside of MatchMiner would be data-source dependent.</p>
      </sec>
      <sec>
         <st>
            <p>Algorithm and software development</p>
         </st>
         <p>MatchMiner was written in Java and can be deployed as either a web or a command-line application, the latter being suitable for high-throughput pipelines. In its design and implementation, we used a variety of open-source tools and libraries, including JUnit (unit-testing framework), CVS (configuration management), Xerces (XML parser) and Ant (build tool). Before run-time, data from UniGene, LocusLink, OMIM, UCSC and Affymetrix are downloaded and parsed to generate an integrated database implemented under MySQL. Import begins with data from UCSC's 'Known Gene' table, followed by UCSC's EST (expressed sequence tag) table, LocusLink, UniGene, OMIM and Affymetrix. If an imported entry matches a candidate gene that has already been identified, it is assigned the same gene index; if it does not match any of the candidate genes, a new gene index is generated. Different identifiers are stored in different tables, and several tables are required to resolve the many-to-many relationships between identifiers (Figure <figr fid="F2">2a,b</figr>).</p>
         <p>The central algorithm for resolving identifiers uses an instantiation of the ChainOfResponsibility pattern <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, which combines different searches sequentially in a logical manner. In MatchMiner, it is used to maximize the likelihood of correctly translating back and forth between identifiers and gene indices using the different databases. For each identifier type, we establish a ChainOfResponsibility hierarchy of the data sources based on their respective abilities to match the user input (Table <tblr tid="T3">3</tblr>). The search algorithms then use this ranking. For example, when an input list of gene names is processed using the 'ALL (HUGO, Alias)' search algorithm, the list is first scanned for HUGO names, and each one found is associated with the corresponding unique internal gene index. The remaining unmatched gene names are then scanned again, this time matching aliases (Table <tblr tid="T1">1</tblr>). The rationale is that an official HUGO name is more likely to be the desired match, but any match is better than none. A similar approach is used when going from the unique index to an output list. For instance, if the desired output is cytogenetic location, MatchMiner first scans the UCSC build of the human genome. If the location is not found there, LocusLink and UniGene are searched (Table <tblr tid="T1">1</tblr>). The ChainOfResponsibility approach enables us to combine the precision of highly focused algorithms with the greater coverage of more broadly based ones.</p>
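         <p>The HUGO-then-alias search described above can be sketched as a small chain of handlers in Java. This is an illustrative sketch under stated assumptions, not the MatchMiner source: the classes NameHandler, TableHandler and ChainDemo, the in-memory tables, the gene index values and the made-up name 'FOO1' are all hypothetical. The sketch resolves each name through the chain in turn, which gives the same result as scanning the whole list for HUGO names first and then rescanning the unmatched names for aliases.</p>
         <p><![CDATA[
```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One link in the chain: tries to resolve a name, otherwise defers to its successor. */
abstract class NameHandler {
    private NameHandler next;

    /** Sets the successor and returns it so that longer chains can be built fluently. */
    NameHandler setNext(NameHandler next) {
        this.next = next;
        return next;
    }

    /** Returns a gene index, or null if neither this handler nor any successor matches. */
    Integer resolve(String name) {
        Integer index = lookup(name);
        if (index != null) {
            return index;
        }
        return next == null ? null : next.resolve(name);
    }

    protected abstract Integer lookup(String name);
}

/** Handler backed by a name -> gene index table (a stand-in for a database query). */
class TableHandler extends NameHandler {
    private final Map<String, Integer> table;

    TableHandler(Map<String, Integer> table) {
        this.table = table;
    }

    @Override
    protected Integer lookup(String name) {
        return table.get(name);
    }
}

class ChainDemo {
    /** Builds a two-link chain: official HUGO symbols first, aliases second. */
    static NameHandler buildChain() {
        Map<String, Integer> hugo = new HashMap<>();
        hugo.put("ACVR2B", 1);
        hugo.put("FOO1", 2);        // made-up symbol, for illustration only

        Map<String, Integer> alias = new HashMap<>();
        alias.put("ACTRIIB", 1);
        alias.put("ACTR-IIB", 1);
        alias.put("FOO1", 1);       // the same made-up name is also an alias of gene 1

        NameHandler head = new TableHandler(hugo);
        head.setNext(new TableHandler(alias));
        return head;
    }

    /** Resolves every input name, keeping input order; unmatched names map to null. */
    static Map<String, Integer> resolveAll(NameHandler chain, List<String> names) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String name : names) {
            result.put(name, chain.resolve(name));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ACVR2B", "ACTRIIB", "FOO1", "UNKNOWN");
        System.out.println(resolveAll(buildChain(), input));
    }
}
```
]]></p>
         <p>Because the HUGO handler is consulted first, the ambiguous name 'FOO1' resolves to the gene for which it is the official symbol, and the alias table is reached only when no official symbol matches. A further source could be added by appending another handler to the chain without changing the existing ones.</p>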
         <tbl id="T3">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>ChainOfResponsibility hierarchies for data sources in MatchMiner</p>
            </caption>
            <tblbdy cols="2">
               <r>
                  <c ca="left">
                     <p>Identifier type</p>
                  </c>
                  <c ca="left">
                     <p>Hierarchy of source reliability</p>
                  </c>
               </r>
               <r>
                  <c cspan="2">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Cytogenetic location</p>
                  </c>
                  <c ca="left">
                     <p>UCSC Known Genes, LocusLink, UniGene, UCSC EST, OMIM</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GenBank accession number</p>
                  </c>
                  <c ca="left">
                     <p>UCSC Known Genes, LocusLink, UniGene, UCSC EST, OMIM</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>HUGO gene symbol</p>
                  </c>
                  <c ca="left">
                     <p>LocusLink, OMIM</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>IMAGE clone id</p>
                  </c>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Long gene name</p>
                  </c>
                  <c ca="left">
                     <p>UCSC Known Genes, LocusLink, UniGene, UCSC EST, OMIM</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Affymetrix probe id</p>
                  </c>
                  <c ca="left">
                     <p>Affymetrix</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>UniGene cluster id</p>
                  </c>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>Although currently human-specific, MatchMiner will be expanded in the near future to incorporate data from other species, with emphasis on mouse. Additional features to be implemented include the ability to handle lists of mixed types of identifiers, the ability to request multiple types of identifiers within a single search, and the incorporation of additional public sources for use in making translations. We will continue to enhance and develop MatchMiner under a contract funded by the Center for Cancer Research of the US National Cancer Institute.</p>
      </sec>
      <sec>
         <st>
            <p>Download</p>
         </st>
         <p>MatchMiner is available as a web application or as a command-line jar file at <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. The MatchMiner database is maintained on our server and updated at approximately 6-month intervals. Detailed documentation for both implementations is available at the site.</p>
         <p>In summary, MatchMiner is an efficient application for navigating the complex world of gene and gene product identifiers. It can batch search publicly available databases to convert between identifier types and can determine the intersection of two gene lists with different identifiers. MatchMiner will greatly enhance the ability of the research community to annotate and compare 'omic' datasets.</p>
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Fishing expeditions.</p>
            </title>
            <aug>
               <au>
                  <snm>Weinstein</snm>
                  <fnm>JN</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>282</volume>
            <fpage>628</fpage>
            <lpage>629</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.282.5389.627g</pubid>
                  <pubid idtype="pmpid" link="fulltext">9841413</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>'Omic' and hypothesis-driven research in the molecular pharmacology of cancer.</p>
            </title>
            <aug>
               <au>
                  <snm>Weinstein</snm>
                  <fnm>JN</fnm>
               </au>
            </aug>
            <source>Curr Opin Pharmacol</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <fpage>361</fpage>
            <lpage>365</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1471-4892(02)00185-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">12127867</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>GenBank.</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Karsch-Mizrachi</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Ostell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rapp</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>17</fpage>
            <lpage>20</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99127</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752243</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.17</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Database resources of the National Center for Biotechnology Information: 2002 update.</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Leipe</snm>
                  <fnm>DD</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Pontius</snm>
                  <fnm>JU</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Schriml</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>L</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>13</fpage>
            <lpage>16</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99094</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752242</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.13</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Initial sequencing and analysis of the human genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Linton</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Birren</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Nusbaum</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Zody</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Baldwin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Devon</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Dewar</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Doyle</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>FitzHugh</snm>
                  <fnm>W</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2001</pubdate>
            <volume>409</volume>
            <fpage>860</fpage>
            <lpage>921</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35057062</pubid>
                  <pubid idtype="pmpid" link="fulltext">11237011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Source</p>
            </title>
            <url>http://source.stanford.edu</url>
         </bibl>
         <bibl id="B7">
            <title>
               <p>GeneLynx: a portal to the human genome</p>
            </title>
            <url>http://www.genelynx.org</url>
         </bibl>
         <bibl id="B8">
            <title>
               <p>MatchMiner</p>
            </title>
            <url>http://discover.nci.nih.gov/matchminer</url>
         </bibl>
         <bibl id="B9">
            <title>
               <p>A gene expression database for the molecular pharmacology of cancer.</p>
            </title>
            <aug>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Waltham</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>LH</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>Tanabe</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kohn</snm>
                  <fnm>KW</fnm>
               </au>
               <au>
                  <snm>Reinhold</snm>
                  <fnm>WC</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>TG</fnm>
               </au>
               <au>
                  <snm>Andrews</snm>
                  <fnm>DT</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2000</pubdate>
            <volume>24</volume>
            <fpage>236</fpage>
            <lpage>244</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/73439</pubid>
                  <pubid idtype="pmpid" link="fulltext">10700175</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Systematic variation in gene expression patterns in human cancer cell lines.</p>
            </title>
            <aug>
               <au>
                  <snm>Ross</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Rees</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Jeffrey</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>van de Rijn</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Waltham</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2000</pubdate>
            <volume>24</volume>
            <fpage>227</fpage>
            <lpage>235</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/73432</pubid>
                  <pubid idtype="pmpid" link="fulltext">10700174</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Chemosensitivity prediction by transcriptional profiling.</p>
            </title>
            <aug>
               <au>
                  <snm>Staunton</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Slonim</snm>
                  <fnm>DK</fnm>
               </au>
               <au>
                  <snm>Coller</snm>
                  <fnm>HA</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Angelo</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>Reinhold</snm>
                  <fnm>WO</fnm>
               </au>
               <au>
                  <snm>Weinstein</snm>
                  <fnm>JN</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>10787</fpage>
            <lpage>10792</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">58553</pubid>
                  <pubid idtype="pmpid" link="fulltext">11553813</pubid>
                  <pubid idtype="doi">10.1073/pnas.191368598</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Integration of cytogenetic landmarks into the draft sequence of the human genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Cheung</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Nowak</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Jang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kirsch</snm>
                  <fnm>IR</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>XN</fnm>
               </au>
               <au>
                  <snm>Furey</snm>
                  <fnm>TS</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>UJ</fnm>
               </au>
               <au>
                  <snm>Kuo</snm>
                  <fnm>WL</fnm>
               </au>
               <au>
                  <snm>Olivier</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2001</pubdate>
            <volume>409</volume>
            <fpage>953</fpage>
            <lpage>958</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35057192</pubid>
                  <pubid idtype="pmpid" link="fulltext">11237021</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <aug>
               <au>
                  <snm>Gamma</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Helm</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Vlissides</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Design Patterns: Elements of Reusable Object-Oriented Software</source>
            <publisher>Boston: Addison-Wesley</publisher>
            <pubdate>1995</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>
