<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
<ui>gb-2012-13-7-r61</ui>
<ji>1465-6906</ji>
<fm>
<dochead>Method</dochead>
<bibl>
<title><p>Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data</p></title>
<aug>
<au id="A1"><snm>Liu</snm><fnm>Yaping</fnm><insr iid="I1"/><insr iid="I2"/><email>yapingli@usc.edu</email></au>
<au id="A2"><snm>Siegmund</snm><mi>D</mi><fnm>Kimberly</fnm><insr iid="I3"/><email>kims@usc.edu</email></au>
<au id="A3"><snm>Laird</snm><mi>W</mi><fnm>Peter</fnm><insr iid="I1"/><email>plaird@usc.edu</email></au>
<au id="A4" ca="yes"><snm>Berman</snm><mi>P</mi><fnm>Benjamin</fnm><insr iid="I1"/><insr iid="I3"/><email>bberman@usc.edu</email></au>
</aug>
<insg><ins id="I1"><p>USC Epigenome Center, University of Southern California, 1450 Biggy Street, Los Angeles, CA 90089, USA</p></ins><ins id="I2"><p>Genetics, Molecular and Cellular Biology Program, University of Southern California, 1975 Zonal Avenue KAM-B16, Los Angeles, CA 90089, USA</p></ins><ins id="I3"><p>Department of Preventive Medicine, Keck School of Medicine, University of Southern California, 1441 Eastlake Avenue, Los Angeles, CA 90089, USA</p></ins>
</insg>
<source>Genome Biology</source>
<issn>1465-6906</issn>
<pubdate>2012</pubdate>
<volume>13</volume>
<issue>7</issue>
<fpage>R61</fpage><url>http://genomebiology.com/2012/13/7/R61</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2012-13-7-r61</pubid><pubid idtype="pmpid">22784381</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>21</day><month>5</month><year>2012</year></date></rec><revrec><date><day>3</day><month>7</month><year>2012</year></date></revrec><acc><date><day>4</day><month>7</month><year>2012</year></date></acc><pub><date><day>11</day><month>7</month><year>2012</year></date></pub></history>
<cpyrt><year>2012</year><collab>Liu et al.; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec><st><p>Abstract</p></st>
<p>Bisulfite treatment of DNA followed by high-throughput sequencing (Bisulfite-seq) is an important method for studying DNA methylation and epigenetic gene regulation, yet current software tools do not adequately address single nucleotide polymorphisms (SNPs). Identifying SNPs is important for accurate quantification of methylation levels and for identification of allele-specific epigenetic events such as imprinting. We have developed a model-based bisulfite SNP caller, Bis-SNP, that results in substantially better SNP calls than existing methods, thereby improving methylation estimates. At an average 30&#215; genomic coverage, Bis-SNP correctly identified 96% of SNPs using the default high-stringency settings. The open-source package is available at <url>http://epigenome.usc.edu/publicationdata/bissnp2011</url>.</p>
</sec>
</abs>
</fm>
<bdy>
<sec><st><p>Background</p></st>
<p>Cytosine methylation of DNA plays an important role in mammalian gene regulation, chromatin structure and imprinting during normal development and the development of pathological conditions such as cancer. With the dramatic increase in throughput made possible by next-generation DNA sequencing technologies, sodium bisulfite conversion followed by massively parallel sequencing (Bisulfite-seq) has become an increasingly popular method for investigating epigenetic profiles in the human genome (reviewed in <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>). Several sequencing strategies have been applied that vary in terms of cost and the regions of the genome covered. Reduced Representation Bisulfite-Seq (RRBS <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>) uses restriction fragment size selection to select a portion of the genome enriched for CpG Islands and gene regulatory sequences. Bisulfite Padlock Probes (BSPP <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>) or solution-based hybridization capture (Agilent, Inc., Santa Clara, CA, USA) can be designed for customizable selection of hundreds of thousands of regions throughout the genome. Whole-Genome Bisulfite-Seq (WGBS <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>) is the most comprehensive technique, covering more than 90% of cytosines in the human genome. Bisulfite-seq is well-suited to the investigation of epigenetic changes from clinical tissue samples <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, and can be applied to very small quantities of DNA <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> including formalin-fixed samples <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. WGBS and RRBS data have been used to profile a number of cell lines and human tissues by large sequencing consortia including the ENCODE project <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, the NIH Epigenomics Roadmap, and The Cancer Genome Atlas (TCGA), and these datasets are publicly available for download.</p>
<p>Bisulfite treatment of DNA converts unmethylated cytosines to uracils, which are replaced by thymines during amplification. This dramatic change to sequence composition necessitates specialized software for almost all sequence analysis tasks. Typically, the first step in processing high-throughput sequencing data is to map and align each read to the correct location in the reference genome (genome mapping), and a number of powerful tools have been developed to map bisulfite-converted reads (reviewed in <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>). The next step is to identify differences between the reference genome and the sample genome, including single-nucleotide polymorphisms (SNPs) and insertion/deletion events (indels). The identification of SNPs has been an active area of research and a number of powerful statistical tools have been developed for SNP calling of non-bisulfite sequencing data <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. SNP calling of bisulfite sequencing data has significant complications. First, reads from the two genomic strands are not complementary, and this assumption of complementarity is made by all SNP calling algorithms. Second, true (evolutionary) C&gt;T SNPs in the sample cannot be distinguished from C&gt;T substitutions that are caused by bisulfite conversion, and can thus be misidentified as unmethylated Cs. Consequently, identification of such SNPs is important for accurate quantification of methylation levels, especially so given the fact that C&gt;T is the most common substitution in the human population (65% of all SNPs in dbSNP) and these usually occur in the CpG context <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
<p>Accurate SNP calling at the positions immediately surrounding a cytosine is equally important. Those nucleotides lying one or two positions 3' of the cytosine are particularly critical, as they are subject to the specificity of particular methyltransferases. These methyltransferase-specific context positions can be organism or cell type specific. In mammals, CpG dinucleotides are often highly methylated in most cell types, while CpA dinucleotides have much lower methylation levels and are cell type restricted <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B15">15</abbr></abbrgrp>. In plants, by contrast, CHG trinucleotides are often methylated <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. Other sequences within a slightly wider genomic neighborhood can also have strong <it>cis </it>effects on methylation, perhaps due to the presence of key regulatory motifs <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Heterozygous SNPs in proximity to cytosines can be used to reveal widespread allele-specific methylation patterns <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> and important regulatory changes such as loss of imprinting <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>.</p>
<p>Despite the great interest in Bisulfite-seq and the availability of a number of tools for genomic mapping, no adequate software exists for SNP calling <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. In order to overcome the difficulty in identifying SNPs in bisulfite-treated sequences, some groups have relied on matched non-bisulfite sequencing data in the same sample <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. Others have used non-bisulfite SNP microarrays <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>, or used study designs relying on isogenic mouse strains with known parental genotypes <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B24">24</abbr></abbrgrp>.</p>
<p>A key property of some bisulfite-related protocols is that G nucleotides on the strand opposing a C are not affected by conversion. This strand-specificity principle has been exploited in order to distinguish bisulfite conversion from C&gt;T SNPs <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The Illumina-based protocol currently being used in most Bisulfite-seq studies has this important property, and thus it has been classified as a <it>directional </it>bisulfite-seq protocol <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. <it>Non-directional </it>protocols (those that also result in G&gt;A substitutions) have been used <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, but have not been widely adopted. Figure <figr fid="F1">1</figr> illustrates the directional protocol, where approximately half the reads at a given cytosine position (those mapping to the 'C-strand') can be used for methylation quantification but cannot distinguish C&gt;T SNPs. The other half (those mapping to the 'G-strand', boxed in Figure <figr fid="F1">1a</figr>) yield no methylation information but can be used to identify C&gt;T SNPs. When these C&gt;T SNPs are heterozygous, they can be used in the analysis of allele specific methylation (Additional File <supplr sid="S1">1</supplr>).</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Detecting single nucleotide polymorphisms from Bisulfite-seq data</p></caption><text>
   <p><b>Detecting single nucleotide polymorphisms from Bisulfite-seq data</b>. Hypothetical bisulfite-sequencing data are shown, with the reference genome at the top, the genome of the individual sequenced (unobserved) in the middle, and bisulfite sequencing reads at the bottom. (<b>a</b>) shows three reference cytosine positions, with the first being a match to the reference genome and the second two being <it>homozygous </it>single nucleotide polymorphisms. The first case shows a true C:G genotype, and all reads on the same strand as the C (the 'C-strand') are read as T, indicating an unmethylated state (shown as blue). Because the Illumina Bisulfite-seq protocol is 'directional', reads on the opposite strand (the 'G-strand') are read as the true genotype, G ('genotype' reads on the G-strand are boxed in this figure). The second case illustrates a true C&gt;T SNP, which can be distinguished by the A reads present on the G-strand. In this case, the reads on the C-strand are inferred to be from a true 'T' and should <it>not </it>be used for methylation calling (crossed out here). The third case shows a T&gt;C SNP, which again can be identified based on G-strand reads. (<b>b</b>) A cytosine position with 50% unmethylated (T) and 50% methylated (C) reads can be associated with a heterozygous SNP on the same sequencing reads. In this case, the unmethylated reads are those on the 'A' allele chromosome (here shown as maternal) and the methylated reads are on the 'T' allele chromosome.</p>
</text><graphic file="gb-2012-13-7-r61-1"/></fig><suppl id="S1">
<title><p>Additional file 1</p></title>
<text><p><b>Detecting heterozygous C/T single nucleotide polymorphisms from Bisulfite-seq data</b>. Hypothetical bisulfite-seq data with all labels as in Figure <figr fid="F1">1</figr>. This illustrates detection of a C/T heterozygous position (left), and that the G-strand alleles can be used to associate methylation state of an adjacent cytosine on the opposite strand with two parental alleles.</p></text><file name="gb-2012-13-7-r61-S1.PDF">
   <p>Click here for file</p>
</file></suppl><p>The inherent directionality of Illumina Bisulfite-seq has thus far been used only in a limited and <it>ad hoc </it>way. The Salk Institute group filtered out cytosines which did not have one or more unconverted Cs on the C-strand, but this approach can result in lost information about completely unmethylated cytosines (which play a crucial role in gene regulation) <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B29">29</abbr></abbrgrp>. Our own group filtered out reference Cs if opposing reads contained As, but the number of such A reads required was somewhat arbitrary <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. A third group removed all C/T reads on the C-strand, and called SNPs by requiring a minimum number of reads containing two different alleles <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Importantly, none of these so-called 'k-allele' approaches took advantage of base calling quality scores, which have been shown to be extremely important for distinguishing true SNPs from sequencing errors <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Others used various methods that did not attempt to identify C/T or other SNPs occurring at cytosines <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. Such methods may be useful for analyzing allele-specific patterns in a limited way, but do not address the need to improve methylation quantification by identifying SNPs.</p>
<p>Here, we describe a probabilistic SNP caller, <monospace>Bis-SNP</monospace>, that is based on methods that have proven successful in non-bisulfite SNP calling <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Bis-SNP uses Bayesian inference to evaluate a model of strand-specific base calls and base call quality scores, along with prior information on population SNP frequencies, experiment-specific bisulfite conversion efficiency, and site-specific DNA methylation estimates. It also takes advantage of base call quality score recalibration, an addition that has greatly improved SNP calling in the non-bisulfite context <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Bis-SNP is open-source and based on the GATK framework <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, which takes advantage of the parallel Map-Reduce computation strategy and provides practical execution times. Bis-SNP accepts either single-end or paired-end mapped Bisulfite-seq data in the form of BAM files, and outputs SNP and methylation information using standard file formats. We show that Bis-SNP is a practical tool that can both (1) improve DNA methylation calling accuracy by detecting SNPs at cytosines and adjacent positions, and (2) identify heterozygous SNPs that can be used to investigate mono-allelic DNA methylation and polymorphisms in cis-regulatory sequences.</p>
</sec>
<sec><st><p>Results and discussion</p></st>
<sec><st><p>Bis-SNP workflow</p></st>
<p>The two primary steps in the Bis-SNP workflow are outlined in Figure <figr fid="F2">2a</figr>: base quality recalibration with local realignment, followed by SNP calling. Bis-SNP accepts standard alignment files (<monospace>.bam</monospace> format), which can be generated by popular Bisulfite-seq mapping programs such as MAQ, Bismark, BSMAP, PASH, or Novoalign (reviewed in <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>). This allows the user to decide which mapping criteria are most important for their specific application. This also makes Bis-SNP compatible with specialized mappers such as RRBSMAP <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> and any other program that can output <monospace>.bam</monospace> files.</p>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Bis-SNP workflow</p></caption><text>
   <p><b>Bis-SNP workflow</b>. (<b>a</b>) Bis-SNP accepts <monospace>.bam</monospace> files, produced by a genome mapping tool (BSMAP, MAQ, Novoalign, Bismark, and so on). The local realignment and base quality recalibration steps result in a new BAM with the recalibrated base quality scores. Finally, Bis-SNP performs SNP calling and outputs both methylation levels and SNP calls. (<b>b</b>) The SNP calling step is performed on each genomic position independently. Differences between the reference genome and the sample genome can produce one of 10 possible allele pairs or genotypes (<b>G</b>, only 4 shown here). Frequencies of all possible substitutions in the population are taken from the dbSNP database and represented as <it>&#960;</it>(<b>G</b>). A probabilistic model that incorporates prior probabilities for methylation level and bisulfite conversion efficiency is used to calculate the likelihood of observing the actual bisulfite read data (<b>D</b>) assuming each of the 10 genotypes (<it>Pr</it>(<b>D</b>|<b>G</b>)). Finally, Bayesian inference uses the population frequencies of each SNP to calculate the posterior probability <it>Pr</it>(<b>G</b>|<b>D</b>).</p>
</text><graphic file="gb-2012-13-7-r61-2"/></fig>
<p>The Bis-SNP model relies on the accuracy of base quality scores, which are initially estimated by the instrument-specific base caller. However, these initial base scores do not accurately represent true error probabilities, which are highly dependent on local sequence context <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. In the GATK workflow, empirical mismatch rates for each nucleotide at each sequencing cycle are calculated by comparing base calls to the reference genome, and these mismatch rates are used to recalibrate instrument-generated values <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. We cannot use this default implementation with bisulfite-seq data, because true C&gt;T sequencing errors cannot be identified when the underlying methylation state of each bisulfite-converted DNA fragment is unknown. Therefore, instead of treating Ts at reference cytosines as errors, we treat them as a 5th base <it>X</it>, and estimate these as a group separately from T&gt;T, A&gt;T, or G&gt;T. As a result, we can recalibrate base call quality scores for all except the <it>X </it>nucleotide, improving our ability to accurately identify SNPs. Importantly, we are able to improve SNP calling at cytosines by recalibrating 'G-strand' Gs that are complementary to the cytosine.</p>
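<p>The tallying idea behind this recalibration can be illustrated with a minimal sketch. It is not the GATK or Bis-SNP implementation: the function names are hypothetical, the only covariate is the reported quality score (the real workflow also conditions on nucleotide and sequencing cycle), the add-one smoothing is an assumption, and each observation is assumed to be oriented so that a possible bisulfite conversion appears as a T at a reference C.</p>

```python
import math
from collections import defaultdict


def classify(ref_base, read_base):
    """Label one aligned base as 'match', 'mismatch', or the 5th base 'X'."""
    if ref_base == "C" and read_base == "T":
        # Possible bisulfite conversion: counted neither as match nor as error.
        return "X"
    return "match" if ref_base == read_base else "mismatch"


def empirical_qualities(observations):
    """Map each reported quality to an empirical Phred-scaled quality.

    observations: iterable of (ref_base, read_base, reported_quality).
    X observations are left out of both the mismatch and the total counts.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [mismatches, total]
    for ref_base, read_base, qual in observations:
        kind = classify(ref_base, read_base)
        if kind == "X":
            continue
        counts[qual][1] += 1
        if kind == "mismatch":
            counts[qual][0] += 1
    # Add-one smoothing (an assumption of this sketch) keeps rates finite.
    return {
        q: -10.0 * math.log10((mm + 1) / (total + 2))
        for q, (mm, total) in counts.items()
    }
```

<p>In this sketch, bases reported at quality 30 that mismatch the reference far more often than 1 in 1,000 would be assigned a lower empirical quality, while T calls at reference cytosines never influence the estimate.</p>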
<p>The user can choose among several output files. For methylation levels, Bis-SNP can return a standard UCSC <monospace>.bed</monospace> or <monospace>.wig</monospace> file, and a separate output file is generated for each cytosine context specified by the user on the command line. Example cytosine contexts are CG, CH, or CHH (H is the IUPAC symbol for A, C, or T). The <monospace>.wig</monospace> output contains the methylation percentage for each methylated cytosine, while the <monospace>.bed</monospace> format also contains the number of C/T reads the percentage is based on, plus the strand of each cytosine relative to the reference genome. For SNPs, Bis-SNP can return a Variant Call Format (<monospace>.vcf</monospace>) file, which contains all SNP calls and likelihood scores in addition to methylation percentages.</p>
</sec>
<sec><st><p>Description of SNP calling algorithm</p></st>
<p>The core of the SNP calling algorithm is based on the Bayesian inference model of GATK <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, and implemented using GATK's LocusWalker class. For each locus, Bis-SNP evaluates each of ten possible diploid genotypes (<b>G</b>), as shown in Figure <figr fid="F2">2b</figr> (a diploid genotype is made up of two parental alleles, referred to as <it>A </it>and <it>B</it>). The prior probability of each genotype, <it>&#960;</it>(<b>G</b>), is determined using population data from dbSNP (including 1000 Genomes data) similar to SOAPsnp <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> (see Materials and Methods). In this model, the likelihood of observing all base calls at a particular locus, assuming a particular diploid genotype <it>AB</it>, is expressed as <it>Pr</it>(<b>D</b>|<b>G </b>= <it>AB</it>) and is the product of the probabilities of observing the base call in each individual read <it>j </it>(Equation 2 of Materials and Methods). As described below, <it>Pr</it>(<it>D<sub>j</sub></it>|<b>G </b>= <it>AB</it>) is calculated according to the strand of read <it>j </it>and several bisulfite-specific parameters, <it>&#946;,&#945; </it>and <it>&#947; </it>(Figure <figr fid="F2">2b</figr>).</p>
<p>In the GATK non-bisulfite SNP calling model, the probability of observing a base call different from the presumed genotype <b>G </b>is simply the base call quality score (defined as the probability of a base calling error). In the case of Bisulfite-seq, this is true for A:T genotypes but not C:G. For C:G genotypes, the probability of observing a T depends on the strand of the read, the methylation state, and the efficiency of bisulfite conversion. Reads on the G-strand opposite the cytosine are treated with the normal GATK model. Reads on the C-strand use an alternate model that considers C&gt;T substitutions as either potential errors or bisulfite conversions (see Materials and Methods). The probability of observing a bisulfite conversion event depends on both the underlying methylation state and bisulfite conversion errors. While none of these are observed directly, they are included in the model as variables <it>&#946;,&#945; </it>and <it>&#947; </it>as described in Equation 5 in the Methods section.</p>
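<p>The strand-aware likelihood and the Bayesian update can be sketched as follows. This is an illustration under stated assumptions rather than the Bis-SNP implementation or the equations of the Methods section: the function names are hypothetical, sequencing errors are spread evenly over the three other bases, each read is drawn from either allele with probability 0.5, bases are given in reference-strand coordinates, only the C&gt;T case is modeled (the symmetric G&gt;A case is omitted), and every error probability must be greater than zero.</p>

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def base_likelihood(obs, allele, err, converts, beta, alpha, gamma):
    """Pr(observed base | one allele) for a single read."""

    def seq_err(true_base):
        # Standard model: correct call with 1 - err, else one of 3 other bases.
        return 1.0 - err if obs == true_base else err / 3.0

    if converts and allele == "C":
        # Probability that the cytosine still reads as C after bisulfite:
        # methylated and not overconverted, or unmethylated and underconverted.
        stays_c = beta * (1.0 - gamma) + (1.0 - beta) * alpha
        return stays_c * seq_err("C") + (1.0 - stays_c) * seq_err("T")
    return seq_err(allele)


def genotype_posteriors(reads, priors, beta=0.5, alpha=0.0025, gamma=0.0):
    """Posterior Pr(G | D) for each genotype.

    reads: list of (base, error_probability, on_c_strand) tuples.
    priors: dict mapping each genotype string to a positive prior.
    """
    log_post = {}
    for g in GENOTYPES:
        total = math.log(priors[g])
        for obs, err, on_c_strand in reads:
            p = 0.5 * sum(
                base_likelihood(obs, a, err, on_c_strand, beta, alpha, gamma)
                for a in g
            )
            total += math.log(p)
        log_post[g] = total
    # Normalize in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}
```

<p>With ten C-strand reads that are all T, the call in this sketch is decided by the G-strand reads: if they show the reference allele the most probable genotype is CC (an unmethylated cytosine), and if they show the alternate allele it is TT.</p>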
<p>After bisulfite treatment, an unmethylated C that fails to get converted to a T is referred to as an <it>underconversion</it>, while a methylated C that is converted to T is referred to as an <it>overconversion</it>. The underconversion rate, <it>&#945;</it>, is often estimated using either a spike-in control <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> or the unmethylated mitochondrial genome <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. This rate can be set manually by the user and has a value of 0.25% by default. While bisulfite overconversion cannot be reliably measured using current Bisulfite-seq data, we include an additional parameter, <it>&#947;</it>, which is set to 0% by default. In the future, this could be estimated by spiking in fully-methylated control DNA.</p>
<p>The percentage of methylated reads at a given cytosine position can vary widely. Since C reads yield more information about the presence of a C&gt;T SNP than T reads (a C read on the C-strand is direct evidence of a C allele, whereas a T read may reflect either an unmethylated C or a true T), the locus-specific methylation rate can strongly influence SNP calling. In mammalian genomes, CpG methylation levels are multimodal, with various classes of functional elements having distinct methylation patterns. At least four different classes exist with mean methylation rates ranging from around 0% to over 80% <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B24">24</abbr></abbrgrp>. Furthermore, methylation at particular di- or tri-nucleotide contexts is organism and even cell type specific. To better understand how methylation estimates could affect SNP calling performance, we implemented several different methods for estimating the methylation frequency parameter <it>&#946;</it>, which we describe next.</p>
<p>First, we used a <it>naive </it>estimate for <it>&#946; </it>where the probability of a read being methylated or unmethylated at any particular cytosine position was 0.5. Second, we used <it>context-specific </it>estimates which were determined in a two-round procedure as follows. In the first round, <it>naive </it>estimates were used as described above, and the resulting SNP calls were used along with dbSNP to select a set of high-confidence non-SNP homozygous cytosines (probability&gt;99.99%). These homozygous cytosines were used to estimate average methylation levels for a set of cytosine sequence contexts that could be specified on the Bis-SNP command line (by default, set to <it>&#946;<sub>CG </sub></it>and <it>&#946;<sub>CH</sub></it>). In the third and final estimation method, <it>&#946; </it>was estimated for each cytosine locus individually using the number of C and T reads (<inline-formula><m:math name="gb-2012-13-7-r61-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mfrac>
   <m:mrow>
      <m:mi>c</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mi>c</m:mi>
      <m:mo class="MathClass-bin">+</m:mo>
      <m:mi>t</m:mi>
   </m:mrow>
</m:mfrac>
</m:math></inline-formula>). The rationale for this <it>locus-specific </it>method was our concern that genome-wide estimates might be inappropriate for CpGs, given the strongly bimodal nature of CpG methylation levels. Each of these three <it>&#946; </it>estimation methods was run individually as described below. The default method for the public version of Bis-SNP is <it>locus-specific </it>estimation.</p>
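<p>The three estimation methods can be summarized in a short sketch. The function names are hypothetical, pooling reads across sites for the context-specific estimate is an assumption (a per-site average would be another reading of the text), and the fallback value returned when a locus has no C or T reads is likewise an assumption.</p>

```python
def naive_beta():
    """Naive estimate: a read is equally likely methylated or unmethylated."""
    return 0.5


def context_beta(sites, fallback=0.5):
    """Context-specific estimate from high-confidence homozygous cytosines.

    sites: iterable of (c_reads, t_reads) for one sequence context, e.g. CG.
    Reads are pooled across sites before taking the ratio.
    """
    c_total = 0
    t_total = 0
    for c_reads, t_reads in sites:
        c_total += c_reads
        t_total += t_reads
    total = c_total + t_total
    return c_total / total if total > 0 else fallback


def locus_beta(c_reads, t_reads, fallback=0.5):
    """Locus-specific estimate c / (c + t) at a single cytosine."""
    total = c_reads + t_reads
    return c_reads / total if total > 0 else fallback
```

<p>For example, a cytosine covered by three C reads and one T read would receive a locus-specific estimate of 0.75 in this sketch, regardless of the genome-wide average for its context.</p>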
</sec>
<sec><st><p>Evaluation of SNP calls at known SNPs</p></st>
<p>We evaluated Bis-SNP calling accuracy for each of the three different methylation estimation methods (<it>naive</it>, <it>context-specific</it>, and <it>locus-specific</it>). The latter two methods performed substantially better than <it>naive </it>estimation, so those are the only two discussed below. We evaluated accuracy using an actual whole-genome Bisulfite-seq dataset from a normal (male) human colon mucosa sample published previously by our lab <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> (sequence available via accession dbGaP:phs000385). All reads were 75 bp long, single-end, and generated using the Illumina Genome Analyzer IIx platform. The complete dataset had an average read depth of 32&#215;. The Bisulfite-seq data were compared to Illumina Human1M-Duo BeadChip SNP array data from the same sample.</p>
<p>The primary goal of bisulfite sequencing is the accurate determination of cytosine methylation levels, so we first investigated the ability of Bis-SNP to correctly identify homozygous cytosines. As the 'ground truth', we used 435,120 positions identified as homozygous cytosines on the 1 M SNP array, and examined false negative and false positive calls made by Bis-SNP (Figure <figr fid="F3">3a-c</figr>). Calls at varying stringencies were generated by adjusting the Bis-SNP score cutoff, which is defined as the odds ratio between the first and second most likely genotype (see Methods). Evaluating the different Bis-SNP methylation estimates with and without base quality recalibration showed that the <it>locus-specific &#946; </it>estimation plus recalibration produced the most accurate results. Using the complete sequence dataset and the default score cutoff (Figure <figr fid="F3">3c</figr>, red circle), Bis-SNP was able to detect 95.22% of the true cytosines (414,327 features) with a false positive rate of 0.37% (2,461 features). We simulated lighter sequencing coverage by randomly picking reads from the full dataset to estimate accuracy at 8&#215; (Figure <figr fid="F3">3a</figr>) and 16&#215; (Figure <figr fid="F3">3b</figr>) genomic coverage. The reader should note that these false positive rates are not indicative of the genome-wide false positive rates, since most false positives come from heterozygous SNPs which are frequent on the SNP array but very infrequent in the genome.</p>
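<p>The score used for the stringency cutoff, and the sensitivity figure quoted above, can be reproduced with a small sketch. The function names are hypothetical, and applying the ratio to posterior probabilities on a linear scale is an assumption; the exact definition of the score is given in the Methods section.</p>

```python
def best_call(posteriors):
    """Return (best_genotype, ratio of best to second-best probability).

    posteriors: dict mapping at least two genotypes to probabilities.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    ratio = float("inf") if p2 == 0 else p1 / p2
    return best, ratio


def sensitivity(true_positives, truth_total):
    """Fraction of ground-truth positions that were correctly called."""
    return true_positives / truth_total


# Homozygous cytosines recovered at the default cutoff on the full dataset.
print(round(100 * sensitivity(414_327, 435_120), 2))  # 95.22
```

<p>A call would be retained in this sketch only when its ratio meets the chosen cutoff, so raising the cutoff trades sensitivity for fewer false positives.</p>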
<fig id="F3"><title><p>Figure 3</p></title><caption><p>Bis-SNP error frequencies in detecting SNPs on the Illumina 1 M SNP array</p></caption><text>
   <p><b>Bis-SNP error frequencies in detecting SNPs on the Illumina 1 M SNP array</b>. Receiver operating characteristic (ROC) curves are shown for Bis-SNP accuracy at detecting SNPs in Bisulfite-seq data derived from human colonic mucosa tissue. The 'true' genotypes were determined using an Illumina Duo 1 M Human SNP array, and Bis-SNP results were only evaluated at these million genomic positions. All datasets were from <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The three ROC curves at the top (<b>a-c</b>) show accuracy at positions corresponding to 435,120 homozygous cytosines on the 1 M SNP array. By randomly downsampling from the average 32&#215; read depth of the Bisulfite-seq data, we are able to show results corresponding to 8&#215; coverage (<b>a</b>) and 16&#215; coverage (<b>b</b>), alongside the full dataset (<b>c</b>). Bis-SNP using three different conditions is compared to Bismark and the method used in 'Berman2012' <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, both of which restrict their results to reference cytosines. For 'Berman2012', we varied the number of reverse strand G reads required to plot a range of stringencies. The three plots at the bottom (<b>d-f</b>) show accuracy at the 303,656 positions that are heterozygous according to the 1 M SNP array. For comparison, we show results from the k-allele method (similar to the approach of <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>), Shoemaker2010 <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and <monospace>bisReadMapper</monospace><abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
</text><graphic file="gb-2012-13-7-r61-3"/></fig>
<p>For comparison, we determined the accuracy of homozygous cytosine calling using several published methods (Figure <figr fid="F3">3a-c</figr>). <monospace>Bismark</monospace><abbrgrp><abbr bid="B34">34</abbr></abbrgrp> returns methylation estimates for all cytosines in the reference genome. It is thus not surprising that <monospace>Bismark</monospace> performs poorly for features on the 1 M SNP array, which were selected for their polymorphism and differences from the reference genome. Several other published studies use the same strategy and estimate methylation at all reference cytosines <abbrgrp><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>. In our own earlier work <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, we also restricted methylation calling to reference cytosines. Thus it is not surprising that when we applied this method ('Berman2012') to the 1 M SNP array dataset, it achieved almost the same false negative rate as <monospace>Bismark</monospace>. However, 'Berman2012' filtered out positions where fewer than 90% of reads were C or T on the C-strand and G on the G-strand, resulting in a substantially lower false positive rate than <monospace>Bismark</monospace>, but not as low as Bis-SNP.</p>
<p>We next focused on the ability of Bis-SNP to determine heterozygous SNPs, which can be used both for improving methylation calling accuracy and for allele-specific methylation analysis (see Figure <figr fid="F1">1b</figr>). Heterozygous SNPs are more difficult to identify than homozygous SNPs, because each allele receives only about half the read coverage. We excluded the haploid X chromosome, leaving 303,656 autosomal loci called as heterozygous by the 1 M SNP array. As before, the <it>locus-specific &#946; </it>methylation estimation plus recalibration performed the best of all methods. Using the full dataset with the default Bis-SNP cutoff (Figure <figr fid="F3">3f</figr>, red circle), Bis-SNP was able to identify 93.18% of heterozygous SNPs (282,944 loci) with a false positive rate of 0.094% (755 loci). Of the 303,656 heterozygous loci examined, 242,347 (79.81%) were C/T heterozygotes. C&gt;T is the most common SNP in mammals, arising from evolutionary deamination of methylated cytosines. It is also the most difficult SNP to detect in bisulfite-treated DNA, because the C-strand reads are often uninformative (see Figure <figr fid="F1">1</figr>). As expected, Bis-SNP (and other methods) performed more poorly on C/T heterozygous SNPs than others, due to C&gt;T conversion ambiguity (Additional File <supplr sid="S2">2</supplr>).</p><suppl id="S2">
<title><p>Additional file 2</p></title>
<text><p><b>Bis-SNP error frequencies at C:T heterozygous SNPs</b>. The data for heterozygous SNP calling in Figure <figr fid="F3">3f</figr> are broken down into C:T SNPs vs. other heterozygous SNPs.</p></text><file name="gb-2012-13-7-r61-S2.PDF">
   <p>Click here for file</p>
</file></suppl><p>We compared Bis-SNP results to heterozygous SNPs called using two alternate 'k-allele' techniques that used read count cutoffs without incorporating base quality scores. We implemented a generalized form of the method used by <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B30">30</abbr></abbrgrp> that accepts a variable read count cutoff. This cutoff, <it>k</it>, was defined as the minimum percentage of reads with a secondary allele necessary to call a heterozygous SNP. As in <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, we counted C and T as a single allele at reference cytosines (on the C-strand only). In addition to k-allele, we also tried the Shoemaker method <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, which does not evaluate C/T SNPs at all and requires observations of the less frequent allele on at least 20% of reads on each strand. Finally, we tried the <monospace>bisReadMapper</monospace> algorithm <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, which calls SNPs independently on each strand using a non-bisulfite SNP caller, SAMTOOLS <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, and reports only those SNPs that agree between strands. Figures <figr fid="F3">3d-f</figr> show that each variation of Bis-SNP performs better than the other methods.</p>
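<p>The k-allele rule can be sketched as follows. This is an illustrative reading of the description above, not the code used in the comparison: the function name <monospace>k_allele_is_het</monospace>, the read representation (a base reported in reference orientation plus a flag for C-strand origin), and the use of the second most frequent allele as the 'secondary allele' are all assumptions.</p>

```python
from collections import Counter


def k_allele_is_het(reads, ref_is_cytosine, k_percent):
    """Call a heterozygous SNP from raw read counts (no base qualities).

    reads: iterable of (base, from_c_strand) pairs, with the base given in
           reference orientation (hypothetical input format).
    ref_is_cytosine: True if the reference base at this position is C.
    k_percent: minimum percentage of reads that must carry the secondary allele.
    """
    counts = Counter()
    for base, from_c_strand in reads:
        # On the C-strand, bisulfite conversion makes C and T indistinguishable
        # at reference cytosines, so both are counted as a single allele.
        if ref_is_cytosine and from_c_strand and base == "T":
            base = "C"
        counts[base] += 1
    ranked = counts.most_common(2)
    if len(ranked) != 2:
        return False  # zero or one allele observed
    total = sum(counts.values())
    secondary = ranked[1][1]
    return 100.0 * secondary / total >= k_percent
```

<p>With this rule, a reference cytosine covered only by C-strand reads showing a mixture of C and T is never called heterozygous, which reflects why C/T SNPs are the hardest class to detect without G-strand coverage.</p>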
<p>An important practical question is the minimum read depth required for accurate SNP identification. We addressed this problem by downsampling our 32&#215; Bisulfite-seq genome to various coverage levels from 2&#215; to 30&#215; (Figure <figr fid="F4">4</figr>). For each coverage level, we determined the number of false positives and false negatives across a range of Bis-SNP stringency cutoffs using the 1 M SNP array data, as in Figure <figr fid="F3">3</figr>. At each coverage level, we then selected the least stringent cutoff that produced a False Discovery Rate (FDR) of less than 5%, and plotted the sensitivity (the fraction of true positives detected). For both homozygous cytosines (Figure <figr fid="F4">4a</figr>) and heterozygous SNPs (Figure <figr fid="F4">4b</figr>), sensitivity increased dramatically up to about 10&#215; coverage and then began to level off. Homozygous SNPs were almost fully detected (98% sensitivity) by 10&#215; coverage, while heterozygous SNPs had a more gradual increase from 80% detected at 10&#215; to 95% detected at 30&#215;.</p>
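<p>The cutoff selection used at each coverage level can be sketched as below. The function name <monospace>least_stringent_cutoff</monospace> and the input layout are hypothetical, and the sketch assumes that a larger cutoff value is more stringent; FDR is taken as false positives divided by all positive calls, and sensitivity as true positives divided by all true sites.</p>

```python
def least_stringent_cutoff(points, max_fdr=0.05):
    """Pick the least stringent cutoff whose FDR is below max_fdr.

    points: iterable of (cutoff, true_pos, false_pos, false_neg) tuples,
            one per stringency cutoff at a given coverage level.
    Returns (cutoff, sensitivity), or None if no cutoff qualifies.
    Assumption: larger cutoff values are more stringent.
    """
    for cutoff, tp, fp, fn in sorted(points):  # least stringent first
        calls = tp + fp
        truth = tp + fn
        if calls == 0 or truth == 0:
            continue  # FDR or sensitivity undefined at this cutoff
        fdr = fp / calls
        if max_fdr > fdr:
            return cutoff, tp / truth
    return None
```

<p>For instance, given the made-up points (10, 90, 10, 10), (20, 85, 4, 15) and (30, 70, 1, 30), the first cutoff has an FDR of 10% and is rejected, so the sketch returns cutoff 20 with a sensitivity of 0.85.</p>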
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Sensitivity as a function of sequence coverage</p></caption><text>
   <p><b>Sensitivity as a function of sequence coverage</b>. Comparisons between Bis-SNP SNP calls and the 1 M SNP array data from the Figure 3 ROC curves were extended to a range of coverage levels from 2&#215; to 30&#215;. At each coverage level, we selected the least stringent threshold that yielded a False Discovery Rate (FDR) less than 0.05, and plotted the Sensitivity (1 - False Negative rate). As in Figure 3, separate plots show sensitivity at detecting homozygous cytosines (<b>a</b>) and heterozygous SNPs (<b>b</b>). For heterozygous SNPs, we include the overall detection rate (red line), as well as separate lines for C/T heterozygous SNPs (blue line) and non-C/T heterozygous SNPs (green line).</p>
</text><graphic file="gb-2012-13-7-r61-4"/></fig>
</sec>
<sec><st><p>Accuracy of genome-wide methylation calling</p></st>
<p>To verify the ability of Bis-SNP to correctly identify cytosines and improve methylation quantification genome-wide, we ran Bis-SNP across chromosome 1 for the OTB colon mucosa sample and four additional whole-genome bisulfite-seq samples (Table <tblr tid="T1">1</tblr>). The TCGA normal lung and normal breast samples were generated by the USC Epigenome Center and aligned using BSMAP, while the two mouse methylomes were generated by UCSD and aligned using Novoalign <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Runtimes for chromosome 1 were about 3 hours using a standard 12-core Intel server with 10 GB RAM (Intel, Santa Clara, CA; Table <tblr tid="T1">1</tblr>). The entire human genome takes about 30-40 hours on a single server (data not shown).</p>
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Chromosome 1 Bis-SNP detection</p></caption><tblbdy cols="8">
      <r>
         <c ca="center">
            <p>
               <b>Sample</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Aligner</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Reference</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Coverage</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Het SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Hom SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Callable bases</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Runtime</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>OTB</p>
         </c>
         <c ca="center">
            <p>MAQ</p>
         </c>
         <c ca="center">
            <p>hg18</p>
         </c>
         <c ca="center">
            <p>32&#215;</p>
         </c>
         <c ca="center">
            <p>119,103</p>
         </c>
         <c ca="center">
            <p>67,725</p>
         </c>
         <c ca="center">
            <p>211,042,010</p>
         </c>
         <c ca="center">
            <p>2.8 h</p>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>TCGA-lung-normal</p>
         </c>
         <c ca="center">
            <p>BSMAP</p>
         </c>
         <c ca="center">
            <p>hg19</p>
         </c>
         <c ca="center">
            <p>19&#215;</p>
         </c>
         <c ca="center">
            <p>118,412</p>
         </c>
         <c ca="center">
            <p>58,309</p>
         </c>
         <c ca="center">
            <p>222,763,786</p>
         </c>
         <c ca="center">
            <p>3.1 h</p>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>TCGA-breast-normal</p>
         </c>
         <c ca="center">
            <p>BSMAP</p>
         </c>
         <c ca="center">
            <p>hg19</p>
         </c>
         <c ca="center">
            <p>19&#215;</p>
         </c>
         <c ca="center">
            <p>113,009</p>
         </c>
         <c ca="center">
            <p>57,281</p>
         </c>
         <c ca="center">
            <p>221,014,965</p>
         </c>
         <c ca="center">
            <p>2.7 h</p>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Mouse-F1i</p>
         </c>
         <c ca="center">
            <p>Novoalign</p>
         </c>
         <c ca="center">
            <p>mm9</p>
         </c>
         <c ca="center">
            <p>50&#215;</p>
         </c>
         <c ca="center">
            <p>663,528</p>
         </c>
         <c ca="center">
            <p>65,364</p>
         </c>
         <c ca="center">
            <p>178,718,615</p>
         </c>
         <c ca="center">
            <p>3.1 h</p>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Mouse-F1r</p>
         </c>
         <c ca="center">
            <p>Novoalign</p>
         </c>
         <c ca="center">
            <p>mm9</p>
         </c>
         <c ca="center">
            <p>41&#215;</p>
         </c>
         <c ca="center">
            <p>682,979</p>
         </c>
         <c ca="center">
            <p>67,068</p>
         </c>
         <c ca="center">
            <p>178,847,508</p>
         </c>
         <c ca="center">
            <p>3.1 h</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Notes: All benchmarking was performed using a single Intel(R) Xeon (X5650, 2.67 GHz) server with 12 CPU cores and 10 GB memory. SE refers to single-end sequencing and PE to paired-end sequencing.</p>
   </tblfn></tbl>
<p>We used Bis-SNP to identify four classes of cytosines in the sample genome (Figure <figr fid="F5">5</figr> and Table <tblr tid="T2">2</tblr> 'Sample Genotypes'), and separated these by their corresponding sequences in the reference genome (Figure <figr fid="F5">5</figr> and Table <tblr tid="T2">2</tblr> 'Reference Genotypes'). As shown in Table <tblr tid="T2">2</tblr>, in the human samples about 0.5-0.6% of reference CpGs were lost in the sample genome, and 0.5-0.6% of CpGs in the sample genome were lost in the reference. The two mouse samples had significantly higher SNP rates, presumably due to true strain differences between the crossed strains and the C57BL/6J strain sequenced for the mouse reference genome. In both F1 mice, about 2.5% of reference CpGs were lost in the sample genome, and about 1.1% of CpGs in the sample genome were lost in the reference.</p>
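<p>As a worked check of how these percentages relate to the counts, the sketch below recomputes the 'Reference CpG' column of Table 2 for the OTB normal colon sample. The counts are copied from Table 2; the helper name <monospace>column_percentages</monospace> and the interpretation of 'lost' as any reference CpG that is not a homozygous CpG in the sample are assumptions made for illustration.</p>

```python
# Counts of reference CpGs on chromosome 1 (OTB normal colon, Table 2),
# keyed by the genotype called in the sample.
otb_reference_cpg = {
    "CpG": 3758803,
    "CpH": 7773,
    "DpN": 5658,
    "CpG/CpH het": 7218,
    "CpG/RpG het": 2512,
}


def column_percentages(counts):
    """Express each sample-genotype count as a percentage of the column total."""
    total = sum(counts.values())
    return {genotype: round(100.0 * n / total, 2) for genotype, n in counts.items()}


def percent_lost(counts, kept="CpG"):
    """Percentage of reference sites not called as homozygous `kept` in the sample."""
    total = sum(counts.values())
    return round(100.0 * (total - counts[kept]) / total, 2)


percentages = column_percentages(otb_reference_cpg)
lost = percent_lost(otb_reference_cpg)
print(percentages["CpG"], lost)
```

<p>Under this reading, 99.39% of reference CpGs remain homozygous CpGs in the OTB sample and 0.61% are lost, in line with the range quoted above for the human samples.</p>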
<fig id="F5"><title><p>Figure 5</p></title><caption><p>Accurate methylation calling at SNPs</p></caption><text>
   <p><b>Accurate methylation calling at SNPs</b>. Bis-SNP was run on five different datasets: single-end sequencing from colon mucosa tissue <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> (<b>a</b>), two TCGA samples using paired-end sequencing from breast and lung tissues (normal, non-cancer), and two mouse samples using paired-end sequencing from <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> (see Table 1). In each case, Bis-SNP was used to identify cytosines in one of four sequence contexts in the sample genome. For each sample genotype, cytosines were further divided by their sequence context in the reference genome ('ref CpG', 'ref CpH', or 'refNotC'). All cytosines within a particular category in a particular sample were averaged to yield a mean methylation level. The number of cytosines in each category can be found in Table 2.</p>
</text><graphic file="gb-2012-13-7-r61-5"/></fig>
<tbl id="T2"><title><p>Table 2</p></title><caption><p>Chromosome 1 cytosine counts and methylation</p></caption><tblbdy cols="11">
      <r>
         <c ca="center">
            <p>
               <b>Sample</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Sample genotype</b>
            </p>
         </c>
         <c cspan="6" ca="center">
            <p>
               <b>Reference Genotypes</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>% methylation</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c cspan="2" ca="center">
            <p>
               <b>Reference CpG</b>
            </p>
         </c>
         <c cspan="2" ca="center">
            <p>
               <b>Reference CpH</b>
            </p>
         </c>
         <c cspan="2" ca="center">
            <p>
               <b>Reference DpN (D = A,T,G)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Ref CpG</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Ref CpH</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Ref DpN</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>OTB normal colon</p>
         </c>
         <c ca="center">
            <p>CpG</p>
         </c>
         <c ca="center">
            <p>3,758,803</p>
         </c>
         <c ca="center">
            <p>99.39%</p>
         </c>
         <c ca="center">
            <p>12,540</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>11,838</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>73%</p>
         </c>
         <c ca="center">
            <p>80%</p>
         </c>
         <c ca="center">
            <p>82%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpH</p>
         </c>
         <c ca="center">
            <p>7,773</p>
         </c>
         <c ca="center">
            <p>0.21%</p>
         </c>
         <c ca="center">
            <p>78,427,918</p>
         </c>
         <c ca="center">
            <p>99.95%</p>
         </c>
         <c ca="center">
            <p>18,804</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>DpN</p>
         </c>
         <c ca="center">
            <p>5,658</p>
         </c>
         <c ca="center">
            <p>0.15%</p>
         </c>
         <c ca="center">
            <p>14,166</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>128,570,817</p>
         </c>
         <c ca="center">
            <p>99.97%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/CpH het</p>
         </c>
         <c ca="center">
            <p>7,218</p>
         </c>
         <c ca="center">
            <p>0.19%</p>
         </c>
         <c ca="center">
            <p>8,998</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>39%</p>
         </c>
         <c ca="center">
            <p>39%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/RpG het</p>
         </c>
         <c ca="center">
            <p>2,512</p>
         </c>
         <c ca="center">
            <p>0.07%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>1,826</p>
         </c>
         <c ca="center">
            <p>0.00%</p>
         </c>
         <c ca="center">
            <p>74%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>77%</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>TCGA Normal lung</p>
         </c>
         <c ca="center">
            <p>CpG</p>
         </c>
         <c ca="center">
            <p>4,153,196</p>
         </c>
         <c ca="center">
            <p>99.52%</p>
         </c>
         <c ca="center">
            <p>10,995</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>10,511</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>76%</p>
         </c>
         <c ca="center">
            <p>84%</p>
         </c>
         <c ca="center">
            <p>85%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpH</p>
         </c>
         <c ca="center">
            <p>5,460</p>
         </c>
         <c ca="center">
            <p>0.13%</p>
         </c>
         <c ca="center">
            <p>85,031,960</p>
         </c>
         <c ca="center">
            <p>99.96%</p>
         </c>
         <c ca="center">
            <p>16,420</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>DpN</p>
         </c>
         <c ca="center">
            <p>5,310</p>
         </c>
         <c ca="center">
            <p>0.13%</p>
         </c>
         <c ca="center">
            <p>13,725</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>133,490,905</p>
         </c>
         <c ca="center">
            <p>99.98%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/CpH het</p>
         </c>
         <c ca="center">
            <p>6,682</p>
         </c>
         <c ca="center">
            <p>0.16%</p>
         </c>
         <c ca="center">
            <p>8,529</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>37%</p>
         </c>
         <c ca="center">
            <p>39%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/RpG het</p>
         </c>
         <c ca="center">
            <p>2,476</p>
         </c>
         <c ca="center">
            <p>0.06%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>1,993</p>
         </c>
         <c ca="center">
            <p>0.00%</p>
         </c>
         <c ca="center">
            <p>80%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>78%</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>TCGA normal breast</p>
         </c>
         <c ca="center">
            <p>CpG</p>
         </c>
         <c ca="center">
            <p>4,100,643</p>
         </c>
         <c ca="center">
            <p>99.54%</p>
         </c>
         <c ca="center">
            <p>10,893</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>10,657</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>75%</p>
         </c>
         <c ca="center">
            <p>85%</p>
         </c>
         <c ca="center">
            <p>86%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpH</p>
         </c>
         <c ca="center">
            <p>5,286</p>
         </c>
         <c ca="center">
            <p>0.13%</p>
         </c>
         <c ca="center">
            <p>80,654,084</p>
         </c>
         <c ca="center">
            <p>99.96%</p>
         </c>
         <c ca="center">
            <p>13,390</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
         <c ca="center">
            <p>1%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>DpN</p>
         </c>
         <c ca="center">
            <p>4,954</p>
         </c>
         <c ca="center">
            <p>0.12%</p>
         </c>
         <c ca="center">
            <p>13,310</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>136,180,779</p>
         </c>
         <c ca="center">
            <p>99.98%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/CpH het</p>
         </c>
         <c ca="center">
            <p>6,289</p>
         </c>
         <c ca="center">
            <p>0.15%</p>
         </c>
         <c ca="center">
            <p>8,120</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>39%</p>
         </c>
         <c ca="center">
            <p>40%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/RpG het</p>
         </c>
         <c ca="center">
            <p>2,413</p>
         </c>
         <c ca="center">
            <p>0.06%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>1,854</p>
         </c>
         <c ca="center">
            <p>0.00%</p>
         </c>
         <c ca="center">
            <p>78%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>79%</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Xie 2012 Mouse F1i (chr1)</p>
         </c>
         <c ca="center">
            <p>CpG</p>
         </c>
         <c ca="center">
            <p>2,125,320</p>
         </c>
         <c ca="center">
            <p>97.51%</p>
         </c>
         <c ca="center">
            <p>10,990</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>11,757</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>76%</p>
         </c>
         <c ca="center">
            <p>83%</p>
         </c>
         <c ca="center">
            <p>84%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpH</p>
         </c>
         <c ca="center">
            <p>4,314</p>
         </c>
         <c ca="center">
            <p>0.20%</p>
         </c>
         <c ca="center">
            <p>57,706,841</p>
         </c>
         <c ca="center">
            <p>99.87%</p>
         </c>
         <c ca="center">
            <p>20,312</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>3%</p>
         </c>
         <c ca="center">
            <p>3%</p>
         </c>
         <c ca="center">
            <p>3%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>DpN</p>
         </c>
         <c ca="center">
            <p>5,300</p>
         </c>
         <c ca="center">
            <p>0.24%</p>
         </c>
         <c ca="center">
            <p>20,905</p>
         </c>
         <c ca="center">
            <p>0.04%</p>
         </c>
         <c ca="center">
            <p>118,570,097</p>
         </c>
         <c ca="center">
            <p>99.96%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/CpH het</p>
         </c>
         <c ca="center">
            <p>28,896</p>
         </c>
         <c ca="center">
            <p>1.33%</p>
         </c>
         <c ca="center">
            <p>36,735</p>
         </c>
         <c ca="center">
            <p>0.06%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>43%</p>
         </c>
         <c ca="center">
            <p>42%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/RpG het</p>
         </c>
         <c ca="center">
            <p>15,754</p>
         </c>
         <c ca="center">
            <p>0.72%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>12,917</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>78%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>82%</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Xie 2012 Mouse F1r (chr1)</p>
         </c>
         <c ca="center">
            <p>CpG</p>
         </c>
         <c ca="center">
            <p>2,199,907</p>
         </c>
         <c ca="center">
            <p>97.52%</p>
         </c>
         <c ca="center">
            <p>11,268</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>11,974</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>75%</p>
         </c>
         <c ca="center">
            <p>83%</p>
         </c>
         <c ca="center">
            <p>84%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpH</p>
         </c>
         <c ca="center">
            <p>4,476</p>
         </c>
         <c ca="center">
            <p>0.20%</p>
         </c>
         <c ca="center">
            <p>58,685,115</p>
         </c>
         <c ca="center">
            <p>99.87%</p>
         </c>
         <c ca="center">
            <p>20,933</p>
         </c>
         <c ca="center">
            <p>0.02%</p>
         </c>
         <c ca="center">
            <p>3%</p>
         </c>
         <c ca="center">
            <p>3%</p>
         </c>
         <c ca="center">
            <p>4%</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>DpN</p>
         </c>
         <c ca="center">
            <p>5,171</p>
         </c>
         <c ca="center">
            <p>0.23%</p>
         </c>
         <c ca="center">
            <p>20,765</p>
         </c>
         <c ca="center">
            <p>0.04%</p>
         </c>
         <c ca="center">
            <p>117,647,445</p>
         </c>
         <c ca="center">
            <p>99.96%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/CpH het</p>
         </c>
         <c ca="center">
            <p>29,983</p>
         </c>
         <c ca="center">
            <p>1.33%</p>
         </c>
         <c ca="center">
            <p>38,159</p>
         </c>
         <c ca="center">
            <p>0.06%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>43%</p>
         </c>
         <c ca="center">
            <p>42%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>CpG/RpG het</p>
         </c>
         <c ca="center">
            <p>16,371</p>
         </c>
         <c ca="center">
            <p>0.73%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>13,147</p>
         </c>
         <c ca="center">
            <p>0.01%</p>
         </c>
         <c ca="center">
            <p>78%</p>
         </c>
         <c ca="center">
            <p>NA</p>
         </c>
         <c ca="center">
            <p>82%</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Notes: 'het' signifies heterozygous. Two non-reference bases in a row are automatically filtered out. CpH = C(A/C/T). DpN = (A/T/G)(A/C/T/G). RpG = (A/G)G. CpG/TpG heterozygous genotypes are filtered out because they cannot be used for methylation calling.</p>
   </tblfn></tbl>
<p>We next compared average methylation levels across each sample genotype (Figure <figr fid="F5">5</figr>). As expected, methylation at homozygous CpHs was consistently low, while methylation at homozygous CpGs was consistently high, regardless of the corresponding reference sequence. Both mouse frontal cortex brain samples showed elevated levels of CpH methylation, as described in the original publication <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Interestingly, homozygous CpGs that represented SNPs (where the sample differed from the reference genome) had consistently higher methylation. This fits with what is known about mammalian genome evolution: evolutionary C&gt;T changes occur much more frequently at methylated than unmethylated CpGs because the C&gt;T deamination and deamination repair process is methylation-specific. We next looked at heterozygous CpGs (Figure <figr fid="F5">5</figr>, right). CpG/CpH positions had methylation about halfway between CpG homozygous and CpH homozygous positions. At CpG/ApG or CpG/GpG heterozygous positions, methylation can only be measured for the C allele, and the methylation state is about the same as at homozygous CpGs. CpG/TpG heterozygous positions are not shown, because we cannot accurately measure methylation at these positions. Together, these data show that Bis-SNP genotype calling produces accurate methylation quantification even when the sample genome differs from the reference genome.</p>
</sec>
</sec>
<sec><st><p>Conclusions</p></st>
<p>We have described a publicly available software tool, <monospace>Bis-SNP</monospace>, which extracts methylation information and SNP information simultaneously from data generated using the Illumina Bisulfite-seq protocol. Command-line executables (Additional File <supplr sid="S3">3</supplr>) and open-source code (Additional File <supplr sid="S4">4</supplr>) are both freely available for download <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. The directional nature of the Illumina protocol allows for analysis of DNA methylation and the identification of a SNP at the same position, by combining information from each strand separately. This is the dominant Bisulfite-sequencing protocol in use today by individual labs and genomics consortia such as ENCODE, the NIH Epigenomics Roadmap, and The Cancer Genome Atlas. By correctly identifying and filtering SNPs, we can obtain more accurate methylation levels, and heterozygous SNPs, including C/T SNPs, can be used to identify allele-specific methylation patterns. Bis-SNP is implemented using the efficient GATK framework, which allows for runtimes that are reasonable for modern whole-genome analysis. An entire 32&#215; whole-genome dataset took about 30 hours to run on a typical 12-processor compute node with 10 GB of memory, or 3 hours when each chromosome was run in parallel on a separate compute node. This performance profile makes Bis-SNP accessible to most users.</p><suppl id="S3">
<title><p>Additional file 3</p></title>
<text><p><b>Bis-SNP executable, utility scripts, and User Manual</b>. We suggest that the user download the most recent version of these files directly from <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>.</p></text><file name="gb-2012-13-7-r61-S3.GZ">
   <p>Click here for file</p>
</file></suppl><suppl id="S4">
<title><p>Additional file 4</p></title>
<text><p><b>Bis-SNP source code</b>. We suggest that the user download the most recent version of these files directly from <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>.</p></text><file name="gb-2012-13-7-r61-S4.GZ">
   <p>Click here for file</p>
</file></suppl><p>We included the capability to perform base quality recalibration on Bisulfite-seq data, which improves the overall SNP calling accuracy of Bis-SNP. More accurate base quality scores not only allow better identification of SNPs, as shown here, but could also be used in the future to calculate more precise DNA methylation estimates. Biological DNA samples do not typically have a large number of cytosines that are always 100% methylated, so there is no reliable way to identify true C&gt;T mismatches and recalibrate quality scores at these positions. Recalibration could be improved in the future by spiking a library of DNA that has not been treated with bisulfite into the same sequencing lane.</p>
<p>The potential applications of Bisulfite-seq in basic biology and medicine are broad, and Bis-SNP can be used for the majority of Bisulfite-seq experimental designs, including Whole-Genome Bisulfite-Seq (WGBS), Reduced Representation Bisulfite-Seq (RRBS), and customizable genome selection methods. While we have focused on human studies, Bis-SNP can output methylation levels split up according to user-defined cytosine contexts, which makes it applicable to analysis of <it>Arabidopsis </it>or any other organism. It also allows Bis-SNP to accommodate novel study designs, such as <it>in vitro </it>methylation by methyltransferases with arbitrary sequence specificities, or even the study of 5-hydroxymethyl-cytosine (5-hmC) using a novel bisulfite-sequencing approach <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>.</p>
<p>An intriguing potential use of Bisulfite-seq and Bis-SNP is the study of genome-wide associations between SNPs and DNA methylation patterns (i.e. <it>methQTLs</it>, reviewed in <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>). While the experimental designs thus far have envisioned paired SNP and methylation assays, our encouraging results with Bis-SNP suggest that both could be captured in a single Bisulfite-sequencing experiment. Sequencing depths of 50&#215; or greater for Whole-genome Bisulfite-seq are not unattainable from a cost perspective, and would likely provide sufficient SNP and methylation coverage for methQTL studies. Another potential application could be a Genome-Wide Association Study (GWAS) that uses Bisulfite-seq rather than traditional sequencing, to identify disease associations at the genetic and epigenetic levels simultaneously. This could be especially useful given the large number of GWAS hits that appear to affect regulatory regions rather than gene coding regions. Bis-SNP and other Bisulfite-seq analysis tools will be important in the development of these exciting new technologies.</p>
</sec>
<sec><st><p>Materials and methods</p></st>
<sec><st><p>Local realignment, base quality recalibration and other BAM file preprocessing</p></st>
<p>Reads with mapping quality scores less than 30 and those mapped to multiple genomic regions are removed, as are PCR duplicates (optional). For paired-end reads, we remove read pairs that do not have the <monospace>ProperlyPaired</monospace> field set.</p>
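<p>As an illustration only, the read-level filters described above can be sketched in Python. The <monospace>Read</monospace> record and its field names are hypothetical stand-ins for the corresponding BAM fields and are not part of Bis-SNP; the threshold of 30 and the optional duplicate removal follow the text.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical, minimal view of one aligned read."""
    name: str
    mapq: int             # mapping quality score
    is_paired: bool       # read comes from a paired-end library
    is_proper_pair: bool  # ProperlyPaired field is set
    is_duplicate: bool    # marked as a PCR duplicate
    n_alignments: int     # number of genomic regions the read maps to


def keep_read(read, min_mapq=30, drop_duplicates=True):
    """Return True if the read passes the preprocessing filters."""
    if min_mapq > read.mapq:
        return False  # mapping quality below the threshold
    if read.n_alignments > 1:
        return False  # mapped to multiple genomic regions
    if drop_duplicates and read.is_duplicate:
        return False  # optional PCR duplicate removal
    if read.is_paired and not read.is_proper_pair:
        return False  # paired-end read without ProperlyPaired set
    return True


reads = [
    Read("ok", 42, True, True, False, 1),
    Read("low_mapq", 12, True, True, False, 1),
    Read("multi", 40, True, True, False, 3),
    Read("dup", 40, True, True, True, 1),
    Read("improper", 40, True, False, False, 1),
    Read("single_end", 30, False, False, False, 1),
]
kept = [r.name for r in reads if keep_read(r)]
```

<p>In this toy input only the reads named <monospace>ok</monospace> and <monospace>single_end</monospace> are kept; passing <monospace>drop_duplicates=False</monospace> would also keep <monospace>dup</monospace>.</p>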
<p>We use GATK to perform local multiple sequence realignment and base quality recalibration, mostly as described <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Because most bisulfite sequencing mapping tools (e.g. Bismark, BSMAP, MAQ) do not provide a correct CIGAR string in the BAM file for GATK's indel realignment, the CIGAR string is recalculated when necessary. We extend GATK's <monospace>RealignerTargetCreator</monospace> to count mismatches without counting thymine as a mismatch when the reference genome position is cytosine. After creating a potential indel interval, we realign using a modified version of GATK's <monospace>IndelRealigner</monospace>. PCR duplicate reads are marked after indel realignment.</p>
<p>For base quality recalibration, we modify the GATK algorithm to account for bisulfite conversion by extending the GATK <monospace>CountVariantWalker</monospace> and <monospace>TableRecalibrationWalker</monospace> classes. The algorithm first tabulates empirical mismatches to the reference at all loci not known to vary in the population (i.e., not in dbSNP build 135). These counts are categorized by their instrument-reported quality score (<it>R</it>) and position (cycle) within the read (<it>C</it>). In tabulating mismatches, we do not count thymine as a mismatch when the reference genome position is cytosine (on the second end of a paired-end read, we instead do not count adenine as a mismatch when the reference is guanine).</p>
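<p>The bisulfite-aware mismatch tabulation can be illustrated with a short Python sketch. It is not the GATK or Bis-SNP code: the function names are hypothetical, the input is assumed to have already been restricted to loci outside dbSNP, and the add-one smoothing used to turn counts into an empirical quality is our own assumption.</p>

```python
import math
from collections import defaultdict


def is_mismatch(read_base, ref_base, second_end=False):
    """Return True if the base call counts as an error for recalibration."""
    if read_base == ref_base:
        return False
    # Bisulfite conversion, not sequencing error: C read as T on the first end.
    if not second_end and ref_base == "C" and read_base == "T":
        return False
    # On the second end the same event appears as G read as A.
    if second_end and ref_base == "G" and read_base == "A":
        return False
    return True


def tabulate(observations):
    """Count mismatches per (reported quality R, cycle C) bin.

    observations: iterable of (R, C, read_base, ref_base, second_end) tuples,
    assumed to come only from loci that are not known variant sites.
    """
    counts = defaultdict(lambda: [0, 0])  # (R, C) -> [mismatches, total]
    for quality, cycle, read_base, ref_base, second_end in observations:
        cell = counts[(quality, cycle)]
        cell[0] += int(is_mismatch(read_base, ref_base, second_end))
        cell[1] += 1
    return counts


def empirical_quality(mismatches, total):
    """Phred-scaled empirical error rate with add-one smoothing (an assumption)."""
    rate = (mismatches + 1) / (total + 2)
    return -10.0 * math.log10(rate)


observations = [
    (30, 1, "T", "C", False),  # bisulfite conversion, not counted
    (30, 1, "A", "C", False),  # mismatch
    (30, 1, "C", "C", False),  # match
    (30, 1, "A", "G", True),   # conversion seen on the second end, not counted
    (30, 1, "A", "G", False),  # mismatch on the first end
]
counts = tabulate(observations)
mismatches, total = counts[(30, 1)]
recalibrated = empirical_quality(mismatches, total)
```

<p>In this toy bin, two of the five base calls count as errors, so the empirical quality is far below the reported score of 30. A real recalibration would of course use many more observations per bin.</p>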
<p>By default, only positions with a recalibrated Base Calling Quality Score of greater than 5 are used for SNP calling. This quality cutoff can be set using a command line parameter (see User Manual in Additional File <supplr sid="S3">3</supplr>).</p>
</sec>
<sec><st><p>BisSNP probabilistic model</p></st>
<p>We begin with the Bayesian likelihood model of GATK (<abbrgrp><abbr bid="B12">12</abbr></abbrgrp>), and make a number of bisulfite-specific adaptations. Assuming the underlying genome is diploid, we let <b>D </b>= (<it>D</it><sub>1</sub>, <it>D</it><sub>2</sub>, ..., <it>D<sub>r</sub></it>) represent the base calls at a particular genomic position <it>i </it>that is covered by <it>r </it>sequencing reads. We then calculate the posterior probability by (1) as in GATK:</p>
<p><display-formula id="M1"><m:math name="gb-2012-13-7-r61-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>P</m:mi>
   <m:mi>r</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi>G</m:mi>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi mathvariant="bold">D</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mi>&#960;</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi>G</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
         <m:mi>P</m:mi>
         <m:mi>r</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi mathvariant="bold">D</m:mi>
               <m:mo class="MathClass-rel">|</m:mo>
               <m:mi>G</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
      <m:mrow>
         <m:mi>P</m:mi>
         <m:mi>r</m:mi>
         <m:mrow>
            <m:mo class="MathClass-open">(</m:mo>
            <m:mrow>
               <m:mi mathvariant="bold">D</m:mi>
            </m:mrow>
            <m:mo class="MathClass-close">)</m:mo>
         </m:mrow>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math></display-formula></p>
<p>Here, <it>G </it>is the underlying diploid genotype, <it>AB</it>, with <it>A </it>and <it>B </it>being the two parental alleles. <it>&#960;</it>(<it>G</it>) is the prior probability of observing the given genotype, based on the genotype of the reference genome and population frequencies, as discussed in Table <tblr tid="T1">1</tblr> of the SOAPsnp paper <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. <it>Pr</it>(<b>D</b>) is defined as the sum over all possible genotypes &#8721;<it><sub>AB </sub>&#960;</it>(<it>AB</it>) <it>Pr </it>(<b>D</b>|<it>AB</it>), but it is the same for every genotype and can generally be ignored since we are concerned with likelihood ratios. We assume that each of the two alleles is equally likely to be sequenced, and calculate the overall likelihood of <b>D </b>as the product of the likelihoods of all individual reads (2), (3):</p>
<p><display-formula id="M2"><m:math name="gb-2012-13-7-r61-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>P</m:mi>
   <m:mi>r</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi mathvariant="bold">D</m:mi>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>G</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:munderover accentunder="false" accent="false">
      <m:mrow>
         <m:mo mathsize="big"> &#8719;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>j</m:mi>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mrow>
         <m:mi>r</m:mi>
      </m:mrow>
   </m:munderover>
   <m:mi>P</m:mi>
   <m:mi>r</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>G</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
</m:mrow>
</m:math></display-formula></p>
<p><display-formula id="M3"><m:math name="gb-2012-13-7-r61-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>P</m:mi>
   <m:mi>r</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>G</m:mi>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mi>A</m:mi>
         <m:mi>B</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mrow>
         <m:mn>2</m:mn>
      </m:mrow>
   </m:mfrac>
   <m:mi>P</m:mi>
   <m:mi>r</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>A</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-bin">+</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mrow>
         <m:mn>2</m:mn>
      </m:mrow>
   </m:mfrac>
   <m:mi>P</m:mi>
   <m:mi>r</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>B</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
</m:mrow>
</m:math></display-formula></p>
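<p>Equations (1) to (3) can be illustrated with a small Python sketch. Two simplifications are assumptions made here for brevity and are not the Bis-SNP implementation: the genotype prior is uniform rather than the reference- and population-based prior described above, and the per-allele term <monospace>allele_likelihood</monospace> is a plain sequencing-error model with no bisulfite conversion terms. All function names are hypothetical.</p>

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def allele_likelihood(base, allele, eps):
    """Toy Pr(D_j | allele): sequencing error only (an assumption)."""
    return 1.0 - eps if base == allele else eps / 3.0


def read_likelihood(base, eps, genotype):
    """Equation (3): both alleles are equally likely to be sequenced."""
    a, b = genotype
    return 0.5 * allele_likelihood(base, a, eps) + 0.5 * allele_likelihood(base, b, eps)


def genotype_posteriors(pileup, prior):
    """Equations (1) and (2) for every genotype in `prior`.

    pileup: list of (base call, error probability) pairs, one per read.
    prior:  dict mapping genotype tuples to pi(G).
    """
    joint = {}
    for genotype, pi in prior.items():
        likelihood = 1.0
        for base, eps in pileup:
            likelihood *= read_likelihood(base, eps, genotype)  # product over reads
        joint[genotype] = pi * likelihood
    evidence = sum(joint.values())  # Pr(D), identical for every genotype
    return {g: value / evidence for g, value in joint.items()}


genotypes = list(combinations_with_replacement(BASES, 2))  # 10 unordered diploid genotypes
flat_prior = {g: 1.0 / len(genotypes) for g in genotypes}  # uniform prior: an assumption
pileup = [("A", 0.01)] * 5 + [("G", 0.01)] * 5
posteriors = genotype_posteriors(pileup, flat_prior)
best = max(posteriors, key=posteriors.get)
```

<p>For this balanced pileup of five A and five G base calls, the heterozygous genotype A/G receives almost all of the posterior mass. With realistic read depths the product in Equation (2) becomes very small, so a practical implementation would normally sum log-likelihoods instead.</p>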
<p>The following steps are shown for single-end sequences. For paired end sequences, the first end is treated as described, but the second end is reverse complemented before performing these calculations (because the Illumina second end is the complementary strand of the same template as the first end). This changes G&gt;A bisulfite substitutions, which occur on the second end, to the actual C&gt;T substitutions present on the bisulfite-converted template. The recalibrated base quality scores are on a phred scale which represents the probability <it>&#949; </it>that the position is an error, which is used in the following calculation.</p>
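<p>The two conventions in this paragraph can be sketched as follows, with hypothetical helper names. The sketch assumes the standard phred definition, <it>&#949; </it>= 10<sup>-Q/10</sup>, and treats the second-end read as a plain string of base calls with one quality score per base.</p>

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a string of upper-case base calls."""
    return seq.translate(COMPLEMENT)[::-1]


def phred_to_error(quality):
    """Error probability implied by a phred-scaled quality score."""
    return 10.0 ** (-quality / 10.0)


def normalize_second_end(bases, quals):
    """Put a second-end read into first-end orientation.

    After this step a G>A bisulfite substitution on the second end appears as
    the corresponding C>T substitution; qualities are reversed to stay aligned
    with their bases.
    """
    return reverse_complement(bases), list(reversed(quals))


bases, quals = normalize_second_end("AACG", [30, 20, 10, 40])
errors = [phred_to_error(q) for q in quals]
```

<p>After normalization, first-end and second-end base calls can be passed through the same likelihood calculations, each with its own error probability.</p>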
<p>When the underlying allele is adenine (<monospace>a</monospace>) or thymine (<monospace>t</monospace>), bisulfite conversion does not apply and the probability estimation is straightforward, as shown for <monospace>t</monospace>:</p>
<p><display-formula id="M4"><m:math name="gb-2012-13-7-r61-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mtext>Pr</m:mtext>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>D</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">|</m:mo>
         <m:mi>B</m:mi>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">t</m:mtext>
         </m:mstyle>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfenced separators="" open="{" close="">
      <m:mrow>
         <m:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array">
            <m:mtr>
               <m:mtd class="array" columnalign="left">
                  <m:mfrac>
                     <m:mrow>
                        <m:msub>
                           <m:mrow>
                              <m:mi>&#949;</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>j</m:mi>
                           </m:mrow>
                        </m:msub>
                     </m:mrow>
                     <m:mrow>
                        <m:mn>3</m:mn>
                     </m:mrow>
                  </m:mfrac>
               </m:mtd>
               <m:mtd class="array" columnalign="left">
                  <m:mspace width="1em" class="quad"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                  </m:mstyle>
                  <m:mspace width="0.3em" class="thinspace"/>
                  <m:msub>
                     <m:mrow>
                        <m:mi>D</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">&#8800;</m:mo>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">t</m:mtext>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="left">
                  <m:mn>1</m:mn>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#949;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
               </m:mtd>
               <m:mtd class="array" columnalign="left">
                  <m:mspace width="1em" class="quad"/>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                  </m:mstyle>
                  <m:mspace width="0.3em" class="thinspace"/>
                  <m:msub>
                     <m:mrow>
                        <m:mi>D</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">=</m:mo>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">t</m:mtext>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="left"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math></display-formula></p>
<p>Here, <it>&#949;<sub>j </sub></it>is the probability of a sequencing or base calling error at position <it>j</it>, i.e. the probability that the true allele <it>B </it>is a <monospace>t</monospace>, but base call <it>D<sub>j </sub></it>is observed as an <monospace>a</monospace>, <monospace>c</monospace>, or <monospace>g</monospace>. The likelihood function for <monospace>a</monospace> is equivalent to that of Equation (4). When the underlying allele is a <monospace>c</monospace> or a <monospace>g</monospace>, however, the probabilities are strand-specific since bisulfite conversion only affects one strand in the directional Bisulfite-seq protocol (Figure <figr fid="F1">1</figr>). The probability of seeing a <monospace>t</monospace> in the read depends on the probability that the position is methylated (<it>&#946;</it>), as well as the bisulfite conversion efficiency (<it>&#945; </it>and <it>&#947;</it>). In principle, bisulfite treatment converts all unmethylated cytosines to thymine, but in practice it is not 100% efficient <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. The parameter <it>&#945; </it>is the estimated frequency of unmethylated cytosines that are not converted. It is typically taken from unmethylated spiked-in DNA <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> or from the mammalian mitochondrial sequences, which we have found to be almost completely unmethylated <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>; in the latter case, <it>&#945; </it>= <it>&#946;<sub>chrM</sub></it>. By default, <it>&#945; </it>is set to 0.0025 but can be specified by the user. We also include a <it>&#947; </it>parameter for <it>over-conversion</it>, i.e. the rate at which methylated cytosines are converted. Although this is not routinely measured in practice, it could be estimated by including an enzymatically methylated control DNA <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, or a sequencing library without bisulfite conversion. By default, <it>&#947; </it>is set to 0 but can be specified by the user.
The full likelihood calculation for cytosines is as follows:</p>
<p><display-formula id="M5"><m:math name="gb-2012-13-7-r61-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mtable class="gathered">
      <m:mtr>
         <m:mtd>
            <m:mi>P</m:mi>
            <m:mi>r</m:mi>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:msub>
                     <m:mrow>
                        <m:mi>D</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">|</m:mo>
                  <m:mi>B</m:mi>
                  <m:mo class="MathClass-rel">=</m:mo>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">c</m:mtext>
                  </m:mstyle>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mfenced separators="" open="{" close="">
               <m:mrow>
                  <m:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array">
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:mrow>
                              <m:mo class="MathClass-open">(</m:mo>
                              <m:mrow>
                                 <m:mn>1</m:mn>
                                 <m:mo class="MathClass-bin">-</m:mo>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>&#949;</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mo class="MathClass-close">)</m:mo>
                           </m:mrow>
                           <m:mrow>
                              <m:mo class="MathClass-open">[</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>&#946;</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                                 <m:mrow>
                                    <m:mo class="MathClass-open">(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo class="MathClass-bin">-</m:mo>
                                       <m:mi>&#947;</m:mi>
                                    </m:mrow>
                                    <m:mo class="MathClass-close">)</m:mo>
                                 </m:mrow>
                                 <m:mo class="MathClass-bin">+</m:mo>
                                 <m:mrow>
                                    <m:mo class="MathClass-open">(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo class="MathClass-bin">-</m:mo>
                                       <m:msub>
                                          <m:mrow>
                                             <m:mi>&#946;</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mo class="MathClass-close">)</m:mo>
                                 </m:mrow>
                                 <m:mi>&#945;</m:mi>
                              </m:mrow>
                              <m:mo class="MathClass-close">]</m:mo>
                           </m:mrow>
                        </m:mtd>
                        <m:mtd class="array" columnalign="center">
                           <m:mspace width="1em" class="quad"/>
                           <m:mstyle class="text">
                              <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                           </m:mstyle>
                           <m:mspace width="0.3em" class="thinspace"/>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>D</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-rel">=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mstyle class="text">
                                    <m:mtext class="textsf" mathvariant="sans-serif">c</m:mtext>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">+</m:mo>
                              </m:mrow>
                           </m:msup>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:mfrac>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>&#949;</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>3</m:mn>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo class="MathClass-bin">+</m:mo>
                           <m:mrow>
                              <m:mo class="MathClass-open">(</m:mo>
                              <m:mrow>
                                 <m:mn>1</m:mn>
                                 <m:mo class="MathClass-bin">-</m:mo>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>&#949;</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mo class="MathClass-close">)</m:mo>
                           </m:mrow>
                           <m:mrow>
                              <m:mo class="MathClass-open">[</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>&#946;</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                                 <m:mi>&#947;</m:mi>
                                 <m:mo class="MathClass-bin">+</m:mo>
                                 <m:mrow>
                                    <m:mo class="MathClass-open">(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo class="MathClass-bin">-</m:mo>
                                       <m:msub>
                                          <m:mrow>
                                             <m:mi>&#946;</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mo class="MathClass-close">)</m:mo>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mo class="MathClass-open">(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo class="MathClass-bin">-</m:mo>
                                       <m:mi>&#945;</m:mi>
                                    </m:mrow>
                                    <m:mo class="MathClass-close">)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo class="MathClass-close">]</m:mo>
                           </m:mrow>
                        </m:mtd>
                        <m:mtd class="array" columnalign="center">
                           <m:mspace width="1em" class="quad"/>
                           <m:mstyle class="text">
                              <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                           </m:mstyle>
                           <m:mspace width="0.3em" class="thinspace"/>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>D</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-rel">=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mstyle class="text">
                                    <m:mtext class="textsf" mathvariant="sans-serif">t</m:mtext>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">+</m:mo>
                              </m:mrow>
                           </m:msup>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:mn>1</m:mn>
                           <m:mo class="MathClass-bin">-</m:mo>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>&#949;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mtd>
                        <m:mtd class="array" columnalign="center">
                           <m:mspace width="1em" class="quad"/>
                           <m:mstyle class="text">
                              <m:mtext class="textsf" mathvariant="sans-serif">if</m:mtext>
                           </m:mstyle>
                           <m:mspace width="0.3em" class="thinspace"/>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>D</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo class="MathClass-rel">=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mstyle class="text">
                                    <m:mtext class="textsf" mathvariant="sans-serif">c</m:mtext>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo class="MathClass-bin">-</m:mo>
                              </m:mrow>
                           </m:msup>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center">
                           <m:mfrac>
                              <m:mrow>
                                 <m:msub>
                                    <m:mrow>
                                       <m:mi>&#949;</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>3</m:mn>
                              </m:mrow>
                           </m:mfrac>
                        </m:mtd>
                        <m:mtd class="array" columnalign="center">
                           <m:mspace width="1em" class="quad"/>
                           <m:mstyle class="text">
                              <m:mtext class="textsf" mathvariant="sans-serif">otherwise</m:mtext>
                           </m:mstyle>
                        </m:mtd>
                     </m:mtr>
                     <m:mtr>
                        <m:mtd class="array" columnalign="center"/>
                     </m:mtr>
                  </m:mtable>
               </m:mrow>
            </m:mfenced>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:msub>
               <m:mrow>
                  <m:mi>&#946;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mfenced separators="" open="(" close=")">
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">1</m:mtext>
                  </m:mstyle>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:mi>&#947;</m:mi>
               </m:mrow>
            </m:mfenced>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">methylated&#160;and&#160;</m:mtext>
            </m:mstyle>
            <m:mfenced separators="" open="(" close=")">
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">properly</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:mfenced>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">&#160;not&#160;converted</m:mtext>
            </m:mstyle>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:msub>
               <m:mrow>
                  <m:mi>&#946;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
            <m:mi>&#947;</m:mi>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">methylated&#160;and&#160;</m:mtext>
            </m:mstyle>
            <m:mfenced separators="" open="(" close=")">
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">improperly</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:mfenced>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">&#160;converted</m:mtext>
            </m:mstyle>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mn>1</m:mn>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#946;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
            <m:mi>&#945;</m:mi>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">unmethylated&#160;and&#160;</m:mtext>
            </m:mstyle>
            <m:mfenced separators="" open="(" close=")">
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">improperly</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:mfenced>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">&#160;not&#160;converted</m:mtext>
            </m:mstyle>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">1</m:mtext>
                  </m:mstyle>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#946;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>j</m:mi>
                     </m:mrow>
                  </m:msub>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
            <m:mrow>
               <m:mo class="MathClass-open">(</m:mo>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">1</m:mtext>
                  </m:mstyle>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:mi>&#945;</m:mi>
               </m:mrow>
               <m:mo class="MathClass-close">)</m:mo>
            </m:mrow>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">unmethylated&#160;and&#160;</m:mtext>
            </m:mstyle>
            <m:mfenced separators="" open="(" close=")">
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">properly</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:mfenced>
            <m:mstyle class="text">
               <m:mtext class="textsf" mathvariant="sans-serif">&#160;converted</m:mtext>
            </m:mstyle>
         </m:mtd>
      </m:mtr>
      <m:mtr>
         <m:mtd/>
      </m:mtr>
   </m:mtable>
</m:mrow>
</m:math></display-formula></p>
<p>The key to these calculations is that reads on the same strand as the inferred cytosine allele (denoted with +) are treated differently from reads on the opposite strand (denoted with -). As expected based on the example in Figure <figr fid="F1">1</figr>, a true allele of <it>B </it>= <monospace>c</monospace> results in a very high probability of seeing a <monospace>t</monospace><sup>+ </sup>(a '<monospace>t</monospace>' read on the C-strand), but a very low probability of seeing a <monospace>t</monospace><sup>- </sup>(an '<monospace>a</monospace>' read on the G-strand). The genotype <it>G<sub>best </sub></it>with the highest posterior probability <it>Pr</it>(<it>G</it>|<b>D</b>) is chosen, and the final output score is the log odds ratio between the best (<it>G<sub>best</sub></it>) and the second-best (<it>G<sub>nextbest</sub></it>) genotypes, as in Equation (6). In practice, we optimize execution by evaluating only the subset of the 10 possible diploid genotypes that are consistent with the observed reads.</p>
<p><display-formula id="M6"><m:math name="gb-2012-13-7-r61-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>s</m:mi>
   <m:mi>c</m:mi>
   <m:mi>o</m:mi>
   <m:mi>r</m:mi>
   <m:mi>e</m:mi>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mi>l</m:mi>
   <m:mi>o</m:mi>
   <m:mi>g</m:mi>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mfrac>
            <m:mrow>
               <m:mi>P</m:mi>
               <m:mi>r</m:mi>
               <m:mrow>
                  <m:mo class="MathClass-open">(</m:mo>
                  <m:mrow>
                     <m:msub>
                        <m:mrow>
                           <m:mi>G</m:mi>
                        </m:mrow>
                        <m:mrow>
                           <m:mi>b</m:mi>
                           <m:mi>e</m:mi>
                           <m:mi>s</m:mi>
                           <m:mi>t</m:mi>
                        </m:mrow>
                     </m:msub>
                     <m:mo class="MathClass-rel">|</m:mo>
                     <m:mi mathvariant="bold">D</m:mi>
                  </m:mrow>
                  <m:mo class="MathClass-close">)</m:mo>
               </m:mrow>
            </m:mrow>
            <m:mrow>
               <m:mi>P</m:mi>
               <m:mi>r</m:mi>
               <m:mrow>
                  <m:mo class="MathClass-open">(</m:mo>
                  <m:mrow>
                     <m:msub>
                        <m:mrow>
                           <m:mi>G</m:mi>
                        </m:mrow>
                        <m:mrow>
                           <m:mi>n</m:mi>
                           <m:mi>e</m:mi>
                           <m:mi>x</m:mi>
                           <m:mi>t</m:mi>
                           <m:mi>b</m:mi>
                           <m:mi>e</m:mi>
                           <m:mi>s</m:mi>
                           <m:mi>t</m:mi>
                        </m:mrow>
                     </m:msub>
                     <m:mo class="MathClass-rel">|</m:mo>
                     <m:mi mathvariant="bold">D</m:mi>
                  </m:mrow>
                  <m:mo class="MathClass-close">)</m:mo>
               </m:mrow>
            </m:mrow>
         </m:mfrac>
      </m:mrow>
   </m:mrow>
</m:mrow>
</m:math></display-formula></p>
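<p>As a minimal sketch of Equation (6), the following Python fragment ranks a set of genotype posteriors and reports the log odds between the two most probable genotypes. The function name <monospace>genotype_score</monospace>, the dictionary input, and the base-10 logarithm are assumptions made for illustration (the text does not state the base of the logarithm); this is not the Bis-SNP implementation.</p>

```python
import math


def genotype_score(posteriors):
    """Rank genotypes by posterior and score the best against the next best.

    posteriors maps a genotype label (e.g. "CT") to Pr(G|D).
    Returns (best, next_best, score), where score is the log odds ratio.
    The base-10 logarithm is an assumption of this sketch.
    """
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    if 2 > len(ranked):
        raise ValueError("need at least two candidate genotypes")
    (best, p_best), (next_best, p_next) = ranked[0], ranked[1]
    if p_next == 0:
        # The runner-up is impossible given the data: unbounded confidence.
        return best, next_best, math.inf
    return best, next_best, math.log10(p_best / p_next)
```

<p>For example, posteriors of 0.90, 0.09, and 0.01 for three candidate genotypes give a score of about 1, since the best genotype is ten times more probable than the runner-up.</p>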
<p>The bisulfite efficiency parameters <it>&#945; </it>and <it>&#947; </it>typically vary by less than 1%, so the critical parameter in Equation (5) is the methylation rate <it>&#946;</it>. Since this rate varies by genomic context, organism, and even cell type, we allow the user to specify the possible contexts as a set of <it>n </it>nucleotide sequences given by their IUPAC degeneracy codes (for instance, <it>CH </it>represents <it>CC</it>, <it>CT</it>, or <it>CA</it>). In mammalian genomes, where typically only the single base 3' of the cytosine is considered relevant, the user would specify CG and CH (the <monospace>Bis-SNP</monospace> default). For <it>Arabidopsis</it>, one might specify CG, CHH, and CHG. Any number of 5' and 3' bases may be specified in order to accommodate the full range of Bisulfite-seq assays. For instance, a CCGG pattern could be specified for the MspI restriction sites inherent to the RRBS protocol (<abbrgrp><abbr bid="B41">41</abbr></abbrgrp>).</p>
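<p>The way an IUPAC context pattern stands for a set of concrete sequences can be illustrated with a short Python sketch. The helper names <monospace>expand_context</monospace> and <monospace>matches_context</monospace> are hypothetical and are not part of Bis-SNP; only the standard IUPAC nucleotide codes are taken as given.</p>

```python
from itertools import product

# Standard IUPAC nucleotide degeneracy codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_context(pattern):
    """List every concrete sequence represented by an IUPAC pattern."""
    choices = (IUPAC[code] for code in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]


def matches_context(sequence, pattern):
    """Return True if a concrete sequence is an instance of the pattern."""
    if len(sequence) != len(pattern):
        return False
    pairs = zip(sequence.upper(), pattern.upper())
    return all(base in IUPAC[code] for base, code in pairs)
```

<p>Here <monospace>expand_context("CH")</monospace> yields CA, CC, and CT, and the three-base pattern CHH expands to nine sequences.</p>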
<p>One methylation output file (BED6+2 format) is created for each cytosine context specified by the user. For each cytosine determined to have the particular sequence context, the percent methylated (the number of C reads on the C-strand divided by the number of C or T reads on the C-strand) is output as the score field. To aid in statistical analysis, a second field contains the total number of C/T reads.</p>
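<p>The percent-methylated value described above can be sketched as follows. The function name <monospace>percent_methylated</monospace> and the representation of the C-strand base calls as a plain string are assumptions for illustration; the layout of the BED6+2 columns is not reproduced here.</p>

```python
def percent_methylated(c_strand_bases):
    """Compute percent methylation from base calls on the C-strand.

    c_strand_bases is a string of base calls observed at one cytosine,
    restricted to reads on the C-strand. Returns a tuple of
    (percent methylated, number of C or T reads), or None when no
    C or T reads are available.
    """
    calls = c_strand_bases.upper()
    c_reads = calls.count("C")  # unconverted, interpreted as methylated
    t_reads = calls.count("T")  # converted, interpreted as unmethylated
    total = c_reads + t_reads
    if total == 0:
        return None
    return 100.0 * c_reads / total, total
```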
</sec>
<sec><st><p>Five-prime bisulfite non-conversion filter</p></st>
<p>Non-conversion of unmethylated Cs is known to preferentially affect the 5' end of Illumina-generated reads, most likely driven by the re-annealing of sequences adjacent to the fully methylated sequence adapters during bisulfite conversion. We control for this using a 5' non-conversion filter as implemented in our earlier work <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. For each read, we walk along the read from 5' to 3' and remove any Cs on the C-strand until we reach the first reference C that is converted to a T. By applying this filter, bisulfite conversion in early cycles is brought to levels very similar to those of late cycles, thus removing a potential source of methylation bias (data not shown). Note that this filter should be turned off for RRBS data, which gleans most of its methylation data from the first cycle (see user manual).</p>
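<p>The walk along each read can be sketched in Python as below. This is an illustrative reading of the filter, not the Bis-SNP code: the function name is hypothetical, the read and reference are assumed to be gap-free, equal-length strings oriented 5' to 3' on the C-strand, and only read Cs at reference C positions are considered.</p>

```python
def five_prime_nonconversion_filter(read, reference):
    """Return the read indices of Cs to discard at the 5' end.

    Walks the read from 5' to 3'. Every unconverted C seen at a
    reference C position is discarded until the first reference C
    that was converted to T is reached; later Cs are kept.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must have equal length")
    discarded = []
    pairs = zip(read.upper(), reference.upper())
    for index, (read_base, ref_base) in enumerate(pairs):
        if ref_base != "C":
            continue
        if read_base == "T":
            break  # first converted cytosine: stop filtering
        if read_base == "C":
            discarded.append(index)
    return discarded
```

<p>With reference <monospace>ACGCTCAC</monospace> and read <monospace>ACGCTTAC</monospace>, the Cs at indices 1 and 3 are discarded, the T at index 5 ends the filter, and the C at index 7 is retained for methylation calling.</p>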
</sec>
<sec><st><p>Pre-SNP calling quality filters</p></st>
<p>Using the approach of GATK, we apply additional quality filters before SNP calling to avoid known sources of false positives. SNPs found in clusters (two or more within a ten-base-pair window) are filtered out. SNPs with coverage depth above 120, a Strand Bias (SB) score greater than -0.02, or a Quality by Depth (QD) score less than 1.0 are also filtered out. All of these parameters are configurable (see User Manual). If the BAM file contains Mapping Quality scores, suspicious regions are filtered out when more than 10% of aligned reads (minimum of 40 reads) have a mapping quality of 0.</p>
<p>Bisulfite sequencing can have higher strand biases since high bisulfite concentration can lead to DNA degradation when the depurination step causes random strand breaks <abbrgrp><abbr bid="B42">42</abbr><abbr bid="B43">43</abbr></abbrgrp>. We calculated strand bias score as in GATK, but bisulfite converted reads have an apparent strand bias which is higher than the actual strand bias, since the G-strand contributes more than the C-strand at cytosines. For this reason, we used a substantially less stringent strand bias cutoff (-0.02) than the GATK default.</p>
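<p>The cluster, depth, strand bias, and quality-by-depth filters of this section can be sketched as a single pass over candidate SNPs. The <monospace>Snp</monospace> record, the function names, and the reading of "within a ten-base-pair window" as a position difference of fewer than ten bases are assumptions of this sketch; the mapping-quality filter is omitted.</p>

```python
from dataclasses import dataclass


@dataclass
class Snp:
    chrom: str
    pos: int
    depth: int
    sb: float  # strand bias score
    qd: float  # quality by depth


def clustered(snps, window=10):
    """Return (chrom, pos) keys of SNPs that share a window with another SNP."""
    flagged = set()
    ordered = sorted(snps, key=lambda s: (s.chrom, s.pos))
    for left, right in zip(ordered, ordered[1:]):
        if left.chrom == right.chrom and window > right.pos - left.pos:
            flagged.add((left.chrom, left.pos))
            flagged.add((right.chrom, right.pos))
    return flagged


def passing_snps(snps, max_depth=120, max_sb=-0.02, min_qd=1.0, window=10):
    """Keep SNPs that survive the cluster, depth, SB, and QD filters."""
    in_cluster = clustered(snps, window)
    kept = []
    for snp in snps:
        fails = (
            (snp.chrom, snp.pos) in in_cluster
            or snp.depth > max_depth
            or snp.sb > max_sb
            or min_qd > snp.qd
        )
        if not fails:
            kept.append(snp)
    return kept
```

<p>The default arguments mirror the thresholds quoted in the text, and each can be overridden in the same way that the corresponding Bis-SNP parameters are configurable.</p>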
</sec>
<sec><st><p>Downsampling coverage</p></st>
<p>We downsampled the human colon mucosa Bisulfite-seq dataset to different mean coverages using GATK, which randomly picks <it>z </it>reads at each individual nucleotide locus. The following formula is used, where <it>N </it>is the mean coverage of the total dataset before downsampling (32&#215; in this case), <it>n </it>is the desired downsampling coverage, and <it>m </it>is the actual coverage at the particular locus.</p>
<p><display-formula id="M7"><m:math name="gb-2012-13-7-r61-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>z</m:mi>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mi>m</m:mi>
         <m:mo class="MathClass-bin">*</m:mo>
         <m:mi>n</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>N</m:mi>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math></display-formula></p>
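<p>A small Python sketch of this per-locus downsampling follows. The function names are hypothetical, and rounding <it>z </it>to the nearest integer is an assumption, since the formula itself can yield a fractional value; the actual sampling in our analysis was performed by GATK.</p>

```python
import random


def reads_to_keep(m, n, N):
    """Number of reads z to retain at a locus with coverage m.

    n is the desired mean coverage and N the original mean coverage.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(m * n / N)


def downsample_locus(reads, n, N, rng=None):
    """Randomly pick z reads from the reads covering one locus."""
    rng = rng or random.Random()
    z = min(len(reads), reads_to_keep(len(reads), n, N))
    return rng.sample(reads, z)
```

<p>For a locus covered by 40 reads in the 32&#215; dataset, downsampling to a mean of 8&#215; retains 40 * 8 / 32 = 10 reads.</p>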
</sec>
<sec><st><p>External tools used for comparison</p></st>
<sec><st><p>K-allele method</p></st>
<p>The K-allele method was used to identify heterozygous SNPs as a generalization of described methods <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B30">30</abbr></abbrgrp>, both of which count the number of alternate alleles present and exclude C/T SNPs. For reference cytosine positions, we only use counts from the <it>G-strand</it>, while at other positions we combine the two strands to get read counts. After these filters, we use a <it>K </it>cutoff which can vary from 0-10 and apply the <it>K</it>-allele threshold as follows. For positions with <it>n </it>passing reads where <it>n </it>is less than 10, we require that each of the two alleles have at least <it>K </it>reads. For positions where <it>n </it>is greater than 10, we require at least <inline-formula><m:math name="gb-2012-13-7-r61-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi>n</m:mi>
<m:mfrac>
   <m:mrow>
      <m:mi>K</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mn>10</m:mn>
   </m:mrow>
</m:mfrac>
</m:math></inline-formula> reads. For reference, the Hudson Alpha group <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> used a set definition <it>K </it>of 7 reads and at least 10%, and excluded all C/T SNPs. The UCLA group <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> specified that the allele with the lower read count had to contain at least 40% of reads, and excluded C/T reads.</p>
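<p>The <it>K</it>-allele threshold can be sketched as follows. The function name and the dictionary of allele counts are hypothetical, and the counts are assumed to have already passed the strand and C/T filters described above. Because <it>nK</it>/10 is smaller than <it>K </it>below 10 reads and larger above, the two rules collapse into a single maximum, which also covers the case of exactly 10 reads.</p>

```python
def is_heterozygous(allele_counts, k):
    """Apply the K-allele threshold to per-allele read counts at one position.

    allele_counts maps each observed allele to its number of passing reads.
    Both of the two most frequent alleles must reach the threshold:
    K reads at low coverage, or n * K / 10 reads at higher coverage.
    """
    counts = sorted((c for c in allele_counts.values() if c > 0), reverse=True)
    top_two = counts[:2]
    if len(top_two) != 2:
        return False  # only one allele observed
    n = sum(counts)
    threshold = max(k, n * k / 10)
    return min(top_two) >= threshold
```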
</sec>
<sec><st><p>bisReadMapper</p></st>
<p>We downloaded <monospace>bisReadMapper</monospace> version 1 <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. We first used <monospace>genomePrep.pl</monospace> to preprocess the reference genome and extract the cytosine positions in each chromosome. The built-in read mapper could not handle our large BAM file, so we circumvented the mapping step and used the BAM files directly as input. This is not a standard part of the bisReadMapper package, and it required us to divide our BAM alignment files to separate reads aligning to the forward strand of the reference genome from those aligning to the reverse strand. We used the following <monospace>bisReadMapper</monospace> parameters: <monospace>allC=1; length=75; snp=dbsnp135.rod; alignMode=S; qualBase=33; trim3=0; trim5=0; refDir=/path/to/GenomePreparationProcessedDir/</monospace>.</p>
</sec>
<sec><st><p>Shoemaker</p></st>
<p>The Shoemaker <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> method was implemented as described in their supplemental materials with clarifications from the author. The reads are handled differently based on the ratio of C to T nucleotides within the read and the ratio of G to A nucleotides (if the C to T ratio was higher, the read was considered a bisulfite-converted C-strand read; otherwise it was considered a complementary read from the 2nd end and was reverse complemented). All reads are then demethylated <it>in silico </it>(Cs converted to Ts). Input reads are filtered by their criteria: (1) Base calls at the examined SNP site and three flanking positions on either side needed to have a minimum Base Quality score of 15. (2) If a certain base was present in more than 20% of reads on one strand, its reverse complement needed to be present in at least 20% of the reads on the opposing strand. Only positions passing these two criteria were analyzed. Base Quality scores were used to weight the nucleotide count contributions to the nucleotide frequency matrix. This matrix was normalized and multiplied by the read count to get the final nucleotide count matrix at each location (normalized and weighted A, C, G, T counts at each locus). The Fisher exact test was applied to each nucleotide in each of the alleles (e.g. nucleotide number of G vs. nucleotide number of not G, expected nucleotide number of G vs. expected nucleotide number of not G). The two p-values, one for each allele, were multiplied together for each of the ten possible genotypes and then normalized. SNPs were selected when (1) the best genotype was 10 times more likely than the next most likely genotype, (2) the SNP was reported in dbSNP, and (3) the position had at least 10&#215; read depth.</p>
</sec>
<sec><st><p>Bismark</p></st>
<p>We downloaded Bismark-0.50 <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. We converted our input BAM file to SAM format and ran <monospace>genome_methylation_bismark2bedGraph.pl</monospace> to extract cytosines. Default settings were used.</p>
</sec>
<sec><st><p>Berman2012</p></st>
<p>We implemented a generalized version of the method described in our earlier work <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. We only included reference cytosine positions that had at least 3 overlapping C or T reads. We required at least <it>k</it>% of reads on the C-strand to be C or T, and <it>k</it>% of the reads on the G-strand to be G. The default setting (used in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and shown as an orange rectangle in Figure <figr fid="F3">3</figr>) was <it>k </it>= 10%.</p>
</sec>
</sec>
<sec><st><p>Datasets used for whole-genome comparisons</p></st>
<sec><st><p>OTB-colon</p></st>
<p>75 bp Single End Whole-Genome Bisulfite-Seq data from <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> was generated using Illumina GAIIx sequencing (available at dbGap:phs000385). The sample was normal adjacent colon mucosa from a male colon cancer patient.</p>
</sec>
<sec><st><p>TCGA-lung and TCGA-breast</p></st>
<p>100 bp Paired End Whole Genome Bisulfite-Seq (WGBS) data were generated at USC by the TCGA (The Cancer Genome Atlas) USC-JHU Epigenome Characterization Center. The data are unpublished, but available for download via the UCSC Cancer Genomics Hub (CG-Hub <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>). The lung normal sample is adjacent tissue from case TCGA-60-2722 (data available in CG-Hub analysis ID 964a8130-d061-472f-9839-9c1f07b24205), and the breast normal sample is adjacent tissue from case TCGA-A7-A0CE (CG-Hub analysis ID 279507dd-4c62-4975-877d-5cfebd2e7c6f).</p>
</sec>
<sec><st><p>Mouse-F1i and Mouse-F1r</p></st>
<p>One hundred-base-pair paired-end sequence datasets from two independent mouse samples were used <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. We downloaded alignments from the original publication (GEO accessions GSM753569 and GSM753570), which were performed using Novoalign. High-confidence genotypes were available for both parental strains via the Mouse Genome Database. We inferred high-confidence genotypes for the progeny only when each parent was homozygous at the particular position.</p>
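<p>The inference of progeny genotypes from homozygous parents can be sketched as below. The function name and the two-letter genotype strings are hypothetical conventions chosen for illustration, and alleles are sorted alphabetically only to give a canonical label.</p>

```python
def infer_f1_genotype(maternal, paternal):
    """Infer an F1 genotype from two parental diploid genotypes.

    Each genotype is a two-letter string such as "AA" or "AG".
    A genotype is returned only when both parents are homozygous,
    in which case the progeny carries one allele from each parent.
    """
    if len(set(maternal)) != 1 or len(set(paternal)) != 1:
        return None  # at least one parent is heterozygous: not inferred
    return "".join(sorted(maternal[0] + paternal[0]))
```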
</sec>
</sec>
</sec>
<sec><st><p>Abbreviations</p></st>
<p>CpG: dinucleotide sequence consisting of a cytosine followed by a guanine; CpH: cytosine followed by an H nucleotide (H is one of C, A, or T); SNP: Single-Nucleotide Polymorphism; WGBS: Whole-Genome Bisulfite-Seq; RRBS: Reduced Representation Bisulfite Sequencing; BSPP: Bisulfite Padlock Probes; ENCODE: ENCyclopedia Of DNA Elements; TCGA: The Cancer Genome Atlas; GATK: Genome Analysis Toolkit; VCF: Variant Call Format; FDR: False Discovery Rate; IUPAC: International Union of Pure and Applied Chemistry; GWAS: Genome-Wide Association Study; BAM: Binary version of the Sequence Alignment/Map (SAM) format; SB: Strand Bias; QD: Quality by Depth.</p>
</sec>
<sec><st><p>Competing interests</p></st>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><st><p>Authors' contributions</p></st>
<p>YL, PWL, and BPB conceived and designed the study. YL and BPB conceived the statistical approach with input from KDS. YL implemented Bis-SNP and all other computational tools. BPB and YL wrote the manuscript, with input from KDS and PWL. All authors have read and approved the manuscript for publication.</p>
</sec>
</bdy>
<bm><ack>
<sec><st><p>Acknowledgements</p></st>
<p>Support to YL, PWL, and BPB was provided by NIH grant number U24CA143882. We acknowledge our colleagues at the USC Epigenome Center for useful discussions and suggestions. High performance computing support was provided by the USC High Performance Computing Center <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>. We wish to thank Robert Shoemaker, Dinh Diep, Kun Zhang, and Felix Krueger for clarifications and assistance with their software tools.</p>
</sec></ack>
<refgrp><bibl id="B1"><title><p>Principles and challenges of genomewide DNA methylation analysis</p></title><aug><au><snm>Laird</snm><fnm>PW</fnm></au></aug><source>Nat Rev Genet</source><pubdate>2010</pubdate><volume>11</volume><fpage>191</fpage><lpage>203</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">20125086</pubid></xrefbib></bibl><bibl id="B2"><title><p>Genome-scale DNA methylation maps of pluripotent and differentiated cells</p></title><aug><au><snm>Meissner</snm><fnm>A</fnm></au><au><snm>Mikkelsen</snm><fnm>TS</fnm></au><au><snm>Gu</snm><fnm>H</fnm></au><au><snm>Wernig</snm><fnm>M</fnm></au><au><snm>Hanna</snm><fnm>J</fnm></au><au><snm>Sivachenko</snm><fnm>A</fnm></au><au><snm>Zhang</snm><fnm>X</fnm></au><au><snm>Bernstein</snm><fnm>BE</fnm></au><au><snm>Nusbaum</snm><fnm>C</fnm></au><au><snm>Jaffe</snm><fnm>DB</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Jaenisch</snm><fnm>R</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au></aug><source>Nature</source><pubdate>2008</pubdate><volume>454</volume><fpage>766</fpage><lpage>70</lpage><xrefbib><pubidlist><pubid idtype="pmcid">2896277</pubid><pubid idtype="pmpid" link="fulltext">18600261</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Library-free methylation sequencing with bisulfite padlock probes</p></title><aug><au><snm>Diep</snm><fnm>D</fnm></au><au><snm>Plongthongkum</snm><fnm>N</fnm></au><au><snm>Gore</snm><fnm>A</fnm></au><au><snm>Fung</snm><fnm>HL</fnm></au><au><snm>Shoemaker</snm><fnm>R</fnm></au><au><snm>Zhang</snm><fnm>K</fnm></au></aug><source>Nat Methods</source><pubdate>2012</pubdate><volume>9</volume><fpage>270</fpage><lpage>2</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth.1871</pubid><pubid idtype="pmcid">3461232</pubid><pubid idtype="pmpid" link="fulltext">22306810</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Human DNA methylomes at base resolution show widespread epigenomic 
differences</p></title><aug><au><snm>Lister</snm><fnm>R</fnm></au><au><snm>Pelizzola</snm><fnm>M</fnm></au><au><snm>Dowen</snm><fnm>RH</fnm></au><au><snm>Hawkins</snm><fnm>RD</fnm></au><au><snm>Hon</snm><fnm>G</fnm></au><au><snm>Tonti-Filippini</snm><fnm>J</fnm></au><au><snm>Nery</snm><fnm>JR</fnm></au><au><snm>Lee</snm><fnm>L</fnm></au><au><snm>Ye</snm><fnm>Z</fnm></au><au><snm>Ngo</snm><fnm>QM</fnm></au><au><snm>Edsall</snm><fnm>L</fnm></au><au><snm>Antosiewicz-Bourget</snm><fnm>J</fnm></au><au><snm>Stewart</snm><fnm>R</fnm></au><au><snm>Ruotti</snm><fnm>V</fnm></au><au><snm>Millar</snm><fnm>AH</fnm></au><au><snm>Thomson</snm><fnm>JA</fnm></au><au><snm>Ren</snm><fnm>B</fnm></au><au><snm>Ecker</snm><fnm>JR</fnm></au></aug><source>Nature</source><pubdate>2009</pubdate><volume>462</volume><fpage>315</fpage><lpage>22</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature08514</pubid><pubid idtype="pmcid">2857523</pubid><pubid idtype="pmpid" link="fulltext">19829295</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Increased methylation variation in epigenetic domains across cancer types</p></title><aug><au><snm>Hansen</snm><fnm>KD</fnm></au><au><snm>Timp</snm><fnm>W</fnm></au><au><snm>Bravo</snm><fnm>HC</fnm></au><au><snm>Sabunciyan</snm><fnm>S</fnm></au><au><snm>Langmead</snm><fnm>B</fnm></au><au><snm>McDonald</snm><fnm>OG</fnm></au><au><snm>Wen</snm><fnm>B</fnm></au><au><snm>Wu</snm><fnm>H</fnm></au><au><snm>Liu</snm><fnm>Y</fnm></au><au><snm>Diep</snm><fnm>D</fnm></au><au><snm>Briem</snm><fnm>E</fnm></au><au><snm>Zhang</snm><fnm>K</fnm></au><au><snm>Irizarry</snm><fnm>RA</fnm></au><au><snm>Feinberg</snm><fnm>AP</fnm></au></aug><source>Nat Genet</source><pubdate>2011</pubdate><volume>43</volume><fpage>768</fpage><lpage>75</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng.865</pubid><pubid idtype="pmcid">3145050</pubid><pubid idtype="pmpid" link="fulltext">21706001</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Regions of focal 
DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated domains</p></title><aug><au><snm>Berman</snm><fnm>BP</fnm></au><au><snm>Weisenberger</snm><fnm>DJ</fnm></au><au><snm>Aman</snm><fnm>JF</fnm></au><au><snm>Hinoue</snm><fnm>T</fnm></au><au><snm>Ramjan</snm><fnm>Z</fnm></au><au><snm>Liu</snm><fnm>Y</fnm></au><au><snm>Noushmehr</snm><fnm>H</fnm></au><au><snm>Lange</snm><fnm>CPE</fnm></au><au><snm>van Dijk</snm><fnm>CM</fnm></au><au><snm>Tollenaar</snm><fnm>RAEM</fnm></au><au><snm>Van Den Berg</snm><fnm>D</fnm></au><au><snm>Laird</snm><fnm>PW</fnm></au></aug><source>Nat Genet</source><pubdate>2012</pubdate><volume>44</volume><fpage>40</fpage><lpage>6</lpage></bibl><bibl id="B7"><title><p>Ultra-low-input, tagmentation-based whole genome bisulfite sequencing</p></title><aug><au><snm>Adey</snm><fnm>A</fnm></au><au><snm>Shendure</snm><fnm>J</fnm></au></aug><source>Genome Res</source><pubdate>2012</pubdate></bibl><bibl id="B8"><title><p>Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution</p></title><aug><au><snm>Gu</snm><fnm>H</fnm></au><au><snm>Bock</snm><fnm>C</fnm></au><au><snm>Mikkelsen</snm><fnm>TS</fnm></au><au><snm>J&#228;ger</snm><fnm>N</fnm></au><au><snm>Smith</snm><fnm>ZD</fnm></au><au><snm>Tomazou</snm><fnm>E</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au><au><snm>Meissner</snm><fnm>A</fnm></au></aug><source>Nat Methods</source><pubdate>2010</pubdate><volume>7</volume><fpage>133</fpage><lpage>6</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth.1414</pubid><pubid idtype="pmcid">2860480</pubid><pubid idtype="pmpid" link="fulltext">20062050</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>The ENCODE (ENCyclopedia Of DNA Elements) Project</p></title><aug><au><cnm>ENCODE Project 
Consortium</cnm></au></aug><source>Science</source><pubdate>2004</pubdate><volume>306</volume><fpage>636</fpage><lpage>40</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">15499007</pubid></xrefbib></bibl><bibl id="B10"><title><p>DNA methylome analysis using short bisulfite sequencing data</p></title><aug><au><snm>Krueger</snm><fnm>F</fnm></au><au><snm>Kreck</snm><fnm>B</fnm></au><au><snm>Franke</snm><fnm>A</fnm></au><au><snm>Andrews</snm><fnm>SR</fnm></au></aug><source>Nat Methods</source><pubdate>2012</pubdate><volume>9</volume><fpage>145</fpage><lpage>51</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth.1828</pubid><pubid idtype="pmpid" link="fulltext">22290186</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>The Sequence Alignment/Map format and SAMtools</p></title><aug><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Handsaker</snm><fnm>B</fnm></au><au><snm>Wysoker</snm><fnm>A</fnm></au><au><snm>Fennell</snm><fnm>T</fnm></au><au><snm>Ruan</snm><fnm>J</fnm></au><au><snm>Homer</snm><fnm>N</fnm></au><au><snm>Marth</snm><fnm>G</fnm></au><au><snm>Abecasis</snm><fnm>G</fnm></au><au><snm>Durbin</snm><fnm>R</fnm></au><au><cnm>1000 Genome Project Data Processing Subgroup</cnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><fpage>2078</fpage><lpage>9</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btp352</pubid><pubid idtype="pmcid">2723002</pubid><pubid idtype="pmpid" link="fulltext">19505943</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>A framework for variation discovery and genotyping using next-generation DNA sequencing data</p></title><aug><au><snm>DePristo</snm><fnm>MA</fnm></au><au><snm>Banks</snm><fnm>E</fnm></au><au><snm>Poplin</snm><fnm>R</fnm></au><au><snm>Garimella</snm><fnm>KV</fnm></au><au><snm>Maguire</snm><fnm>JR</fnm></au><au><snm>Hartl</snm><fnm>C</fnm></au><au><snm>Philippakis</snm><fnm>AA</fnm></au><au><snm>del 
Angel</snm><fnm>G</fnm></au><au><snm>Rivas</snm><fnm>MA</fnm></au><au><snm>Hanna</snm><fnm>M</fnm></au><au><snm>McKenna</snm><fnm>A</fnm></au><au><snm>Fennell</snm><fnm>TJ</fnm></au><au><snm>Kernytsky</snm><fnm>AM</fnm></au><au><snm>Sivachenko</snm><fnm>AY</fnm></au><au><snm>Cibulskis</snm><fnm>K</fnm></au><au><snm>Gabriel</snm><fnm>SB</fnm></au><au><snm>Altshuler</snm><fnm>D</fnm></au><au><snm>Daly</snm><fnm>MJ</fnm></au></aug><source>Nat Genet</source><pubdate>2011</pubdate><volume>43</volume><fpage>491</fpage><lpage>8</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng.806</pubid><pubid idtype="pmcid">3083463</pubid><pubid idtype="pmpid" link="fulltext">21478889</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>SNP detection for massively parallel whole-genome resequencing</p></title><aug><au><snm>Li</snm><fnm>R</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Fang</snm><fnm>X</fnm></au><au><snm>Yang</snm><fnm>H</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Kristiansen</snm><fnm>K</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au></aug><source>Genome Res</source><pubdate>2009</pubdate><volume>19</volume><fpage>1124</fpage><lpage>32</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.088013.108</pubid><pubid idtype="pmcid">2694485</pubid><pubid idtype="pmpid" link="fulltext">19420381</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome</p></title><aug><au><snm>Zhao</snm><fnm>Z</fnm></au><au><snm>Boerwinkle</snm><fnm>E</fnm></au></aug><source>Genome Res</source><pubdate>2002</pubdate><volume>12</volume><fpage>1679</fpage><lpage>86</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.287302</pubid><pubid idtype="pmcid">187558</pubid><pubid idtype="pmpid" link="fulltext">12421754</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Non-CpG methylation is prevalent in embryonic stem cells 
and may be mediated by DNA methyltransferase 3a</p></title><aug><au><snm>Ramsahoye</snm><fnm>BH</fnm></au><au><snm>Biniszkiewicz</snm><fnm>D</fnm></au><au><snm>Lyko</snm><fnm>F</fnm></au><au><snm>Clark</snm><fnm>V</fnm></au><au><snm>Bird</snm><fnm>AP</fnm></au><au><snm>Jaenisch</snm><fnm>R</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2000</pubdate><volume>97</volume><fpage>5237</fpage><lpage>42</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.97.10.5237</pubid><pubid idtype="pmcid">25812</pubid><pubid idtype="pmpid" link="fulltext">10805783</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Highly integrated single-base resolution maps of the epigenome in Arabidopsis</p></title><aug><au><snm>Lister</snm><fnm>R</fnm></au><au><snm>O&apos;Malley</snm><fnm>RC</fnm></au><au><snm>Tonti-Filippini</snm><fnm>J</fnm></au><au><snm>Gregory</snm><fnm>BD</fnm></au><au><snm>Berry</snm><fnm>CC</fnm></au><au><snm>Millar</snm><fnm>AH</fnm></au><au><snm>Ecker</snm><fnm>JR</fnm></au></aug><source>Cell</source><pubdate>2008</pubdate><volume>133</volume><fpage>523</fpage><lpage>36</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.cell.2008.03.029</pubid><pubid idtype="pmcid">2723732</pubid><pubid idtype="pmpid" link="fulltext">18423832</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning</p></title><aug><au><snm>Cokus</snm><fnm>SJ</fnm></au><au><snm>Feng</snm><fnm>S</fnm></au><au><snm>Zhang</snm><fnm>X</fnm></au><au><snm>Chen</snm><fnm>Z</fnm></au><au><snm>Merriman</snm><fnm>B</fnm></au><au><snm>Haudenschild</snm><fnm>CD</fnm></au><au><snm>Pradhan</snm><fnm>S</fnm></au><au><snm>Nelson</snm><fnm>SF</fnm></au><au><snm>Pellegrini</snm><fnm>M</fnm></au><au><snm>Jacobsen</snm><fnm>SE</fnm></au></aug><source>Nature</source><pubdate>2008</pubdate><volume>452</volume><fpage>215</fpage><lpage>9</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1038/nature06745</pubid><pubid idtype="pmcid">2377394</pubid><pubid idtype="pmpid" link="fulltext">18278030</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Identification of genetic elements that autonomously determine DNA methylation states</p></title><aug><au><snm>Lienert</snm><fnm>F</fnm></au><au><snm>Wirbelauer</snm><fnm>C</fnm></au><au><snm>Som</snm><fnm>I</fnm></au><au><snm>Dean</snm><fnm>A</fnm></au><au><snm>Mohn</snm><fnm>F</fnm></au><au><snm>Sch&#252;beler</snm><fnm>D</fnm></au></aug><source>Nat Genet</source><pubdate>2011</pubdate><volume>43</volume><fpage>1091</fpage><lpage>7</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng.946</pubid><pubid idtype="pmpid" link="fulltext">21964573</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Allele-specific DNA methylation: beyond imprinting</p></title><aug><au><snm>Tycko</snm><fnm>B</fnm></au></aug><source>Hum Mol Genet</source><pubdate>2010</pubdate><volume>19</volume><fpage>R210</fpage><lpage>20</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/hmg/ddq376</pubid><pubid idtype="pmcid">2953749</pubid><pubid idtype="pmpid" link="fulltext">20855472</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Allele-specific methylation is prevalent and is contributed by CpG-SNPs in the human genome</p></title><aug><au><snm>Shoemaker</snm><fnm>R</fnm></au><au><snm>Deng</snm><fnm>J</fnm></au><au><snm>Wang</snm><fnm>W</fnm></au><au><snm>Zhang</snm><fnm>K</fnm></au></aug><source>Genome Res</source><pubdate>2010</pubdate><volume>20</volume><fpage>883</fpage><lpage>9</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.104695.109</pubid><pubid idtype="pmcid">2892089</pubid><pubid idtype="pmpid" link="fulltext">20418490</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Analysis of DNA methylation in a three-generation family reveals widespread genetic influence on epigenetic 
regulation</p></title><aug><au><snm>Gertz</snm><fnm>J</fnm></au><au><snm>Varley</snm><fnm>KE</fnm></au><au><snm>Reddy</snm><fnm>TE</fnm></au><au><snm>Bowling</snm><fnm>KM</fnm></au><au><snm>Pauli</snm><fnm>F</fnm></au><au><snm>Parker</snm><fnm>SL</fnm></au><au><snm>Kucera</snm><fnm>KS</fnm></au><au><snm>Willard</snm><fnm>HF</fnm></au><au><snm>Myers</snm><fnm>RM</fnm></au></aug><source>PLoS Genet</source><pubdate>2011</pubdate><volume>7</volume><fpage>e1002228</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1002228</pubid><pubid idtype="pmcid">3154961</pubid><pubid idtype="pmpid" link="fulltext">21852959</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome</p></title><aug><au><snm>Xie</snm><fnm>W</fnm></au><au><snm>Barr</snm><fnm>CL</fnm></au><au><snm>Kim</snm><fnm>A</fnm></au><au><snm>Yue</snm><fnm>F</fnm></au><au><snm>Lee</snm><fnm>AY</fnm></au><au><snm>Eubanks</snm><fnm>J</fnm></au><au><snm>Dempster</snm><fnm>EL</fnm></au><au><snm>Ren</snm><fnm>B</fnm></au></aug><source>Cell</source><pubdate>2012</pubdate><volume>148</volume><fpage>816</fpage><lpage>31</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.cell.2011.12.035</pubid><pubid idtype="pmpid" link="fulltext">22341451</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>The DNA methylome of human peripheral blood mononuclear 
cells</p></title><aug><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Zhu</snm><fnm>J</fnm></au><au><snm>Tian</snm><fnm>G</fnm></au><au><snm>Li</snm><fnm>N</fnm></au><au><snm>Li</snm><fnm>Q</fnm></au><au><snm>Ye</snm><fnm>M</fnm></au><au><snm>Zheng</snm><fnm>H</fnm></au><au><snm>Yu</snm><fnm>J</fnm></au><au><snm>Wu</snm><fnm>H</fnm></au><au><snm>Sun</snm><fnm>J</fnm></au><au><snm>Zhang</snm><fnm>H</fnm></au><au><snm>Chen</snm><fnm>Q</fnm></au><au><snm>Luo</snm><fnm>R</fnm></au><au><snm>Chen</snm><fnm>M</fnm></au><au><snm>He</snm><fnm>Y</fnm></au><au><snm>Jin</snm><fnm>X</fnm></au><au><snm>Zhang</snm><fnm>Q</fnm></au><au><snm>Yu</snm><fnm>C</fnm></au><au><snm>Zhou</snm><fnm>G</fnm></au><au><snm>Sun</snm><fnm>J</fnm></au><au><snm>Huang</snm><fnm>Y</fnm></au><au><snm>Zheng</snm><fnm>H</fnm></au><au><snm>Cao</snm><fnm>H</fnm></au><au><snm>Zhou</snm><fnm>X</fnm></au><au><snm>Guo</snm><fnm>S</fnm></au><au><snm>Hu</snm><fnm>X</fnm></au><au><snm>Li</snm><fnm>X</fnm></au><au><snm>Kristiansen</snm><fnm>K</fnm></au><au><snm>Bolund</snm><fnm>L</fnm></au><au><snm>Xu</snm><fnm>J</fnm></au><au><snm>Wang</snm><fnm>W</fnm></au><au><snm>Yang</snm><fnm>H</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Li</snm><fnm>R</fnm></au><au><snm>Beck</snm><fnm>S</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Zhang</snm><fnm>X</fnm></au></aug><source>PLoS Biol</source><pubdate>2010</pubdate><volume>8</volume><fpage>e1000533</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pbio.1000533</pubid><pubid idtype="pmcid">2976721</pubid><pubid idtype="pmpid" link="fulltext">21085693</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>DNA-binding factors shape the mouse methylome at distal regulatory 
regions</p></title><aug><au><snm>Stadler</snm><fnm>MB</fnm></au><au><snm>Murr</snm><fnm>R</fnm></au><au><snm>Burger</snm><fnm>L</fnm></au><au><snm>Ivanek</snm><fnm>R</fnm></au><au><snm>Lienert</snm><fnm>F</fnm></au><au><snm>Sch&#246;ler</snm><fnm>A</fnm></au><au><snm>Wirbelauer</snm><fnm>C</fnm></au><au><snm>Oakeley</snm><fnm>EJ</fnm></au><au><snm>Gaidatzis</snm><fnm>D</fnm></au><au><snm>Tiwari</snm><fnm>VK</fnm></au><au><snm>Sch&#252;beler</snm><fnm>D</fnm></au></aug><source>Nature</source><pubdate>2011</pubdate><volume>480</volume><fpage>490</fpage><lpage>5</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">22170606</pubid></xrefbib></bibl><bibl id="B25"><title><p>Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer</p></title><aug><au><snm>Hon</snm><fnm>GC</fnm></au><au><snm>Hawkins</snm><fnm>RD</fnm></au><au><snm>Caballero</snm><fnm>OL</fnm></au><au><snm>Lo</snm><fnm>C</fnm></au><au><snm>Lister</snm><fnm>R</fnm></au><au><snm>Pelizzola</snm><fnm>M</fnm></au><au><snm>Valsesia</snm><fnm>A</fnm></au><au><snm>Ye</snm><fnm>Z</fnm></au><au><snm>Kuan</snm><fnm>S</fnm></au><au><snm>Edsall</snm><fnm>LE</fnm></au><au><snm>Camargo</snm><fnm>AA</fnm></au><au><snm>Stevenson</snm><fnm>BJ</fnm></au><au><snm>Ecker</snm><fnm>JR</fnm></au><au><snm>Bafna</snm><fnm>V</fnm></au><au><snm>Strausberg</snm><fnm>RL</fnm></au><au><snm>Simpson</snm><fnm>AJ</fnm></au><au><snm>Ren</snm><fnm>B</fnm></au></aug><source>Genome Res</source><pubdate>2012</pubdate><volume>22</volume><fpage>246</fpage><lpage>58</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.125872.111</pubid><pubid idtype="pmcid">3266032</pubid><pubid idtype="pmpid" link="fulltext">22156296</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic 
modifications</p></title><aug><au><snm>Harris</snm><fnm>RA</fnm></au><au><snm>Wang</snm><fnm>T</fnm></au><au><snm>Coarfa</snm><fnm>C</fnm></au><au><snm>Nagarajan</snm><fnm>RP</fnm></au><au><snm>Hong</snm><fnm>C</fnm></au><au><snm>Downey</snm><fnm>SL</fnm></au><au><snm>Johnson</snm><fnm>BE</fnm></au><au><snm>Fouse</snm><fnm>SD</fnm></au><au><snm>Delaney</snm><fnm>A</fnm></au><au><snm>Zhao</snm><fnm>Y</fnm></au><au><snm>Olshen</snm><fnm>A</fnm></au><au><snm>Ballinger</snm><fnm>T</fnm></au><au><snm>Zhou</snm><fnm>X</fnm></au><au><snm>Forsberg</snm><fnm>KJ</fnm></au><au><snm>Gu</snm><fnm>J</fnm></au><au><snm>Echipare</snm><fnm>L</fnm></au><au><snm>O&apos;Geen</snm><fnm>H</fnm></au><au><snm>Lister</snm><fnm>R</fnm></au><au><snm>Pelizzola</snm><fnm>M</fnm></au><au><snm>Xi</snm><fnm>Y</fnm></au><au><snm>Epstein</snm><fnm>CB</fnm></au><au><snm>Bernstein</snm><fnm>BE</fnm></au><au><snm>Hawkins</snm><fnm>RD</fnm></au><au><snm>Ren</snm><fnm>B</fnm></au><au><snm>Chung</snm><fnm>WY</fnm></au><au><snm>Gu</snm><fnm>H</fnm></au><au><snm>Bock</snm><fnm>C</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Zhang</snm><fnm>MQ</fnm></au><au><snm>Haussler</snm><fnm>D</fnm></au><au><snm>Ecker</snm><fnm>JR</fnm></au><au><snm>Li</snm><fnm>W</fnm></au><au><snm>Farnham</snm><fnm>PJ</fnm></au><au><snm>Waterland</snm><fnm>RA</fnm></au><au><snm>Meissner</snm><fnm>A</fnm></au><au><snm>Marra</snm><fnm>MA</fnm></au><au><snm>Hirst</snm><fnm>M</fnm></au><au><snm>Milosavljevic</snm><fnm>A</fnm></au><au><snm>Costello</snm><fnm>JF</fnm></au></aug><source>Nat Biotechnol</source><pubdate>2010</pubdate><volume>28</volume><fpage>1097</fpage><lpage>105</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nbt.1682</pubid><pubid idtype="pmcid">2955169</pubid><pubid idtype="pmpid" link="fulltext">20852635</pubid></pubidlist></xrefbib></bibl><bibl id="B27"><title><p>Allelic skewing of DNA methylation is widespread across the 
genome</p></title><aug><au><snm>Schalkwyk</snm><fnm>LC</fnm></au><au><snm>Meaburn</snm><fnm>EL</fnm></au><au><snm>Smith</snm><fnm>R</fnm></au><au><snm>Dempster</snm><fnm>EL</fnm></au><au><snm>Jeffries</snm><fnm>AR</fnm></au><au><snm>Davies</snm><fnm>MN</fnm></au><au><snm>Plomin</snm><fnm>R</fnm></au><au><snm>Mill</snm><fnm>J</fnm></au></aug><source>Am J Hum Genet</source><pubdate>2010</pubdate><volume>86</volume><fpage>196</fpage><lpage>212</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ajhg.2010.01.014</pubid><pubid idtype="pmcid">2820163</pubid><pubid idtype="pmpid" link="fulltext">20159110</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>Analysis of repetitive element DNA methylation by MethyLight</p></title><aug><au><snm>Weisenberger</snm><fnm>DJ</fnm></au><au><snm>Campan</snm><fnm>M</fnm></au><au><snm>Long</snm><fnm>TI</fnm></au><au><snm>Kim</snm><fnm>M</fnm></au><au><snm>Woods</snm><fnm>C</fnm></au><au><snm>Fiala</snm><fnm>E</fnm></au><au><snm>Ehrlich</snm><fnm>M</fnm></au><au><snm>Laird</snm><fnm>PW</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2005</pubdate><volume>33</volume><fpage>6823</fpage><lpage>36</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gki987</pubid><pubid idtype="pmcid">1301596</pubid><pubid idtype="pmpid" link="fulltext">16326863</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem 
cells</p></title><aug><au><snm>Lister</snm><fnm>R</fnm></au><au><snm>Pelizzola</snm><fnm>M</fnm></au><au><snm>Kida</snm><fnm>YS</fnm></au><au><snm>Hawkins</snm><fnm>RD</fnm></au><au><snm>Nery</snm><fnm>JR</fnm></au><au><snm>Hon</snm><fnm>G</fnm></au><au><snm>Antosiewicz-Bourget</snm><fnm>J</fnm></au><au><snm>O&apos;Malley</snm><fnm>R</fnm></au><au><snm>Castanon</snm><fnm>R</fnm></au><au><snm>Klugman</snm><fnm>S</fnm></au><au><snm>Downes</snm><fnm>M</fnm></au><au><snm>Yu</snm><fnm>R</fnm></au><au><snm>Stewart</snm><fnm>R</fnm></au><au><snm>Ren</snm><fnm>B</fnm></au><au><snm>Thomson</snm><fnm>JA</fnm></au><au><snm>Evans</snm><fnm>RM</fnm></au><au><snm>Ecker</snm><fnm>JR</fnm></au></aug><source>Nature</source><pubdate>2011</pubdate><volume>471</volume><fpage>68</fpage><lpage>73</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature09798</pubid><pubid idtype="pmcid">3100360</pubid><pubid idtype="pmpid" link="fulltext">21289626</pubid></pubidlist></xrefbib></bibl><bibl id="B30"><title><p>A comparative analysis of DNA methylation across human embryonic stem cell lines</p></title><aug><au><snm>Chen</snm><fnm>PY</fnm></au><au><snm>Feng</snm><fnm>S</fnm></au><au><snm>Joo</snm><fnm>JWJ</fnm></au><au><snm>Jacobsen</snm><fnm>SE</fnm></au><au><snm>Pellegrini</snm><fnm>M</fnm></au></aug><source>Genome Biol</source><pubdate>2011</pubdate><volume>12</volume><fpage>R62</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2011-12-7-r62</pubid><pubid idtype="pmcid">3218824</pubid><pubid idtype="pmpid" link="fulltext">21733148</pubid></pubidlist></xrefbib></bibl><bibl id="B31"><title><p>Mapping short DNA sequencing reads and calling variants using mapping quality scores</p></title><aug><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Ruan</snm><fnm>J</fnm></au><au><snm>Durbin</snm><fnm>R</fnm></au></aug><source>Genome Res</source><pubdate>2008</pubdate><volume>18</volume><fpage>1851</fpage><lpage>8</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1101/gr.078212.108</pubid><pubid idtype="pmcid">2577856</pubid><pubid idtype="pmpid" link="fulltext">18714091</pubid></pubidlist></xrefbib></bibl><bibl id="B32"><title><p>The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data</p></title><aug><au><snm>McKenna</snm><fnm>A</fnm></au><au><snm>Hanna</snm><fnm>M</fnm></au><au><snm>Banks</snm><fnm>E</fnm></au><au><snm>Sivachenko</snm><fnm>A</fnm></au><au><snm>Cibulskis</snm><fnm>K</fnm></au><au><snm>Kernytsky</snm><fnm>A</fnm></au><au><snm>Garimella</snm><fnm>K</fnm></au><au><snm>Altshuler</snm><fnm>D</fnm></au><au><snm>Gabriel</snm><fnm>S</fnm></au><au><snm>Daly</snm><fnm>M</fnm></au><au><snm>DePristo</snm><fnm>MA</fnm></au></aug><source>Genome Res</source><pubdate>2010</pubdate><volume>20</volume><fpage>1297</fpage><lpage>303</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.107524.110</pubid><pubid idtype="pmcid">2928508</pubid><pubid idtype="pmpid" link="fulltext">20644199</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing</p></title><aug><au><snm>Xi</snm><fnm>Y</fnm></au><au><snm>Bock</snm><fnm>C</fnm></au><au><snm>M&#252;ller</snm><fnm>F</fnm></au><au><snm>Sun</snm><fnm>D</fnm></au><au><snm>Meissner</snm><fnm>A</fnm></au><au><snm>Li</snm><fnm>W</fnm></au></aug><source>Bioinformatics</source><pubdate>2012</pubdate><volume>28</volume><fpage>430</fpage><lpage>2</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btr668</pubid><pubid idtype="pmpid" link="fulltext">22155871</pubid></pubidlist></xrefbib></bibl><bibl id="B34"><title><p>Bismark: a flexible aligner and methylation caller for Bisulfite-Seq 
applications</p></title><aug><au><snm>Krueger</snm><fnm>F</fnm></au><au><snm>Andrews</snm><fnm>SR</fnm></au></aug><source>Bioinformatics</source><pubdate>2011</pubdate><volume>27</volume><fpage>1571</fpage><lpage>2</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btr167</pubid><pubid idtype="pmcid">3102221</pubid><pubid idtype="pmpid" link="fulltext">21493656</pubid></pubidlist></xrefbib></bibl><bibl id="B35"><title><p>Dynamic changes in the human methylome during differentiation</p></title><aug><au><snm>Laurent</snm><fnm>L</fnm></au><au><snm>Wong</snm><fnm>E</fnm></au><au><snm>Li</snm><fnm>G</fnm></au><au><snm>Huynh</snm><fnm>T</fnm></au><au><snm>Tsirigos</snm><fnm>A</fnm></au><au><snm>Ong</snm><fnm>CT</fnm></au><au><snm>Low</snm><fnm>HM</fnm></au><au><snm>Kin Sung</snm><fnm>KW</fnm></au><au><snm>Rigoutsos</snm><fnm>I</fnm></au><au><snm>Loring</snm><fnm>J</fnm></au><au><snm>Wei</snm><fnm>CL</fnm></au></aug><source>Genome Res</source><pubdate>2010</pubdate><volume>20</volume><fpage>320</fpage><lpage>31</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.101907.109</pubid><pubid idtype="pmcid">2840979</pubid><pubid idtype="pmpid" link="fulltext">20133333</pubid></pubidlist></xrefbib></bibl><bibl id="B36"><title><p>Directional DNA methylation changes and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment</p></title><aug><au><snm>Hodges</snm><fnm>E</fnm></au><au><snm>Molaro</snm><fnm>A</fnm></au><au><snm>Dos Santos</snm><fnm>CO</fnm></au><au><snm>Thekkat</snm><fnm>P</fnm></au><au><snm>Song</snm><fnm>Q</fnm></au><au><snm>Uren</snm><fnm>PJ</fnm></au><au><snm>Park</snm><fnm>J</fnm></au><au><snm>Butler</snm><fnm>J</fnm></au><au><snm>Rafii</snm><fnm>S</fnm></au><au><snm>McCombie</snm><fnm>WR</fnm></au><au><snm>Smith</snm><fnm>AD</fnm></au><au><snm>Hannon</snm><fnm>GJ</fnm></au></aug><source>Mol 
Cell</source><pubdate>2011</pubdate><volume>44</volume><fpage>17</fpage><lpage>28</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.molcel.2011.08.026</pubid><pubid idtype="pmcid">3412369</pubid><pubid idtype="pmpid" link="fulltext">21924933</pubid></pubidlist></xrefbib></bibl><bibl id="B37"><title><p>Bis-SNP website</p></title><aug><au><cnm>USC Epigenome Center</cnm></au></aug><url>http://epigenome.usc.edu/publicationdata/bissnp2011</url></bibl><bibl id="B38"><title><p>Quantitative Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine at Single-Base Resolution</p></title><aug><au><snm>Booth</snm><fnm>MJ</fnm></au><au><snm>Branco</snm><fnm>MR</fnm></au><au><snm>Ficz</snm><fnm>G</fnm></au><au><snm>Oxley</snm><fnm>D</fnm></au><au><snm>Krueger</snm><fnm>F</fnm></au><au><snm>Reik</snm><fnm>W</fnm></au><au><snm>Balasubramanian</snm><fnm>S</fnm></au></aug><source>Science</source><pubdate>2012</pubdate></bibl><bibl id="B39"><title><p>Epigenome-wide association studies for common human diseases</p></title><aug><au><snm>Rakyan</snm><fnm>VK</fnm></au><au><snm>Down</snm><fnm>TA</fnm></au><au><snm>Balding</snm><fnm>DJ</fnm></au><au><snm>Beck</snm><fnm>S</fnm></au></aug><source>Nat Rev Genet</source><pubdate>2011</pubdate><volume>12</volume><fpage>529</fpage><lpage>41</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nrg3000</pubid><pubid idtype="pmpid" link="fulltext">21747404</pubid></pubidlist></xrefbib></bibl><bibl id="B40"><title><p>Cloning, characterization, and expression in Escherichia coli of the gene coding for the CpG DNA methylase from Spiroplasma sp. 
strain MQ1(M.SssI)</p></title><aug><au><snm>Renbaum</snm><fnm>P</fnm></au><au><snm>Abrahamove</snm><fnm>D</fnm></au><au><snm>Fainsod</snm><fnm>A</fnm></au><au><snm>Wilson</snm><fnm>GG</fnm></au><au><snm>Rottem</snm><fnm>S</fnm></au><au><snm>Razin</snm><fnm>A</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>1990</pubdate><volume>18</volume><fpage>1145</fpage><lpage>52</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/18.5.1145</pubid><pubid idtype="pmcid">330428</pubid><pubid idtype="pmpid" link="fulltext">2181400</pubid></pubidlist></xrefbib></bibl><bibl id="B41"><title><p>High-throughput bisulfite sequencing in mammalian genomes</p></title><aug><au><snm>Smith</snm><fnm>ZD</fnm></au><au><snm>Gu</snm><fnm>H</fnm></au><au><snm>Bock</snm><fnm>C</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Meissner</snm><fnm>A</fnm></au></aug><source>Methods</source><pubdate>2009</pubdate><volume>48</volume><fpage>226</fpage><lpage>32</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ymeth.2009.05.003</pubid><pubid idtype="pmcid">2864123</pubid><pubid idtype="pmpid" link="fulltext">19442738</pubid></pubidlist></xrefbib></bibl><bibl id="B42"><title><p>A bisulfite method of 5-methylcytosine mapping that minimizes template degradation</p></title><aug><au><snm>Raizis</snm><fnm>AM</fnm></au><au><snm>Schmitt</snm><fnm>F</fnm></au><au><snm>Jost</snm><fnm>JP</fnm></au></aug><source>Anal Biochem</source><pubdate>1995</pubdate><volume>226</volume><fpage>161</fpage><lpage>6</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1006/abio.1995.1204</pubid><pubid idtype="pmpid" link="fulltext">7785768</pubid></pubidlist></xrefbib></bibl><bibl id="B43"><title><p>A new method for accurate assessment of DNA quality after bisulfite treatment</p></title><aug><au><snm>Ehrich</snm><fnm>M</fnm></au><au><snm>Zoll</snm><fnm>S</fnm></au><au><snm>Sur</snm><fnm>S</fnm></au><au><snm>van den Boom</snm><fnm>D</fnm></au></aug><source>Nucleic Acids 
Res</source><pubdate>2007</pubdate><volume>35</volume><fpage>e29</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkl1134</pubid><pubid idtype="pmcid">1865059</pubid><pubid idtype="pmpid" link="fulltext">17259213</pubid></pubidlist></xrefbib></bibl><bibl id="B44"><title><p>Cancer Genomics Hub (CG-Hub)</p></title><aug><au><cnm>UC Santa Cruz</cnm></au></aug><url>https://cghub.ucsc.edu/</url></bibl><bibl id="B45"><title><p>High Performance Computing and Communications Center (HPCC)</p></title><aug><au><cnm>USC</cnm></au></aug><url>http://www.usc.edu/hpcc/</url></bibl></refgrp>
</bm>
</art>