Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome

1 Molecular, Cellular and Developmental Biology Department, KBT918, Yale University, 266 Whitney Avenue, New Haven, Connecticut 06511, USA
2 Computer Science Department, Yale University, 51 Prospect St., New Haven, Connecticut 06511, USA
3 Molecular Biophysics and Biochemistry Department, Yale University, 260 Whitney Avenue, New Haven, Connecticut 06511, USA
4 Genetics Department, Yale University, 333 Cedar Street, New Haven, Connecticut 06511, USA
Genome Biology 2008, 9:R3 doi:10.1186/gb-2008-9-1-r3
Subject areas: Bioinformatics, Genome studies, Molecular biology

Additional files

Additional data file 1: Shown are examples of RACE PCR products on an agarose gel. Format: PDF. Size: 47 KB.

Additional data file 2: The scores are computed using a subset (from one experiment) of our first set of RACE sequences, as described in the first row of Table 1. We then determined a reasonable threshold on these scores according to the overall distribution of the scores of the 'unique' matches (for every sequence, we choose the highest-scoring match located on the expected chromosome, if one exists). We selected a threshold of 70 for the 'fitness scores', because this value clearly separates the two sets of matches, which we interpret as high-quality and low-quality ones. Lowering the threshold will increase the sensitivity of our analysis while decreasing its specificity. Format: PDF. Size: 3 KB.

Additional data file 3: Further explanation of consensus splice site analyses: To decide the window size for the consensus splice site analysis, we considered a simplified model in which a nucleotide sequence of length N is generated by randomly selecting A, C, G, T with equal probability of 1/4, and then computed the probability (prob_pattern) that the sequence contains at least one occurrence of a consensus splice site pattern (for example, having either 'GT' or 'AG' in the sequence). This is as follows: prob_pattern(N) = count_pattern(N)/(4^N), where count_pattern(1) = 0, count_pattern(2) = 1, and count_pattern(N) = 4^(N - 2) - count_pattern(N - 2) + 4 × count_pattern(N - 1) for N > 2. Although this formula does not take into account many of the more sophisticated factors at play in reality, it provides a useful guideline for selecting the window size for our analysis.
This file shows the squared values of such probabilities (which can be considered a lower bound on the probability that a random sequence has a complete consensus pattern) for N ranging from 2 to 13. In the analysis in this paper, we selected a window size of 8 to ensure at least twofold enrichment in the number of sequences that we identified compared with the number expected under the simplified model, given the same number of sequences. Format: PDF. Size: 4 KB.

Additional data file 4: This file can be uploaded to the University of California at Santa Cruz Genome Browser to view all RACE products. Format: BED. Size: 50 KB.
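The recurrence given for Additional data file 3 can be checked numerically. The following is a minimal sketch in Python, not the authors' code. It assumes that count_pattern(N) counts sequences containing one fixed dinucleotide (such as 'GT'), which is the reading under which the stated base cases hold. The function names mirror the text, while `brute_force_count` is a hypothetical helper added here only to verify the recurrence by enumeration.

```python
from fractions import Fraction
from itertools import product


def count_pattern(n: int) -> int:
    """Count length-n sequences over A, C, G, T that contain one fixed
    dinucleotide (assumed here to be a single pattern such as 'GT'),
    using the recurrence quoted in the text."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 0
    if n == 2:
        return 1
    prev2, prev1 = 0, 1  # count_pattern(1), count_pattern(2)
    for k in range(3, n + 1):
        # count_pattern(k) = 4^(k-2) - count_pattern(k-2) + 4 * count_pattern(k-1)
        prev2, prev1 = prev1, 4 ** (k - 2) - prev2 + 4 * prev1
    return prev1


def prob_pattern(n: int) -> Fraction:
    """Probability that a uniformly random length-n sequence contains the pattern."""
    return Fraction(count_pattern(n), 4 ** n)


def brute_force_count(n: int, pattern: str = "GT") -> int:
    """Hypothetical check: enumerate all 4^n sequences and count those with the pattern."""
    return sum(pattern in "".join(seq) for seq in product("ACGT", repeat=n))


if __name__ == "__main__":
    for n in range(2, 14):
        p = prob_pattern(n)
        # The squared value is the quantity the text describes as a lower bound.
        print(n, float(p), float(p * p))
```

Under this reading, `prob_pattern(n) ** 2` corresponds to the squared probabilities described for N from 2 to 13. Exact fractions are used so that the comparison with enumeration is not affected by floating-point rounding.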

