Abstract
Background
A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as proteinbinding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary.
Results
We applied our method to the recombination signal sequences (RSS) that direct assembly of Bcell and Tcell antigenreceptor genes via V(D)J recombination. The technique is based on model selection by crossvalidation and produces models that allow computation of an information score for any signallength sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from nonRSS.
Conclusions
Our modeldevelopment procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.
Background
Modeling variable DNA sequence sites
The set of binding sites for a single DNAbinding protein can be highly variable [1,2], and the degree of nucleotide variation tolerated generally differs from position to position within a site [3,4,5]. This sequence diversity can have important functional consequences as the affinity between regulatory proteins and their binding sites is modulated by changes in bindingsite sequence [1]. Large datasets of related DNA sites are now available [3,4,6,7]. Currently, more than 100 prokaryotic and eukaryotic genomes have been completely sequenced, permitting crossspecies comparisons of even larger sequence assemblies than collected here (see, for example [8,9]). Computational approaches can therefore be used to detect and model the sequence patterns present within the bindingsite variability. The most useful models of variable DNA sites provide classification algorithms for identifying functional sites as well as insights that can help elucidate the relationship between the structure and function of these sites.
Variable DNA sequence sites are frequently described by a consensus sequence [10] or a weight matrix [11,12,13]. Consensus sequences have limited utility for even moderately variable sites because they preserve little or no information about the variability within the sequences they characterize. Differences from consensus are quantified by counting the number of mismatched positions, making no distinction between mismatches that abolish function of the site and those that modulate function.
Weight matrices were introduced to characterize transcription and translation initiation sites in Escherichia coli [11] and have become standard for characterizing binding sites for transcription factors. A weight matrix is a twodimensional matrix in which each row corresponds to one of the four nucleotides and each column corresponds to one position in the site being described [11]. The elements of the matrix are the observed counts for each nucleotide at each position in an alignment of the DNA sites [11], or some function of these counts (reviewed in [13,14]). Any sequence can be scored by summing the matrix elements corresponding to the sequence [11], thereby quantitatively rating the sequence's functional potential.
Weight matrices can be used to scan genomic DNA and determine statistically significant matches to the sequence pattern described by the matrix [14]. Putative sites are scored by the weight matrix, so the number of false positives and false negatives can be balanced by choosing an appropriate threshold score defining functional sites. The scores of functional sites are sometimes correlated with their level of function [11,14,15,16,17,18,19,20,21,22,23].
Weight matrices are typically based on order zero Markov chains, meaning they assume that individual base pairs in the binding site affect binding affinity independently. Weightmatrix models based on order one Markov chains assume that adjacent positions are correlated [15,24,25,26,27]. Ponomarenko et al. [28] constructed weight matrices for many different transcription factor binding sites using mono, di, and trinucleotide motifs allowing correlation between nonadjacent positions by specifying X_{1}NX_{3} and X_{1}NX_{3}NX_{5} motifs, where X_{i} is the nucleotide at position i of the motif and N is any nucleotide. While this approach allows for correlation between more than two nonadjacent positions, the motifs may still be unnecessarily restrictive. The apposition of distant bindingsite positions can be important to recruitment of the binding protein, for example, through formation of DNA secondary structure [29,30,31,32], or to the interaction of the binding site with its binding protein [33]. Therefore correlation between distant positions is expected.
Burge and Karlin [34] used hypothesis testing via χ^{2} tests to detect significant correlations between any two positions and introduced a modelbuilding procedure, maximal dependence decomposition, to account for the dependencies. For each significant correlation, this method partitions the dataset into subsets of sequences exhibiting no correlation; the final model is a composite of weight matrices (order zero), one for each subset. Our modelbuilding technique is based on crossvalidation criteria rather than on hypothesis testing per se, and we use a more general model class for DNA sequences.
Models that allow for pairwise correlations between positions were developed in the context of Bayes networks by Cai et al. [35]. Bayes networks represent a large class of models, and in fact, our model can be recast as a Bayes network, although it is not now formulated in those terms.
We have developed an approach to determine the correlation structure present in a set of variable sequence sites. For each position in the site, we identify all other correlated positions placing no restrictions on the number or spatial relationship of correlated positions in the site. Our approach determines, from all possible combinations of disjoint probability distributions, the set of distributions that most effectively distinguishes functional sites from nonfunctional sequences. Although the family of models we consider is very large, we proceed by model selection rather than hypothesis testing. The selected probability function is the product of these disjoint probability distributions. The natural logarithm of this probability function can be used as a score for any sequence, thereby recognizing the presence of evolutionarily conserved nucleotide associations. We have applied this approach to recombination signal sequences (RSS), a set of DNA binding sites necessary for specific immunity.
Recombination signal sequences
Specific immunity depends on the ability of B and T lymphocytes to recognize antigens, molecular identifiers of pathogens [36]. The immune system does not anticipate which antigen will be encountered, but remarkable genetic mechanisms have evolved that generate diverse antigenreceptor repertoires. The primary generative mechanism is V(D)J recombination, the rearrangement of Bcell receptor (BCR) or Tcell receptor (TCR) V, D, and J gene segments to form functional genes; the resulting combinatorial and junctional diversity can produce around 10^{14} BCR specificities and around 10^{18} TCR specificities in one animal [37]. This extraordinary plasticity has a price: V(D)J recombination can produce selfreactive receptors leading to autoimmune disease [38,39] and may err to create oncogenic chromosomal translocations [40].
V(D)J recombination proceeds by the introduction of doublestrand DNA breaks (reviewed in [41]). To prevent DNA breaks that would cause cellular damage, the V(D)J recombinase must be specifically targeted to the Bcr and Tcr loci. This targeting is regulated primarily by binding of RAG1 [42], an enzyme in the recombinase complex, to the RSS adjacent to each V, D and J gene segment [43]. RSS were initially identified as a conserved heptamer (consensus CACAGTG) and a conserved nonamer (consensus ACAAAAACC) separated by a less conserved spacer of either 12 ± 1 or 23 ± 1 basepairs (bp) [44]. The initial CAC of the heptamer is highly conserved, but all remaining positions show moderate to very low levels of conservation (reviewed in [45]).
RSS variability has important functional consequences: the efficiency with which each RSS mediates recombination depends on its sequence [46]. There is strong evidence that differing recombination efficiencies of RSS result in the biased utilization of their associated gene segments (reviewed in [47,48]). In addition, the biases in gene segment use observed in the postselection TCR repertoire are consistent with the biases observed before selection, suggesting that the primary immune repertoire may be genetically patterned [48].
The high level of diversity among RSS makes it very difficult to evaluate their recombinogenic potential. RSS not associated with a V, D or J gene segment but participating in aberrant (illegitimate) V(D)J recombination are particularly difficult to recognize. These nonphysiologic RSS are important because of their potential involvement in oncogenic chromosomal translocation and receptor editing, a process that alters the specificity of autoreactive receptors. Nonphysiologic RSS that have simply arisen by chance can be referred to as fortuitous RSS whereas those thought to have descended from the same ancient transposon as physiologic RSS can be referred to as cryptic (cRSS) [49,50]. Generally, the origin of a nonphysiologic RSS is not known; for simplicity, we will refer to both types as cryptic.
To understand the role of RSS variability in the formation of the primary immune repertoire and of cRSS in illegitimate V(D)J recombination, it is necessary to quantify the genetic variability among physiologic RSS and the relationship between this variability and the regulation of recombination. Allowing for each of the 4 nucleotides at just 10 of the 28 or 39 RSS positions results in over 10^{6} different signals  it is not feasible to measure the efficiency of all potential RSS experimentally. Statistical models of RSS variability that allow prediction of recombination efficiency are essential to furthering our understanding of the biological function of RSS variability.
One estimate of cRSS frequency in mammalian genomes is one in every 600 bp [49]. Promiscuous recombination at this frequency, however, would result in significant damage to the cell. This suggests that complex patterns underlie the high level of nucleotide diversity observed at individual positions, thereby increasing the specificity of the signal governing recruitment of the V(D)J recombinase.
We find significant correlations between nucleotide positions in RSS, suggesting they act cooperatively and that nucleotide associations are conserved. We used our model selection procedure to determine the best models of RSS correlation structure, one model for 12bp spacer RSS (12RSS) and one for 23RSS. We also modeled both types of RSS with order zero and order one Markov chains. The scores from all six models are highly correlated with measured recombination efficiencies, but the models arising from our technique are much better than the Markov models at identifying physiologic and cryptic RSS in genomic DNA.
Results
Mouse RSS show limited sequence conservation
We analyzed 356 physiologic mouse RSS (see Methods). While 62% (219/356) of these RSS contain a consensus heptamer, just 16% (57/356) contain a consensus nonamer and only 13% (48/356) contain both a consensus heptamer and a consensus nonamer. To measure the level of nucleotide diversity at individual RSS positions, we computed the entropy (H_{i}) [51] for each position i in an alignment of the 201 12RSS and an alignment of the 155 23RSS (Figure 1a). A strictly conserved position in which no variation is tolerated has H_{i} = 0, while a position in which the four nucleotides are observed at nearly uniform frequencies has H_{i} = 1.
Figure 1. Entropy and mutual information in physiologic RSS. (a) Positionwise entropy H_{i} at each position in an alignment of physiologic 12RSS (upper bar) and an alignment of physiologic 23RSS (lower bar). Position in the alignment is shown above each bar, and the value of H_{i} at position i is indicated by the color of the bar at that position. (b) Mutual information (MI_{i,i'}) between pairs of positions in physiologic 12 and 23RSS. MI_{i,i'} values between positions in the 12RSS are shown in the upper panel, and MI_{i,i'} values between positions in the 23RSS are shown in the lower panel. Location in the alignment is given on the x and yaxes. The black lines mark the heptamer (H)spacer (S) and spacernonamer (N) boundaries. The color at the intersection of an xaxis and a yaxis grid linegives the MI contained in the positions corresponding to the two grid lines. The actual MI computed from the RSS alignments are shown in the left panels, and one of 300 permutations is shown on the right as an example of the level of MI observable between each pair of positions in the absence of any correlation between them.
The signal specified by the individual RSS positions is very low, especially in 23RSS (Figure 1a). The average entropy is _{12} = 0.50 for 12RSS and _{23} = 0.67 for 23RSS; for a randomly generate string of A, G, C, and T characters, the expected value for is one. The patterns of nucleotide diversity at individual positions are similar for both 12 and 23RSS: the initial CAC of the heptamer is invariant, the remaining four positions of the heptamer are moderately diverse, the spacer is highly diverse, and the A tract of the nonamer (positions 5 through 7) is better conserved than the other nonamer positions (Figure 1a). The average entropy for 23RSS is higher than for 12RSS, primarily because the 23RSS nonamers are much more diverse than 12RSS nonamers (_{N23} = 0.59; _{N12} = 0.35; Figure 1a).
It is known from experimental studies that nucleotide substitutions at highly diverse positions influence recombination efficiency [46,52]. That a given RSS position can be both highly variable and exhibit strong effects on signal efficiency appears contradictory. The contradiction can be reconciled, however, by hypothesizing correlations between nucleotides in RSS: a nucleotide substitution at one position may be compensated for by a change at another position in the RSS.
Significant pairwise correlations exist between positions in the RSS
To test for correlation between pairs of positions in the RSS, we computed the mutual information, MI_{i,i}, [51] between all pairs of positions i and i' in an alignment of the 201 physiologic mouse 12RSS and in an alignment of the 155 physiologic mouse 23RSS (see Methods) (Figure 1b). Statistically significant MI_{i,i'} values result when there is a correlation between nucleotide frequencies at the two positions, indicating selection pressure to maintain (or avoid) a particular dinucleotide.
In 12 and 23RSS, we detect statistically significant correlations between positions in those regions of the RSS exhibiting high levels of positionwise entropy. These are the spacer, especially the nonamerproximal half, and the positions of the heptamer and nonamer that are adjacent to the spacer (Figure 1b). In general, correlation between two positions decreases as a function of distance, but there are examples of strong correlation between widely separated positions (Figure 1b). For example, we observe 12RSS with A at positions 8 and 19 at a much higher frequency (0.38) than expected (0.11) from the independent frequencies of A at each position. In addition, we find that the levels of correlation are higher in 23RSS than in 12RSS (Figure 1b); this is an especially interesting observation given that 23RSS are much more variable than 12RSS (Figure 1a).
Statistical models of 12 and 23RSS determine the relevant correlation structure
The MI computations give evidence for strong pairwise correlations, including between distant positions, and correlations may exist between any number of positions. We therefore developed statistical models, one for 12RSS and one for 23RSS, that can account for correlations between arbitrary positions without regard for their distance from each other or the number of correlated positions. The models were developed using a procedure that selects from all possible combinations of disjoint probability distributions the set of marginal (for one position) and joint (for multiple positions) distributions that most distinguishes physiologic RSS from other nucleotide sequences of the same length. For example, positions 8 and 19 in 12RSS are highly correlated (Figure 1b). They could be included in the model either through the two marginal probability distributions  the distribution over the four nucleotides at position 8 and the distribution over the four nucleotides at position 19, or through one joint probability distribution  the distribution over the 16 pairs of nucleotides at the pair of positions. Each model computes a score for RSS information content: RIC_{12} for 28bp sequences and RIC_{23} for 39bp sequences; the formation of joint probability functions is determined by maximizing the mean score for physiologic RSS.
The selected 12RSS model is:
RIC_{12} = ln [P_{1}P_{2}P_{3,15,25}P_{4,5}P_{6,28}P_{7,8,19}P_{9,26}P_{10,12}P_{11,27}P_{13,14,23}P_{16,17,18}P_{20,21,22}P_{24}]
where P_{1} is the marginal probability function for position 1 and P_{3,15,25} is the joint probability function for positions 3,15 and 25. The presence of the joint probability function in the model indicates that these three positions are mutually correlated. The selected 23RSS model is:
RIC_{23} = ln [P_{1}P_{2}P_{3}P_{4,14}P_{5,39}P_{6}P_{7,24,25}P_{8,9,21}P_{10,16}P_{11,12}P_{13,22}P_{15,23}P_{17,18}P_{19,27,30,31,32,33,37}P_{20,26}P_{28,29}P_{34,38}P_{35,36}].
Most of the marginal probabilities have been merged to form joint probability functions, but the order in which the joint functions were formed reflects the relative strength of the correlations between the corresponding positions: positions with large MI are grouped early in model selection (data not shown). The groups of positions with the strongest correlations are (7,8,19), (16,17,18), and (20,21,22) in 12RSS (Table 1) and (19,27,30,31,32,33,37), (8,9,21), and (7,24,25) in 23RSS (Table 2). For both 12 and 23RSS, the positions exhibiting the strongest cooperative influence lie in the nonamerproximal half of the spacer and in the heptamer and nonamer positions adjacent to the spacer (Figure 2). Interestingly, these positions overlap substantially with positions exhibiting ethylation/methylation interference in RSS complexed with RAG1/RAG2 (Figure 2 and [53]) suggesting that the correlations detected by the models are relevant to RAGRSS interaction.
Figure 2. Location of the most highly correlated positions in 12 and 23RSS and sites of ethylation/methylation interference in RSS complexed with RAG. (a) Positions in 12RSS are shown in the upper row of boxes and those in 23RSS are shown in the lower row. Boxes filled with the same color indicate positions that are correlated. For each model, the three associations with the highest level of correlation are shown. The intensity of the blue indicates the relative strength of the correlation with the darkest being the most correlated. Sites of ethylation/methylation interference in RSS complexed with RAG are shown in orange [53]. (b) Overlap between the strongest correlations (shown in blue) and sites of ethylation/methylation interference (shown in orange) [53]. The relative positions of the heptamer and nonamer are indicated by the red lines. Figure generated in RasMol [69].
Table 1. Strongest nucleotide associations in 12RSS
Table 2. Strongest nucleotide associations in 23RSS
The selected models were compared with order zero and order one Markov models
To determine whether the correlation structure detected by our model selection procedure improves RSS recognition and evaluation, we constructed weight matrix models based on order zero or order one Markov chains. Order zero Markov models (WM0_{12} for 12RSS and WM0_{23} for 23RSS) assume that all RSS positions are independent and are therefore computed from marginal probability distributions. The score for an Nbp sequence is computed as
where P_{i} is the probability of observing nucleotide X at position i. Order one Markov models (WM1_{12} and WM1_{23}) assume that adjacent positions are correlated (the identity of the nucleotide at position i depends on the nucleotide at position i  1) and are computed from conditional probability distributions:
where P_{1} is the marginal probability distribution for position 1, and P(ii  1) is the probability of observing nucleotide X at position i given the nucleotide at position i  1.
For physiologic 12 and 23RSS, the score distribution is shifted toward higher scores for models that include correlation (RIC and WM1) (Table 3, Figure 3). The difference is most striking for 23RSS (Table 3, Figure 3). The mean scores for the physiologic 12RSS are _{12} = 19.84, _{12} = 18.49, and _{12} = 18.47 (Table 3); the mean scores for physiologic 23RSS are _{23} = 37.07, _{23} = 31.51, and _{23} = 32.39 (Table 3). There is relatively little change in the value of low scores across models, so as the correlation structure of the models increases in complexity, the increase in high scores results in an increase in the score range (Table 3 and Figure 3).
Figure 3. Plots of scores computed for physiologic RSS, and for 28 or 39bp segments taken from chromosome 8 DNA. (a) 12RSS; (b) 23RSS. The yaxis indicates the frequency of physiologic RSS or of segments from chromosome 8 with a finite score that score at the level given on the xaxis. Solid line, RIC; dashed line, WM1; dotted line, WM0.
Table 3. Distribution of scores for physiologic RSS
Functional RSS are best discriminated by RIC scores
To test whether functional RSS can be recognized by the sequence properties captured in the models, we characterized the score distributions for nonRSS DNA (Figure 4). These background distributions are characterized by their means and ranges. WM0, WM1 and RIC scores were computed for all 28 and 39bp segments in a 212,128bp fragment of mouse chromosome 8 (AC084823) containing no known RSS. We also characterized the background distributions by computing scores for a pseudorandom string (PRS) of A, T, C and G the same length as sequence AC084823 and having the same nucleotideusage frequencies.
Figure 4. Score distributions for nonRSS segments. (a) Results for 28bp segments (12RSS model); (b) results for 39bp segments (23RSS model). Score level is given by the xaxis, and the frequency of finite scores is given by the yaxis. Distributions for segments of chromosome 8 DNA are shown in blue, and distributions for segments from a pseudorandom string of A, G, C and T are shown in red. Solid line, RIC; dashed line, WM1; dotted line, WM0.
As expected, the WM0 score distributions for AC084823 and the PRS are almost identical for both 28 and 39bp segments (Table 4, Figure 4). For the models that account for correlation, however, scores from genomic DNA are higher on average than those computed for the PRS (Table 4, Figure 4). This is especially true for the WM1 models, indicating that, whereas both the WM1 and RIC models are influenced by correlations present in physiologic DNA, the correlations detected by the RIC models are more specific to RSS. Importantly, the distributions for RIC scores are always shifted toward lower scores than the corresponding WM0 and WM1 distributions (Table 4, Figure 4).
Table 4. Distribution of scores for nonRSS DNA and a pseudorandom string of A, G, C and T
We next compared across models the frequency of signallength segments in AC084823 that score above a given threshold. To determine test thresholds for a given model, we ranked the physiologic RSS by score in ascending order and took the lowest 5% of scores under each model as the test thresholds for that model (Table 5, Figure 5). For example, the lowest RIC_{12} is 48.16. It has rank zero, and zero physiologic RSS have a lower RIC_{12}. Of 212,101 (0.00414) 28bp segments in AC084823, 859 have RIC_{12} > 48.16. Similarly, threshold zero under the WM1_{12} model is 46.32, and 2,372 of 212,101 (0.01118) 28bp segments in AC084823 have WM1_{12} > 46.32. Under each model, the frequency of physiologic RSS scoring below threshold estimates the probability of failing to recognize a functional RSS and is a measure of the model's sensitivity. The frequency of nonRSS scoring above threshold estimates the probability of classifying nonRSS as functional and is a measure of the model's specificity. Both numbers should be small. The highscoring nonRSS segments may support recombination, however, and, if so, would be classified as cRSS. Those not supporting recombination would be counted as false positives.
Figure 5. Specificity of the models as a function of their sensitivity. Scores for physiologic RSS were ranked in ascending order so that the lowest score was ranked 0, the second lowest score was ranked 1, and so on. For each score, the number of physiologic RSS with a lower score is equal to this rank. We used the lowest 5% of scores as test thresholds: ranks 0 through 10 were used for the three 12RSS models and ranks 0 through 8 were used for the three 23RSS models. (a) Results for the 12RSS models; (b) results for the 23RSS models. The number of physiologic RSS scoring below a given threshold score (Table 5) is given on the xaxis, and the frequency of nonRSS segments in chromosome 8 sequence AC084823 scoring above the threshold is shown on the yaxis. Solid line, RIC; dashed line, WM1; dotted line, WM0.
Table 5. Frequency of abovethreshold scores for nonRSS
In general we find that the WM0 models predict the highest number of RSS in AC084823, and the RIC models predict the fewest (Table 5, Figure 5). Of 20 test thresholds, there are three for which RIC does not predict the smallest number of RSS: for the 12RSS models, the score for the ninthlowest ranking physiologic 12RSS and for the 23RSS models, the scores for the second and eighthlowest ranking physiologic 23RSS (Table 5, Figure 5).
From Figure 5, we determine thresholds at which excluding just one more physiologic RSS results in a large drop in the models' predicted number of cryptic RSS. We select 38.81 for RIC_{12} and 58.45 for RIC_{23} (Figure 5). Only two of the 201 (0.01) 12RSS have RIC_{12} < 38.81, and just 54 of 212,101 (2.5 × 10^{4}) 28bp segments in sequence AC084823 achieve RIC_{12} > 38.81. The frequency of RIC_{12} > 38.81 from the PRS is 9.9 × 10^{5} (21/212,102). Similarly, of the 155 physiologic 23RSS, only three (0.02) have RIC_{23} < 58.45. Just 100 of the 212,090 (4.7 × 10^{4}) 39bp segments from sequence AC084823 and only 58 of the 212,091 segments from the PRS (2.7 × 10^{4}) have RIC_{23} > 58.45. We therefore set 38.81 and 58.45 as RIC_{12} and RIC_{23} thresholds, respectively.
The RIC models reliably recognize physiologic RSS
We searched genomic DNA containing physiologic RSS to determine if RIC scores can resolve 12 and 23RSS. Sequences X58411 (7,360 bp) and X58414 (5,867) encompass the mouse Jλ locus and contain four 12RSS, three associated with functional Jλ gene segments and one associated with the pseudogene Jλ4 [54]. We computed RIC_{12} scores for 13,173 28bp segments in X58411 and X58414; the resulting RIC_{12} distributions are almost identical to that for the chromosome 8 sequence with means well below threshold (Table 6). The RIC_{12} for RSS associated with functional Jλ gene segments lie well outside this distribution (Table 6, Figure 6a); all three score > 38.81 (_{12} = 21.16). Analogous searches of mouse D_{H} (AF018146) and DJβ (AE000665) regions also demonstrated RIC_{12} values for physiologic RSS above threshold (Table 6). Of the 21 physiologic 12RSS included in the search, only the RSS associated with the Jλ4 pseudogene [54] scored below threshold (Table 6).
Figure 6. RIC values. (a) RIC_{12} values for 28bp segments from genomic regions containing Jλ gene segments. RIC_{12} values are given on the yaxis, and position in the sequence is shown on the xaxis. Each blue or red point represents the RIC_{12} for the 28bp segment beginning at that position. Points corresponding to physiologic RSS are red. The horizontal black line indicates RIC = 40. (b) RIC_{23} values for 39bp segments from genomic regions containing Vβ gene segments. The horizontal black line indicates RIC = 60. Other details as in (a).
Table 6. Mean scores for physiologic RSS and nonRSS segments in genomic sequence
Scans of the sequence containing Dβ and Jβ gene segments (AE000665, described above) and a 250,611bp sequence containing 16 Vβ genes (AE000663) showed that physiologic 23RSS are also easily resolved by their RIC scores (Table 6). The two Dβ 23RSS have RIC_{23} well above threshold (35.85 and 49.56; Table 6), and the Vβ 23RSS associated with functional gene segments are also easily identified (mean 34.17; Table 6, Figure 6b). Again, the only RSS not scoring above threshold are associated with pseudogenes [55,56,57] (Vβ 11 61.78; Vβ 123 61.76; Vβ 8 58.87); nevertheless, all three RIC_{23} scores fall above the background mean (77.75). RSS flanking pseudogenes cannot be stringently selected, so we expect their RIC to be below threshold but above background.
For comparison, we scanned the five sequences (AE000663, AE000665, AF018146, X58411 and X58414) using the WM0 and WM1 models. In all cases the mean score for nonRSS is lower than the mean score for physiologic RSS associated with functional gene segments, but the disparity between the two sets of scores is always greatest for the RIC models (Table 6), showing that discrimination between RSS and nonRSS is clearest using the RIC models. This wider disparity may be important when searching for functional but degenerate signals such as cryptic and pseudogeneassociated RSS.
RIC_{23} scores for AE000663 are directly compared with WM0_{23} or WM1_{23} scores in Figure 7. The majority of scores for nonRSS fall below the y = x line, indicating that they receive lower scores under the RIC_{23} model than under either of the Markov models (Figure 7). The scores for RSS associated with functional gene segments tend to fall above y = x, indicating their better discrimination by the RIC_{23} model (Figure 7). Of the seven pseudogeneassociated RSS in AE000663, scores for three fall in the cloud of background scores under all three models whereas scores for the remaining four do not (Figure 7). Although the scores for these four RSS are higher under both Markov models than under the RIC_{23} model, they are better discriminated by the RIC_{23} model because the background scores are higher under the Markov models (Figure 7).
Figure 7. RIC models are better at recognizing pseudogeneassociated RSS. The scores for sequence AE000663 under the RIC_{23} model are plotted against the scores under the WM0_{23} model (left panel) or the WM1_{23} model (right panel). RIC_{23} scores are shown on the yaxis, and WM1_{23} or WM0_{23} scores are shown on the xaxis. The y = x line is shown in red. AE000663 contains seven Vβ pseudogenes and nine functional Vβ gene segments. Blue points indicate scores for nonRSS, red points indicate scores for pseudogeneassociated RSS, and the red triangles indicate scores for RSS associated with functional gene segments. The horizontal gray line is halfway between the lowest RIC_{23} score for a physiologic RSS distinguishable from nonRSS scores (46.08.) and the highest RIC_{23} score for a nonRSS (49.09): 46.08  1.505 = 47.585 = 49.09 + 1.505. The vertical gray line is the same distance below the corresponding WM1_{23} or WM0_{23} score: 46.02  1.505 or 47.52  1.505, respectively.
Parameters for all six models were estimated from the full set of RSS (201 12RSS and 155 23RSS). Of the 39 RSS contained in the search contigs, 27 of the RSS associated with functional gene segments were part of the estimation set. In contrast, the pseudogeneassociated RSS (Jλ4, Jβ2.6, and seven Vβ) and three of the functional Jβ RSS were not part of the estimation set.
RIC scores identify functional cryptic RSS
12cRSS in 3' → 5' orientation and embedded near the 3' end of V gene segments can mediate receptor editing, the replacement of the V gene segment portion of a rearranged variableregion gene with an upstream V gene segment (reviewed in [58]). To test whether our model can recognize these cRSS, we computed WM0_{12}, WM1_{12}, and RIC_{12} scores for all 28bp segments in 3' → 5' orientation in V_{H} gene segments known to participate in receptor editing: the V_{H}2S1*01 gene segment [59], the V_{H}14S1 gene segment [60], and the 3H9 transgene [61]. The cRSS were not part of the RSS set used for parameter estimation. While the cRSS in V_{H}2S1*01 and V_{H}14S1 have higher scores than any other 28bp segment in their respective gene segments under all three models, the RIC_{12} and WM0_{12} models are better at identifying them because these models have lower background scores than the WM1_{12} model (Table 7). The cRSS in 3H9, however, is best recognized by its RIC_{12} score (Figure 8). While it has a higher score under the WM1 model, its RIC_{12} score is more separated from the background RIC scores than its WM1_{12} or WM0_{12} score (Figure 8). All receptorediting events in 3H9 involved the cRSS identified by RIC [61].
Figure 8. Lowscoring cRSS are better recognized by RIC_{12} than by WM0_{12} or WM1_{12}. RIC_{12} scores for 28bp segments in 3' → 5 orientation in the 3H9 transgene [62] are plotted against WM0_{12} or WM1_{12} scores for the same segments. The cRSS score is shown in red, and the nonRSS scores are shown in blue. The horizontal gray line is at 47.23, halfway between the cRSS RIC_{12} score (45.32) and the highest scoring nonRSS (49.15). The vertical gray line is the same distance below the cRSS WM0_{12} (47.46) or WM1_{12} (42.23) score. Other details as in Figure 7.
Table 7. Location of cRSS by theirRIC score
RIC scores are correlated with RSS recombination efficiencies
To quantify any correlation between RIC and RSS function, we computed Spearman's rank correlation coefficient (r_{s}) between RIC and published recombination frequencies [46]. The correlation coefficients are r_{s} = 0.81 for 12RSS (Figure 9) and r_{s} = 0.86 for 23RSS (Figure 10); for these RSS, RIC explains 6774% of the variation in recombination efficiency. WM0 and WM1 are equally well correlated with recombination efficiency: WM0_{12}, r_{s} = 0.80; WM1_{12}, r_{s} = 0.82, WM0_{23}, r_{s} = 0.88, WM1_{23}, r_{s} = 0.91. Only 1 of 27 12RSS and 1 of 13 23RSS included in this study [46] are part of our parameter estimation set.
Figure 9. Correlation between recombination efficiency (xaxis) as measured by Hesse et al. [46] and RIC (yaxis) for 12RSS.
Figure 10. Correlation between recombination efficiency (xaxis) as measured by Hesse et al. [46] and RIC (yaxis) for 23RSS.
Discussion
We developed models of the correlation structure in 12 and 23RSS using a model selection procedure that finds the models with the most power to predict the population of functional RSS. The procedure determines the groups of positions such that the predictive power of the model is increased by including the correlation between the positions within each group. Positions not correlated with any other position are included in the models by the probability distribution over the four nucleotides at that position; groups of correlated positions are included by the probability distribution for the set of motifs prescribed by the number of positions within the group; for example, a group of four positions would be modeled by the probability distribution for the 256 possible quadruplet motifs. The models compute an RSS information content score, RIC, for any RSSlength sequence, that is, 28bp segments are scored by the 12RSS model and 39bp segments are scored by the 23RSS model.
Model selection evaluates all possible combinations of probability distributions in a stepwise fashion, so the correlation structure of RSS is not specified nor assumed, but rather detected. Interestingly, the positions exhibiting the strongest correlation overlap substantially with the RSS nucleotides that contact the recombinase (Figure 2 and [53]), offering support for the hypothesis that positions acting cooperatively to influence binding by the recombinase coevolve. This overlap also suggests that the correlation structure detected by our model selection procedure is relevant to RSS function.
To determine if modeling the correlation structure specific to RSS improves RSS recognition and evaluation, we compared the RIC models with models that assume RSS positions are independent (order zero Markov models, WM0) and models that assume adjacent positions are correlated (order one Markov models, WM1). In general we find that, while all models predict RSS function equally well, with Spearman's rank correlation coefficients around 0.81 for 12RSS and 0.88 for 23RSS, the RIC models are better at discriminating between RSS and nonRSS segments. This is especially true for the degenerate relatives of RSS  cRSS and pseudogeneassociated RSS.
Under each model, we compared the score distribution for physiologic RSS with that for nonRSS segments. We find that the two distributions are most disparate under the RIC models (Figure 3). While including correlation in the models increases the scores for most physiologic RSS (Figure 3), assuming adjacent positions are correlated increases scores nonspecifically, raising the background scores (Figures 3,4). The distribution of scores for nonRSS was approximated by computing the score for all RSSlength segments in a 200kb region of chromosome 8 containing no known RSS and also in a pseudorandom string of A, G, C and T characters the same length as the chromosome 8 region and having the same nucleotideusage frequencies.
We determined threshold scores to discriminate between functional RSS and nonRSS. The lowest 5% of scores for physiologic RSS under each model were used as test thresholds; for 17 of 20 thresholds, the frequency of nonRSS scoring above threshold was lowest under the RIC models (Figure 5). We selected threshold RIC scores (38.81 and 58.45 for 12 and 23RSS respectively), at which more than 98% of physiologic RSS score above threshold, and the predicted frequency of functional signals in the mouse genome is on the order of 10^{4} (Table 6). Many of the signals identified by the models may mediate recombination, and so this frequency overestimates the falsepositive rate. The falsepositive rate can only be estimated by testing a random sample of these putative cRSS for function.
We show that the threshold scores can be effectively used to screen genomic DNA for functional RSS. We screened over 650 kb of genomic DNA containing 39 physiologic RSS (Table 6), 12 of which are not part of the RSS set used for parameter estimation. Under all models, the mean score for nonRSS is lower than the mean score for physiologic RSS associated with functional gene segments, but the disparity between the two sets of scores is always greatest for the RIC models (Table 6). Only four RSS, one 12RSS and three 23RSS have RIC scores below their respective threshold scores, and all four are associated with pseudogenes. The remaining four pseudogeneassociated RSS included in our search are better recognized by RIC than by WM0 or WM1 (Figure 7). Furthermore, we computed the RIC_{12} scores for all potential 3' → 5' oriented 12cRSS in three V_{H} gene segments observed to mediate receptor editing. Importantly, cRSS were not included in the estimation set. Two cRSS are recognized by all three models, but the third is recognized only by RIC (Figure 8). We expect cRSS and pseudogeneassociated RSS to be under less stringent selection pressure than physiologic RSS and therefore to have lost some of their sequence similarity to RSS. An important future test of the RIC models will be to scan genomic DNA prospectively for cRSS and show that the cRSS identified mediate recombination by the V(D)J recombinase. We have successfully used the model selection procedure introduced here to develop models for two sets of highly variable and complex binding sites. An advantage of this procedure is that the correlation structure of the site is determined from the statistical properties of the sequence set alone; therefore, the sequence properties governing recognition of the site by the binding protein(s) can be identified in the absence of experimental evaluation of function. Often there is an abundance of sequence data that has not yet been, or perhaps cannot be, experimentally evaluated. Information about the correlation structure of a binding site can give important insight into how the binding site and binding protein(s) interact, and may suggest important experiments. In addition, by choosing the positions to be correlated independently of their relative positions, we are able to model highly complex patterns with a relatively reduced increase in model complexity. Inclusion of highly complex correlation patterns in models of DNAsequence sites can improve the precision with which sites are identified and functional levels are predicted.
Materials and methods
RSS sequence set
We analyzed 356 physiologic mouse RSS from all Tcr and Ig loci available [62]. About 96% (340/356) of the RSS are associated with functional V, D or J gene segments (Immunogenetics database [57,63,64]). Two are associated with pseudogenes known to rearrange, and two are associated with open reading frames (ORFs) also known to rearrange (see Immunogenetics database [57,63,64]). The ability of the remaining RSS to rearrange is unconfirmed.
Calculation of positionwise entropy and mutual information
For a DNA sequence alignment, the positionwise entropy H_{i} [65] measures the level of nucleotide diversity: minimum H is 0 and indicates strict sequence conservation; maximum H is 1 and indicates that the four nucleotides occur at nearly uniform frequencies. The estimated entropy at the ith position in an alignment is given by where p_{i,j} is the estimated probability of nucleotide j at position i.
The estimated mutual information (MI) [51] between two positions, i and i', is computed as: MI_{i,i'} = H_{i} + H_{i'}  H_{i,i'}, where H_{i} is the estimated entropy computed from the frequency of the four nucleotides at position i and H_{i,i'} is the entropy computed using the frequency of the 16 pairs of nucleotides at the two positions.
Nucleotide and nucleotide pair probabilities are themselves estimated using the Bayesian posterior mean [66]
where m_{i,s} is the frequency of nucleotide or nucleotide pair s in position(s) i of the alignment, N_{i} is the total number of sequences in the alignment at position(s) i, and r is the number of distinct classes in the probability distribution (for single nucleotides r = 4, for nucleotide pairs, r = 16, and so on).
Under the null hypothesis of no correlation between nucleotide positions in RSS, positive MI_{i,i'} values may still be observed. To test if the MI_{i,i'} values differ significantly from 0, we randomized the order of the sequences in position i' and recomputed MI_{i,i'}. By randomizing the order of the sequences in one position but not the other, we preserve both marginal distributions but disrupt any correlations. The proportion of 300 permutations giving MI_{i,i'} values higher than that observed in the data is our estimate for the p value under the null hypothesis.
Statistical models of RSS structure
Single models for 12RSS and 23RSS define the set of probability distributions that contain the most information about RSS structure. We developed the models by stepwise model enlargement and selection. We begin with the smallest models, order zero Markov models estimated under the assumption of pairwise site independence. These models take as the probability of observing a sequence S, the product of the individual marginal probabilities over the RSS positions, P(S) = Π_{i}P_{i}(s) where s is the nucleotide observed in sequence S at position i. The marginal probability distribution for a position defines the probability of observing each of the four nucleotides at that position and is estimated from the RSS dataset using the Bayesian estimator for P_{i}(s) described above. Each potential model enlargement is accomplished by replacing the product of an independent marginal probability and kvariate joint probability by a corresponding k + 1variate joint probability.
For example, the first step of model enlargement for RSS of length N compares all possible combinations of one joint probability distribution for two positions and marginal probability distributions for the remaining N2. positions: for every pair of positions i and i', there is the model
Under each of these models we compute scores by taking the natural logarithm of P(S) (ln P(S)) for each RSS in the dataset using leaveoneout crossvalidation [67]. Briefly, for each model, we exclude one RSS from the dataset, estimate the probability distributions used in the model from the remaining RSS, and compute ln P(S) for the excluded RSS. We perform this computation for each RSS and take the average ln P(S) over all RSS. We then select the combination of probability distributions giving the largest average ln P(S).
The second step of model enlargement expands the model selected in step one by comparing all models within two classes: those based on two joint probability distributions, the one formed in step one and one for an additional pair of positions, and marginal probability distributions for the remaining N  4 positions, and those based on one joint probability distribution for a group of three positions, the pair joint selected in step one expanded to include one additional position, and marginal probability distributions for the remaining N  3 positions. We again compute ln P(S) under each model for each RSS using leaveoneout crossvalidation and select the model giving the largest average ln P(S).
We continued the stepwise model enlargement and selection, iteratively expanding the models until the mean ln P(S) ceased improving. The final model is not necessarily optimal, however; we perform the maximization in a stepwise manner and so, as in any stepwise statistical procedure, are not guaranteed to achieve the global maximum.
Once the final set of probability distributions has been selected (the mean ln P(S) no longer improves), the parameters of each probability distribution are estimated on the complete set of RSS. ln P(S) for an RSS is a value between  ∞ and 0. If RSS were strictly conserved, the consensus RSS would have ln P(S) = 0. We define the ln P(S) of a sequence computed from the final model as its RSS information content (RIC). The final models take the form:
RIC_{12} = ln [P_{1}P_{2}P_{3,15,25}P_{4,5}P_{6,28}P_{7,8,19}P_{9,26}P_{10,12}P_{11,27}P_{13,14,23}P_{16,17,18}P_{20,21,22}P_{24}]
and
RIC_{23} = ln [P_{1}P_{2}P_{3}P_{4,14}P_{5,39}P_{6}P_{7,24,25}P_{8,9,21}P_{10,16}P_{11,12}P_{13,22}P_{15,23}P_{17,18}P_{19,27,30,31,32,33,37}P_{20,26}P_{28,29}P_{34,38}P_{35,36}].
The CA dinucleotide at positions 1 and 2 of the heptamer is required for rearrangement [46,52,68]. Therefore, the models assign a probability of 0 to any RSS not beginning with CA.
RSS in the dataset receive higher RIC values (and WM0 and WM1 values, see below) during genome searches than during model development, because the sequence for which RIC is calculated is excluded from the dataset during model development, a property of leaveoneout crossvalidation [67]. In contrast, nucleotide frequencies used in the searches are estimated on the complete set of RSS. The difference between the two RIC values is greatest for rare sequences and, in general, the differences are larger for 23RSS. The difference between the two scores is < 1 for 78% of 12RSS and for 39% of 23RSS.
Markov models
We also modeled RSS with weight matrices based on an order zero or order one Markov model (reviewed in [27]). The order zero Markov model assumes that all RSS positions are independent, so the probability of observing a sequence S that is Nbp long is:
where P_{i}(s) is the probability of observing nucleotide s at RSS position i. P_{i}(s) is estimated as described above from the set of physiologic 12 or 23RSS. The score for sequence S is then the natural logarithm of the probability P(S),
The order one Markov model assumes that all adjacent positions are correlated, so the probability of observing sequence S is based on conditional probabilities:
where P(s_{i}s_{i1}) is the probability of observing nucleotide s_{i} in position i given that nucleotide s_{i1} occupies position i  1. From Bayes rule,
where P(s_{i1},s_{i},) is the joint probability distribution for the dinucleotide s_{i1}s_{i} occurring at the pair of positions i  1 and i. Again, the probability distributions are estimated from the physiologic 12 or 23RSS, and we take the natural logarithm of the probability P(S) as the score for sequence S:
Acknowledgements
This work was supported in part by grants AI24335 and AI49326 (G.K.). L.G.C. received a Bioinformatics and Genome Technology Postdoctoral Fellowship from Duke University.
References

Mitchison A: Partitioning of genetic variation between regulatory and coding gene segments: the predominance of software variation in genes encoding introvert proteins.
Immunogenetics 1997, 46:4652. PubMed Abstract  Publisher Full Text

Purugganan MD: The molecular population genetics of regulatory genes.
Mol Ecol 2000, 9:14511461. PubMed Abstract  Publisher Full Text

Kolchanov NA, Ignatieva EV, Ananko EA, Podkolodnaya OA, Stepanenko IL, Merkulova TI, Pozdnyakov MA, Podkolodny NL, Naumochkin AN, Romashchenko AG: Transcription Regulatory Regions Database (TRRD): its status in 2002.
Nucleic Acids Res 2002, 30:312317. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae.
Bioinformatics 1999, 15:607611. PubMed Abstract  Publisher Full Text

Vanet A, Marsan L, Sagot MF: Promoter sequences and algorithmical methods for identifying them.
Res Microbiol 1999, 150:779799. PubMed Abstract  Publisher Full Text

Perier RC, Praz V, Junier T, Bonnard C, Bucher P: The eukaryotic promoter database (EPD).
Nucleic Acids Res 2000, 28:302303. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation.
Nucleic Acids Res 2000, 28:316319. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences.
J Mol Biol 1990, 212:563578. PubMed Abstract

Walker M, Pavlovic V, Kasif S: A comparative genomic method for computational identification of prokaryotic translation initiation sites.
Nucleic Acids Res 2002, 30:31813191. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Day WH, McMorris FR: Critical comparison of consensus methods for molecular sequences.
Nucleic Acids Res 1992, 20:10931099. PubMed Abstract

Stormo GD, Schneider TD, Gold L, Ehrenfeucht A: Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli.
Nucleic Acids Res 1982, 10:29973011. PubMed Abstract

Harr R, Haggstrom M, Gustafsson P: Search algorithm for pattern match analysis of nucleic acid sequences.
Nucleic Acids Res 1983, 11:29432957. PubMed Abstract

Stormo GD: Consensus patterns in DNA.
Methods Enzymol 1990, 183:211221. PubMed Abstract

Stormo GD: DNA binding sites: representation and discovery.
Bioinformatics 2000, 16:1623. PubMed Abstract  Publisher Full Text

Stormo GD, Schneider TD, Gold L: Quantitative analysis of the relationship between nucleotide sequence and functional activity.
Nucleic Acids Res 1986, 14:66616679. PubMed Abstract

Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statisticalmechanical theory and application to operators and promoters.
J Mol Biol 1987, 193:723750. PubMed Abstract

Fickett JW: Quantitative discrimination of MEF2 sites.
Mol Cell Biol 1996, 16:437441. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lustig B, Jernigan RL: Consistencies of individual DNA baseamino acid interactions in structures and sequences.
Nucleic Acids Res 1995, 23:47074711. PubMed Abstract

Roulet E, Fisch I, Junier T, Bucher P, Mermod N: Evaluation of computer tools for the prediction of transcription factor binding sites on genomic DNA.
In Silico Biol 1998, 1:2128. PubMed Abstract  Publisher Full Text

Roulet E, Bucher P, Schneider R, Wingender E, Dusserre Y, Werner T, Mermod N: Experimental analysis and computer prediction of CTF/NFI transcription factor DNA binding sites.
J Mol Biol 2000, 297:833848. PubMed Abstract  Publisher Full Text

Stormo GD, Strobl S, Yoshioka M, Lee JS: Specificity of the Mnt protein. Independent effects of mutations at different positions in the operator.
J Mol Biol 1993, 229:821826. PubMed Abstract  Publisher Full Text

Takeda Y, Sarai A, Rivera VM: Analysis of the sequencespecific interactions between Cro repressor and operator DNA by systematic base substitution experiments.
Proc Natl Acad Sci USA 1989, 86:439443. PubMed Abstract  PubMed Central Full Text

Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, Pontoglio M: Analysis of the distribution of binding sites for a tissuespecific transcription factor in the vertebrate genome.
J Mol Biol 1997, 266:231245. PubMed Abstract  Publisher Full Text

Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors.
Nucleic Acids Res 2002, 30:12551261. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Salzberg SL: A method for identifying splice sites and translational start sites in eukaryotic mRNA.
Comput Appl Biosci 1997, 13:365376. PubMed Abstract

Zhang MQ, Marr TG: A weight array method for splicing signal analysis.
Comput Appl Biosci 1993, 9:499509. PubMed Abstract

Burge CB: Modeling dependencies in premRNA splicing signals. In In Computational Methods in Molecular Biology. Edited by Salzberg SL, Searls DB Kasif S. New York: Elsevier; 1998:371.

Ponomarenko MP, Ponomarenko JV, Frolov AS, Podkolodnaya OA, Vorobyev DG, Kolchanov NA, Overton GC: Oligonucleotide frequency matrices addressed to recognizing functional DNA sites.
Bioinformatics 1999, 15:631643. PubMed Abstract  Publisher Full Text

Zazopoulos E, Lalli E, Stocco DM, SassoneCorsi P: DNA binding and transcriptional repression by DAX1 blocks steroidogenesis.
Nature 1997, 390:311315. PubMed Abstract  Publisher Full Text

Bianchi ME, Beltrame M, Paonessa G: Specific recognition of cruciform DNA by nuclear protein HMG1.
Science 1989, 243:10561059. PubMed Abstract

Robbe K, Bonnefoy E: NonBDNA structures on the interferonbeta promoter?
Biochimie 1998, 80:665671. PubMed Abstract  Publisher Full Text

Wadkins RM: Targeting DNA secondary structures.
Curr Med Chem 2000, 7:115. PubMed Abstract

Sadofsky MJ: The RAG proteins in V(D)J recombination: more than just a nuclease.
Nucleic Acids Res 2001, 29:13991409. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA.
J Mol Biol 1997, 268:7894. PubMed Abstract  Publisher Full Text

Cai D, Delcher A, Kao B, Kasif S: Modeling splice sites with Bayes networks.
Bioinformatics 2000, 16:152158. PubMed Abstract  Publisher Full Text

Zinkernagel RM, Hengartner H: Regulation of the immune response by antigen.
Science 2001, 293:251253. PubMed Abstract  Publisher Full Text

Janeway C, Travers P, Walport M, Shlomchik M: Immunobiology: The Immune System in Health and Disease. New York: Garland; 2001.

Radic MZ, Weigert M: Origins of antiDNA antibodies and their implications for Bcell tolerance.
Ann NY Acad Sci 1995, 764:384396. PubMed Abstract

Sprent J, Kishimoto H: T cell tolerance and the thymus.
Ann NY Acad Sci 1998, 841:236245. PubMed Abstract  Publisher Full Text

Davila M, Foster S, Kelsoe G, Yang K: A role for secondary V(D)J recombination in oncogenic chromosomal translocations?
Adv Cancer Res 2001, 81:6192. PubMed Abstract

Fugmann SD, Lee AI, Shockett PE, Villey IJ, Schatz DG: The RAG proteins and V(D)J recombination: complexes, ends, and transposition.
Annu Rev Immunol 2000, 18:495527. PubMed Abstract  Publisher Full Text

Difilippantonio MJ, McMahan CJ, Eastman QM, Spanopoulou E, Schatz DG: RAG1 mediates signal sequence recognition and recruitment of RAG2 in V(D)J recombination.
Cell 1996, 87:253262. PubMed Abstract  Publisher Full Text

Sakano H, Huppi K, Heinrich G, Tonegawa S: Sequences at the somatic recombination sites of immunoglobulin lightchain genes.
Nature 1979, 280:288294. PubMed Abstract

Tonegawa S: Somatic generation of antibody diversity.
Nature 1983, 302:575581. PubMed Abstract

Lewis SM: The mechanism of V(D)J joining: lessons from molecular, immunological, and comparative analyses.
Adv Immunol 1994, 56:27150. PubMed Abstract

Hesse JE, Lieber MR, Mizuuchi K, Gellert M: V(D)J recombination: a functional definition of the joining signals.
Genes Dev 1989, 3:10531061. PubMed Abstract

Feeney AJ, Tang A, Ogwaro KM: Bcell repertoire formation: role of the recombination signal sequence in nonrandom V segment utilization.
Immunol Rev 2000, 175:5969. PubMed Abstract  Publisher Full Text

Livak F, Petrie HT: Somatic generation of antigenreceptor diversity: a reprise.
Trends Immunol 2001, 22:608612. PubMed Abstract  Publisher Full Text

Lewis SM, Agard E, Suh S, Czyzyk L: Cryptic signals and the fidelity of V(D)J joining.
Mol Cell Biol 1997, 17:31253136. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Marculescu R, Le T, Simon P, Jaeger U, Nadel B: V(D)Jmediated translocations in lymphoid neoplasms: a functional assessment of genomic instability by cryptic sites.
J Exp Med 2002, 195:8598. PubMed Abstract  Publisher Full Text

Kullback S: Information Theory and Statistics. New York: Wiley; 1959.

Akamatsu Y, Tsurushita N, Nagawa F, Matsuoka M, Okazaki K, Imai M, Sakano H: Essential residues in V(D)J recombination signals.
J Immunol 1994, 153:45204529. PubMed Abstract

Swanson PC, Desiderio S: V(D)J recombination signal recognition: distinct, overlapping DNA protein contacts in complexes containing RAG1 with and without RAG2.
Immunity 1998, 9:115125. PubMed Abstract  Publisher Full Text

Miller J, Selsing E, Storb U: Structural alterations in J regions of mouse immunoglobulin lambda genes are associated with differential gene expression.
Nature 1982, 295:428430. PubMed Abstract

Chen F, Rowen L, Hood L, Rothenberg EV: Differential transcriptional regulation of individual TCR V segments before gene rearrangement.
J Immunol 2001, 166:17711780. PubMed Abstract  Publisher Full Text

Chou HS, Anderson SJ, Louie MC, Godambe SA, Pozzi MR, Behlke MA, Huppi K, Loh DY: Tandem linkage and unusual RNA splicing of the Tcell receptor betachain variableregion genes.
Proc Natl Acad Sci USA 1987, 84:19921996. PubMed Abstract  PubMed Central Full Text

Lefranc MP: IMGT, the international ImMunoGeneTics database.
Nucleic Acids Res 2001, 29:207209. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kouskoff V, Nemazee D: Role of receptor editing and revision in shaping the B and T lymphocyte repertoire.
Life Sci 2001, 69:11051113. PubMed Abstract  Publisher Full Text

Kleinfield R, Hardy RR, Tarlinton D, Dangl J, Herzenberg LA, Weigert M: Recombination between an expressed immunoglobulin heavychain gene and a germline variable gene segment in a Ly 1^{+} Bcell lymphoma.
Nature 1986, 322:843846. PubMed Abstract

Usuda S, Takemori T, Matsuoka M, Shirasawa T, Yoshida K, Mori A, Ishizaka K, Sakano H: Immunoglobulin V gene replacement is caused by the intramolecular DNA deletion mechanism.
EMBO J 1992, 11:611618. PubMed Abstract

Chen C, Nagy Z, Prak EL, Weigert M: Immunoglobulin heavy chain gene replacement: a mechanism of receptor editing.
Immunity 1995, 3:747755. PubMed Abstract  Publisher Full Text

Downloads for statistical models of recombination signal sequences that identify cryptic signals and predict recombination efficiencies [http://www.duke.edu/~lgcowell] webcite

Lefranc MP, Giudicelli V, Ginestoux C, Bodmer J, Muller W, Bontrop R, Lemaitre M, Malik A, Barbie V, Chaume D: IMGT, the international ImMunoGeneTics database.
Nucleic Acids Res 1999, 27:209212. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ruiz M, Giudicelli V, Ginestoux C, Stoehr P, Robinson J, Bodmer J, Marsh SG, Bontrop R, Lemaitre M, Lefranc G, et al.: IMGT, the international ImMunoGeneTics database.
Nucleic Acids Res 2000, 28:219221. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shannon CE, Weaver W: The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press; 1949.

Jaynes ET: Probability Theory: The Logic Of Science. Edited by Bretthorst GL. Cambridge: Cambridge University Press; in press.

Stone M: Crossvalidatory choice and assessment of statistical predictions (with discussion).

Ramsden DA, Baetz K, Wu GE: Conservation of sequence in recombination signal sequence spacers.
Nucleic Acids Res 1994, 22:17851796. PubMed Abstract