|
Figure 2.
Example of transcript modeling from a set of protein and mRNA alignments using DACMs.
(a) The DACM inputs are mRNA (r1...r6) and protein (p1, p2) sequences that have been aligned
to a genomic sequence S. Each individual local alignment is a level 1 transcript
model (L1TM) and constitutes a node of a graph DACM1.
(b) This graph has three possible directed edges: same_molecule, maximal_intron_size,
and genomic_molecule_order. Each corresponds to a different relationship that connects
two nodes if they respectively: are alignments produced by the same mRNA or same protein;
are separated by a distance smaller than a user defined threshold (for example, 75
kilobases); and are collinear on the molecule of origin (mRNA or protein) and the
genomic DNA. There are nine maximal paths along the three combined edges, which reduce
DACM1 into the nine nodes (r1 to r6 and p1', p1", p2) of a graph DACM2, each representing
a level 2 transcript model (L2TM). Note that the reduction of DACM1 splits nodes p1,1
to p1,5 into two DACM2 nodes (p1' and p1") because of the absence of a genomic_molecule_order
edge between p1,3 and p1,4.
(c) DACM2 has three possible edges, inclusion, extension (for mRNAs) and genomic_overlap
(for proteins), which respectively connect two nodes if: they overlap and their overlapping
introns are identical; they overlap and their overlapping introns are identical but
the second node also extends the first in 3'; and the spans of the two nodes have overlapping
genomic coordinates. The reduction follows either the 'extension' rule for mRNA edges
or the genomic_overlap rule for protein edges and produces here the five nodes of graph DACM3
(mRNA nodes R1 to R3 and protein nodes P1 and P2), which represent level 3 transcript
models (L3TMs).
(d) DACM3 has two possible edges, genomic_overlap and compatible_splicing_structure, which connect
(combine) protein and mRNA transcript models if they respectively have overlapping
genomic coordinates and if the protein transcript model does not have any exons in
introns of the mRNA transcript model. To reduce the graph, Exogean first identifies
the paths that contain both edges; from these, the reduction consists in grouping
all nodes that are connected to the same RNA node. This generates the three nodes of
a graph DACM4 (RP1 to RP3), which represent level 4 transcript models (L4TMs). These
L4TMs are the final transcript models generated by the DACM expert annotation.
(e) Graphical representation of the DACM expert annotation output: the final transcript
models RP1 to RP3 are represented on the genomic sequence S. No information has been
lost during the three graph reductions. Note that transcript models produced by the
DACM component of Exogean are not yet final, and will be further examined and potentially
extended when looking for splicing and start/stop signals.
Djebali et al. Genome Biology 2006 7(Suppl 1):S7 doi:10.1186/gb-2006-7-s1-s7 |
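
The first reduction described in panel (b), from DACM1 to DACM2, can be illustrated with a minimal sketch. This is not Exogean's code: the `Alignment` fields, the function names and the coordinates are hypothetical, only forward-strand alignments are considered, and the 75 kilobase threshold is the example value quoted in the caption. Two alignments are chained only when all three edges (same_molecule, maximal_intron_size, genomic_molecule_order) hold, and each maximal chain becomes one L2TM.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Alignment:
    """One local alignment (an L1TM). Field names are hypothetical."""
    molecule: str  # identifier of the aligned mRNA or protein, e.g. "p1"
    g_start: int   # start on the genomic sequence S
    g_end: int     # end on the genomic sequence S
    m_start: int   # start on the mRNA or protein
    m_end: int     # end on the mRNA or protein


MAX_INTRON = 75_000  # example threshold from the caption (user defined)


def linked(a, b, max_intron=MAX_INTRON):
    """True if all three DACM1 edges connect a to b (forward strand assumed)."""
    same_molecule = a.molecule == b.molecule
    maximal_intron_size = (b.g_start - a.g_end) < max_intron
    # Collinear: a precedes b both on the genome and on the molecule.
    genomic_molecule_order = a.g_start < b.g_start and a.m_start < b.m_start
    return same_molecule and maximal_intron_size and genomic_molecule_order


def reduce_dacm1(alignments, max_intron=MAX_INTRON):
    """Group L1TMs into maximal chains; each chain stands for one L2TM."""
    paths = []
    ordered = sorted(alignments, key=lambda a: (a.molecule, a.g_start))
    for _, group in groupby(ordered, key=lambda a: a.molecule):
        current = []
        for aln in group:
            # Start a new chain whenever one of the three edges is missing.
            if current and not linked(current[-1], aln, max_intron):
                paths.append(current)
                current = []
            current.append(aln)
        paths.append(current)
    return paths


# Invented coordinates: the fourth p1 alignment maps upstream of the third
# on the protein, so no genomic_molecule_order edge links them.
example = [
    Alignment("p1", 100, 200, 1, 30),
    Alignment("p1", 300, 400, 31, 60),
    Alignment("p1", 500, 600, 61, 90),
    Alignment("p1", 700, 800, 20, 50),
    Alignment("p1", 900, 1000, 51, 80),
    Alignment("r1", 150, 250, 1, 100),
    Alignment("r1", 350, 450, 101, 200),
]
paths = reduce_dacm1(example)
print([(p[0].molecule, len(p)) for p in paths])
```

With these invented coordinates, the five p1 alignments fall into two chains, mirroring the split of p1 into p1' and p1" noted in the caption, while the two r1 alignments remain a single chain.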