Open Access Research

Quantifying the mechanisms of domain gain in animal proteins

Marija Buljan*, Adam Frankish and Alex Bateman

Author Affiliations

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK

Genome Biology 2010, 11:R74  doi:10.1186/gb-2010-11-7-r74

Published: 15 July 2010

Additional files

Additional file 1:

Further discussion of different types of domain gain events as classified in Figure 2.

Format: DOC Size: 30KB

Open Data

Additional file 2:

Flowchart of (a) methods and (b) analysis for the set of high-confidence domain gain events and for the set of medium-confidence domain gains. The number of gained domains remaining after each filtering step is noted in (a). In some cases several domains were gained at the same time; hence, the number of gain events examined for the high-confidence domain gains differs from the number of gained domains.

Format: EPS Size: 1.5MB

Open Data

Additional file 3:

Distribution of domain gain events according to the position of the domain insertion and the number of exons gained, for the set of high-confidence domain gains and the set of medium-confidence domain gains. (a) The distribution of characteristics of domains from the high-confidence set of domain gains; it is identical to that in Figure 2. (b) The distribution of characteristics of domains from the set of medium-confidence domain gains. In total there are 330 high-confidence domain gain events and 849 medium-confidence domain gains (of which 19 gains have an ambiguous position and are not shown in the graph). The flowchart in Additional file 2 shows the procedure used to create these two sets of domain gains. The distribution of domain gains in the medium-confidence set (b) is similar to that in the high-confidence set; the main difference is a larger number of middle domain gains. We believe that this is largely due to false domain gain calls, which arise when some proteins in the TreeFam families lack Pfam annotations for domains that are actually present in them.

Format: EPS Size: 2.6MB

Open Data

Additional file 4:

Analysis of supporting evidence for the representative transcripts for domain gain events.

Format: DOC Size: 31KB

Open Data

Additional file 5:

A table listing high-confidence domain gain events.

Format: DOC Size: 359KB

Open Data

Additional file 6:

Analysis of evidence for retroposition and middle insertions by intronic recombination as mechanisms for domain gain.

Format: DOC Size: 38KB

Open Data

Additional file 7:

Fusion of adjacent genes, and non-allelic homologous recombination (NAHR) as a mechanism that preceded gene fusions. Discussion of evidence for NAHR as a mechanism that frequently assisted domain gains.

Format: DOC Size: 32KB

Open Data

Additional file 8:

Examples of domain gains by joining of exons from adjacent genes.

(a) TreeFam family TF323983 contains Cadherin EGF LAG seven-pass G-type receptor (CELSR) precursor genes. One branch of the family, containing vertebrate genes, has gained the Sulfate transport and STAS domains in addition to the ancestral cadherin, EGF and other extracellular domains. The gain occurred after the other vertebrates diverged from fish, and homologues without the gained domains are present in all animals. A representative for the gain is the transcript CELSR3-207 (ENST00000383733); its 3' end is shown on the left-hand side (the whole transcript is too long to be presented clearly). A plausible donor of these domains is shown on the right-hand side: the gene SLC26A4 (ENSG00000091137) contains both domains, and its STAS domain is 31% identical to that in the CELSR3 gene. In addition, the alignment with the zebrafish genome is shown below the CELSR3-207 transcript. The yellow arrows represent the alignment with chromosome 8 in zebrafish, and the pink arrows that with chromosome 6 (information taken from the UCSC browser). The alignment with the fish genome shows that the synteny is broken exactly in the region where the new domain is gained. Therefore, the plausible scenario for the domain gain involves gene duplication, recombination and joining of newly adjacent exons.

(b) Another example of a domain gain after gene duplication and exon joining. Family TF334740 in the TreeFam database contains genes that code for the Rho-guanine nucleotide exchange factor (RhoGEF). However, the RhoGEF domain was not present in the ancestral protein; it was inserted later, together with the C1_1 domain, when mammals diverged from other vertebrates (TreeFam release 6.0, which we used in the analysis, had chicken, fish and frog genes without the gained domains). The representative transcript for the gain event is AC093283.3-201 (ENST00000296794). The gene ARHGEF18 (ENSG00000104880) has both of these domains, and the RhoGEF domains of the two genes are 52% identical. Hence, ARHGEF18 is a plausible donor for this gain event. Again, the mechanism for the gain of these domains most likely involves gene duplication and exon joining.

(c) An example of a domain gain after segmental duplication and exon joining. TreeFam family TF351422 contains only primate genes, and after a gene duplication event one branch of the family gained the PTEN_C2 domain. A representative transcript for this gain is AL354798.13-202 (ENST00000381866). A few segmental duplications span the gene AL354798.13, and one of them covers only the ancestral portion of the gene, without the gained domain. The paired copy of that segmental duplication lies on the gene's paralogue that has not gained the domain, the gene AP000365.1 (ENSG00000206249). Hence, a possible scenario is that a recent duplication of a paralogous gene changed its genetic environment and brought it into proximity of the PTEN_C2 domain, which subsequently became part of the gene.

(d) Another example of a gain of a domain-coding region by segmental duplication followed by exon joining. A branch with primate genes in the TF340491 family of vertebrate proteins that contain the KRAB domain has gained the additional HATPase_c domain. The representative transcript is the human PMS2L3-202 (ENST00000275580). The HATPase_c domain exists in the gene PMS2 (ENSG00000122512), and at the protein level the gained domain is 98% identical to the sequence in the protein product of the PMS2 transcript PMS2-001. A segmental duplication spans the gained sequence in the transcript PMS2L3-202 and is paired with the segmental duplication that covers the same domain in the gene PMS2. The paired segmental duplication regions are presented as grey boxes connected with arrows. Therefore, the mechanism underlying this gain appears to be a segmental duplication of the sequence belonging to PMS2, after which the copy adjacent to the ancestor of PMS2L3-202 was joined with it. An important caveat is that PMS2L3-202 has a structure that can be targeted by nonsense-mediated decay (NMD).

Format: EPS Size: 1.2MB

Open Data

Additional file 9:

A table listing domains that are classified as having been gained by insertion of new exon(s) into the introns of ancestral genes.

Format: DOC Size: 75KB

Open Data

Additional file 10:

A table listing significant Gene Ontology terms for human genes that have been extended with a new protein domain during evolution.

Format: DOC Size: 115KB

Open Data