Table 1

Summary of key features of MotifCluster and a selection of other programs that perform clustering of motifs or remote homology detection

| Strategy | Program | Overview of program | Publication |
| --- | --- | --- | --- |
| Clustering proteins by motifs they contain | MotifCluster | Takes aligned or unaligned protein and nucleotide sequences and a MEME file showing motifs; allows clustering of the sequences according to the motifs they contain, and visualization of the motifs on the aligned and unaligned sequences and three-dimensional structures | This article |
| Clustering of transcription factor binding sites (in DNA) | MCAST | Takes list of transcription factor binding sites as input: uses hidden Markov models to find cis-regulatory modules in DNA | [21] |
| | Cluster-Buster | Takes list of transcription factor binding sites as input: uses Forward algorithm and expected uniform distribution to find motif co-occurrence in DNA | [22] |
| | ClusterDraw | Takes list of transcription factor binding sites as input: uses r-scan algorithm and sweep over parameter values to visualize significant clusters as peaks on the DNA sequence | [23] |
| | COMET | Calculates significance of collection of position-specific score matrices that appear in order: can apply to DNA or protein, in principle | [24] |
| | PEAKS | Calculates significance of collection of transcription factor binding sites that appear at specified distance from transcription start site or other feature in the DNA | [25] |
| | CompMoby | Aligns all pairs of motifs that appear significant in different promoters, then groups these into clusters using the CAST algorithm: DNA-specific | [26] |
| | CREME | Identifies groups of DNA motifs that co-occur significantly within a defined distance using both order-dependent and order-independent models | [27] |
| | PHYLOCLUS | Uses Bayesian method to find clusters of evolutionarily conserved DNA motifs that appear in different promoters | [28] |
| | INCLUSive | Clusters genes based on microarray analysis: feeds promoters to Gibbs sampler to find DNA motifs overrepresented in each cluster | [29] |
| Identifying kernels for SVMs* | SVM kernels | Introduces kernels based on k-word occurrences and best BLAST hit for SVM clustering: does not focus on conserved motifs | [30] |
| | WCM (word correlation matrices) | Introduces k-word kernel for SVM clustering based on correlations in appearance of pairs of k-words: does not focus on conserved motifs | [31] |
| | ODH (oligomer distance histograms) | Introduces new kernel for SVM clustering based on histograms of distances between all words in protein: does not focus on conserved motifs | [32] |
| Iterative BLAST | Shotgun | BLAST-based approach for identifying remote homologs by iterative searches: not motif-based | [3] |
| | DivergentSet | Among other features, can perform BLAST and PSI-BLAST versions of Shotgun and choose representative sequences of each group: not motif-based | [20] |
| | Cascade PSI-BLAST | Performs iterative steps of PSI-BLAST, otherwise like Shotgun: not motif-based | [33] |
| | ProClust | Performs graph-based connection of proteins based on pairwise sequence similarity: not motif-based | [34] |
| k-word clustering | CD-Hit | Clusters proteins based on shared segments of overall sequence, not by motifs already known to be significant | [35] |
| Profile-profile alignment | COMPASS | Performs profile-profile alignments for remote homology detection: assesses statistical significance of matches between the profiles overall, rather than specifically using shared motifs | [1] |
| Clustering of motifs | STAMP | Aligns motifs with one another so that relationships among motifs can be detected; performs many other tasks for promoter characterization, but specific to promoters | [36] |
| | TAMO | Performs many functions for cis-regulatory analysis: is able to cluster DNA motifs with one another | [37] |
| | SOMBRERO | Aligns and clusters DNA motifs with one another to improve transcription factor binding site searches | [38] |
| Identification of functions in labeled structures | FunClust | Takes set of three-dimensional structures with annotated functions; identifies three-dimensional motif fragments that are common to the structures with each function | [39] |

*SVMs are support vector machines, a common machine learning approach to pattern classification. A kernel is a function that calculates the inner product of a pair of input vectors in an abstract feature space; evaluating it for all pairs of inputs is an important step in the process, and the choice of kernel affects the clustering.
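
To make the footnote's definition concrete, the sketch below shows one simple kernel of the k-word type: each sequence is represented by its counts of overlapping k-words, and the kernel is the inner product of two such count vectors. This is only an illustration under stated assumptions. The function names (`kword_counts`, `kword_kernel`, `gram_matrix`) and the default `k=2` are hypothetical, and the code is not the specific kernel of any program listed in the table.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-word (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_kernel(a, b, k=2):
    """Inner product of the k-word count vectors of sequences a and b.

    Hypothetical illustration of a k-word kernel; k=2 is an arbitrary choice.
    """
    counts_a = kword_counts(a, k)
    counts_b = kword_counts(b, k)
    # Only k-words present in both sequences contribute to the sum.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def gram_matrix(seqs, k=2):
    """Evaluate the kernel for all pairs of sequences (the kernel matrix)."""
    return [[kword_kernel(a, b, k) for b in seqs] for a in seqs]
```

The matrix returned by `gram_matrix` is symmetric, and a matrix of this kind is what an SVM implementation would consume in place of explicit feature vectors.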

Hamady et al. Genome Biology 2008 9:R128   doi:10.1186/gb-2008-9-8-r128