An application for maximum likelihood superpositioning and analysis of macromolecular structures

Theseus is a program that simultaneously superimposes multiple macromolecular structures. Instead of using the conventional least-squares criterion, Theseus finds the optimal solution to the superposition problem using the method of maximum likelihood (ML). By downweighting variable regions of the superposition and by correcting for correlations among atoms, the ML superpositioning method produces much more accurate results than least squares.
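To make the idea of downweighting variable regions concrete, here is a minimal Python sketch of a weighted pairwise superposition, in which each atom is weighted by the inverse of an assumed positional variance. This is not the Theseus algorithm: Theseus superimposes many structures at once, estimates the variances and correlations from the data, and iterates. The function name `weighted_superpose` and the example variances are illustrative assumptions.

```python
import numpy as np

def weighted_superpose(mobile, target, weights):
    """Fit `mobile` onto `target` (both n_atoms x 3) by a rigid-body
    transform that minimizes the weighted sum of squared deviations."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to 1
    mobile_center = w @ mobile           # weighted centroids
    target_center = w @ target
    X = mobile - mobile_center
    Y = target - target_center
    H = (X * w[:, None]).T @ Y           # weighted 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation,
    # not a reflection.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return X @ R + target_center

# Illustrative data: a rotated and translated copy of a random structure.
rng = np.random.default_rng(0)
target = rng.normal(size=(10, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
mobile = target @ rot + np.array([1.0, -2.0, 0.5])

# Assumed per-atom variances; variable atoms get small weights.
variances = rng.uniform(0.1, 5.0, size=10)
fitted = weighted_superpose(mobile, target, 1.0 / variances)
```

With equal weights this reduces to an ordinary least-squares fit; unequal weights let well-ordered atoms dominate the fit, which is the intuition behind downweighting variable regions.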

When superpositioning macromolecules with different residue sequences, other programs and algorithms discard residues that are aligned with gaps. Theseus, however, uses a novel maximum likelihood superposition algorithm that includes all of the data. To use Theseus to superposition homologous proteins whose sequences differ in length (e.g., when the protein sequences align with gaps and insertions), a sequence alignment must be provided. We supply a wrapper script, theseus_align (linked below), that calls Theseus, extracts the proper sequences from the PDB files, aligns them, and performs the superposition using that alignment. Future versions of Theseus will address the much harder structural alignment problem by simultaneously finding the best alignment and superposition using the method of maximum likelihood.
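As an illustration of how a sequence alignment defines which residues correspond and where data are missing, the following Python sketch maps each alignment column to a residue index, or to None where a sequence has a gap. The function name `alignment_columns`, the gap character, and the toy sequences are assumptions made for this example; the way Theseus and theseus_align represent alignments internally may differ.

```python
def alignment_columns(aligned_seqs, gap="-"):
    """For each aligned sequence, return a list with one entry per
    alignment column: the 0-based residue index, or None for a gap."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    table = []
    for seq in aligned_seqs:
        next_residue = 0
        row = []
        for char in seq:
            if char == gap:
                row.append(None)          # no residue here: missing data
            else:
                row.append(next_residue)  # residue occupies this column
                next_residue += 1
        table.append(row)
    return table

# Toy alignment of two sequences with one gap each.
columns = alignment_columns(["ACD-F", "A-DEF"])
```

A least-squares program would typically keep only the columns where no entry is None; a method that includes all of the data keeps every column and treats the None entries as missing observations.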

Figure: least-squares (LS) vs. maximum likelihood (ML) superposition of the Kunitz domain

A conventional least-squares superposition of the Kunitz domain from PDB ID 1adz is shown at left. A maximum likelihood superposition from Theseus is shown at center. At right is the first principal component of the superposition plotted on the family of models. The red loops at lower right are highly correlated with each other, whereas they are moderately anti-correlated with the light blue strands at left center.
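The principal component described above can be sketched in Python as follows. Given an ensemble that has already been superposed, the code builds an atom-by-atom correlation matrix and takes its leading eigenvector; entries with the same sign move together, and entries with opposite signs are anti-correlated. This is a simplified sketch, not the estimator Theseus uses: the function name `first_principal_component`, the choice of a correlation matrix, and the synthetic ensemble are all assumptions.

```python
import numpy as np

def first_principal_component(ensemble):
    """ensemble: array of shape (n_models, n_atoms, 3), already superposed.
    Returns (largest eigenvalue, eigenvector) of the atomic correlation
    matrix. Assumes every atom has nonzero variance."""
    n_models = ensemble.shape[0]
    dev = ensemble - ensemble.mean(axis=0)   # deviations from mean structure
    # Atom-by-atom covariance, pooled over models and over x, y, z.
    cov = np.einsum("mik,mjk->ij", dev, dev) / (3 * n_models)
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)            # normalize to correlations
    evals, evecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    return evals[-1], evecs[:, -1]

# Synthetic three-atom ensemble: atoms 0 and 1 move together,
# atom 2 moves in the opposite direction.
rng = np.random.default_rng(1)
shift = rng.normal(size=(20, 3))
base = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ensemble = base[None, :, :] + np.stack([shift, shift, -shift], axis=1)

top_eval, pc1 = first_principal_component(ensemble)
```

In this toy case `pc1` has the same sign for atoms 0 and 1 and the opposite sign for atom 2, which mirrors the correlated loops and anti-correlated strands in the figure.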


Douglas Theobald


"Optimal simultaneous superpositioning of multiple structures with missing data."
Theobald, Douglas L. & Steindel, Philip A. (2012) Bioinformatics 28 (15): 1972-1979 [Open Access]

"Accurate structural correlations from maximum likelihood superpositions."
Theobald, Douglas L. & Wuttke, Deborah S. (2008) PLOS Computational Biology 4(2):e43 [Open Access]

Latest Version - THESEUS 3.0.0 (2014 May 13)

Version 3: Differences from version 2 include (1) an improved algorithm, (2) a slight change in the target criterion, which now maximizes a marginal likelihood (with the covariance matrix integrated out) instead of a joint likelihood and should improve stability in certain rare pathological cases, and (3) extensive code restructuring and streamlining.

UNIX C source code, licensed under the GPLv3 open source license.
Requires an ANSI C compiler (preferably GNU GCC) to compile and a working installation of the GNU Scientific Library (GSL) to link against.
Download source (1.2 Mb)

Macintosh OS X Universal binary.

Linux generic x86 binary executable.

'theseus_align' script

This very useful wrapper script runs THESEUS on multiple PDB files when the proteins (or nucleic acids) are of different lengths. For example, you will probably want to use this script when superpositioning structurally similar homologous proteins (having different sequences). This script transparently extracts the proper sequences from the PDB files, aligns them, and then performs the ML superposition based on that alignment. Examples are given in the examples directory provided with the source code and binaries. In general, the command will look something like:
theseus_align -f protein1.pdb protein2.pdb
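
For batch use, the command above can be assembled and launched from Python. The sketch below uses only the `-f` option shown in the example; the function names are hypothetical, and it assumes that theseus_align is on the PATH and that more than two PDB files can be listed after `-f` in the same way.

```python
import shutil
import subprocess

def build_theseus_align_command(pdb_files, executable="theseus_align"):
    """Return the argument list for a theseus_align run on the given files."""
    pdb_files = list(pdb_files)
    if len(pdb_files) < 2:
        raise ValueError("need at least two PDB files to superpose")
    return [executable, "-f", *pdb_files]

def run_theseus_align(pdb_files):
    """Run theseus_align; raises if the script is missing or exits non-zero."""
    command = build_theseus_align_command(pdb_files)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("theseus_align was not found on PATH")
    return subprocess.run(command, check=True)

command = build_theseus_align_command(["protein1.pdb", "protein2.pdb"])
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.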