tutorial
The TFmodeller program scans a protein sequence P against a library of protein-DNA complexes and builds comparative models of P if good templates are found. These models are used to get an idea of the P-DNA interface, its evolution and the putative recognised DNA sequences. This tutorial explains how to use it in these sections:
input data
1. fully automatic mode
To run TFmodeller you need the FASTA-formatted amino acid sequence of one or more proteins known or suspected to bind to DNA, such as the FNR transcription factor in E. coli:
>P0A9E5|FNR_ECOLI Fumarate and nitrate reduction regulato...
MIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIERKKPIQKGQTLFK
AGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIGSGHHPSFAQALETSM
VCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQDMILLLSKKNAEERLAAFIYNLSRR
FAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAVKGKYITIENNDALA
QLAGHTRNVA
Once you paste the protein sequence and enter your email address, TFmodeller will scan this sequence against a weekly updated library of protein-DNA complexes, using PSI-BLAST. Each match found in this search is regarded as a template, and the BLAST alignment is then used to drive the building of comparative models of the input sequence in complex with DNA. This is a fully automated process that builds monomeric complexes and may serve as a first approximation. However, many transcription factors bind to DNA as multimeric complexes. You can model these using the user template/alignment mode.
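Before pasting a sequence it can help to check that it is clean FASTA. The following is a minimal sketch, not part of TFmodeller: the helper names `read_fasta` and `invalid_residues` are hypothetical, and the assumption that only the 20 standard amino acid letters are acceptable is ours, not a documented server rule.

```python
# Hypothetical pre-submission check; the 20-letter alphabet is an assumption.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.replace(" ", "").upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not standard amino acid letters."""
    return sorted(set(sequence) - VALID_AA)


example = """>P0A9E5|FNR_ECOLI
MIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIERKKPIQKGQTLFK
AGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIGSGHHPSFAQALETSM
"""
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence), invalid_residues(sequence))
```

Gap characters such as `-` are reported as invalid here, which is what you want for a query sequence but not for an alignment file.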
2. user template mode
Often you will have an idea of what template is best to build this model, perhaps after checking interface similarity with the 3D-footprint database search form or after reading a paper. When this happens you must save your template coordinates in PDB format and tell TFmodeller to use it in the input form, at the bottom. Typically, a PDB template file will look like this:
HEADER GENE-REGULATORY PROTEIN 12-AUG-91 1CGP
JRNL AUTH S.C.SCHULTZ,G.C.SHIELDS,T.A.STEITZ
JRNL TITL CRYSTAL STRUCTURE OF A CAP-DNA COMPLEX: THE DNA
JRNL TITL 2 IS BENT BY 90 DEGREES
JRNL REF SCIENCE V. 253 1001 1991
JRNL REFN ASTM SCIEAS US ISSN 0036-8075
REMARK (many remarks may follow...)
REMARK 2 RESOLUTION. 3.0 ANGSTROMS.
ATOM 1 N PRO A 9 32.555 55.928 33.201 1.00 82.62
ATOM 2 CA PRO A 9 31.300 56.105 32.474 1.00 82.25
ATOM 3 C PRO A 9 30.441 54.837 32.272 1.00 81.70
ATOM 4 O PRO A 9 30.717 53.724 32.761 1.00 80.50
ATOM 5 CB PRO A 9 31.739 56.735 31.148 1.00 81.60
...
In this mode, TFmodeller will extract the protein sequence contained in the PDB template and try to align it to the query input sequence using BLAST2SEQ. If the generated alignment is good enough (in terms of coverage and sequence identity), then a comparative model of the protein-DNA complex will be built.
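To see which protein sequence a template file actually contains, you can read it from the ATOM records yourself. This is a minimal sketch under stated assumptions: it relies on the fixed-column layout of standard PDB files (residue name in columns 18-20, chain identifier in column 22, residue number and insertion code in columns 23-27), and the function name `pdb_chain_sequences` is hypothetical. It is not necessarily how TFmodeller extracts the sequence.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pdb_chain_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} read from ATOM records.

    Nucleotides and other non-standard residue names are skipped, and
    HETATM records are ignored (an assumption of this sketch).
    """
    sequences = {}
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        res_name = line[17:20].strip()
        chain_id = line[21]
        res_key = (chain_id, line[22:27])  # residue number + insertion code
        if res_name not in THREE_TO_ONE or res_key in seen:
            continue  # not an amino acid, or residue already counted
        seen.add(res_key)
        sequences[chain_id] = sequences.get(chain_id, "") + THREE_TO_ONE[res_name]
    return sequences


example = "\n".join([
    "ATOM      1  N   PRO A   9      32.555  55.928  33.201  1.00 82.62",
    "ATOM      2  CA  PRO A   9      31.300  56.105  32.474  1.00 82.25",
])
print(pdb_chain_sequences(example))
```

Note that residues missing from the coordinates (for example disordered loops) will not appear in the extracted sequence, so it can be shorter than the sequence of the crystallised protein.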
To summarize, in this mode you need to: 1) paste the protein sequence of your query and
2) upload the PDB coordinates of your chosen template.
3. user template + alignment mode
TFmodeller allows you to use custom alignments of the amino acid sequence of query and template. This might be useful when you are not satisfied by the automatic alignment. The alignment must be in FASTA format as well, with the input sequence on top, followed by the template's sequence. Headers will be ignored:
>sp|P0A9E5|FNR_ECOLI monomer
MIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIERKKPIQKGQTLFK
AGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIG--SGHHPSFAQALET
SMVCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQDMILLLSKKNAEERLAAFIYNLS
RRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAVKGKYITIENNDA
LAQLAGHTRNVA
>template 1CGP chain A
---------------------------PTLEWFLSHCHIHKYP----------SKSTLIH
QGEKAETLYYIVKGSVAVLIKDEEGKEMILSYLNQGDFIGELGLFEEGQERSAWVRAKTA
CEVAEISYKKFRQLIQVNPDILMRLSAQMARRLQVTSEKVGNLAFLDVTGRIAQTLLNLA
K-QPDAMTHPDGMQIKITRQEIGQIVGCSRETVGRILKMLEDQNLISAHGKTIVV-----
------------
It is important to note that the server requires that the amino acid sequence of the aligned template exactly matches the sequence in the PDB file with the coordinates.
In order to use the server in this mode you need to: 1) paste the protein sequence of your query, 2) upload the PDB coordinates of your chosen template and 3) upload the alignment in FASTA format.
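Since the aligned template row must match the PDB sequence exactly, it is worth checking a custom alignment before uploading it. The sketch below is only illustrative: `check_alignment` is a hypothetical helper, and the definitions of identity (identical pairs over aligned pairs) and coverage (aligned pairs over query length) are assumptions, since the thresholds and formulas used by the server are not stated here.

```python
def check_alignment(query_row, template_row, pdb_sequence):
    """Check a two-row alignment against the template's PDB sequence.

    query_row and template_row are gapped strings ('-' marks a gap);
    pdb_sequence is the ungapped sequence read from the coordinates.
    """
    if len(query_row) != len(template_row):
        raise ValueError("aligned rows must have the same length")

    # The template row without gaps must reproduce the PDB sequence.
    matches_pdb = template_row.replace("-", "") == pdb_sequence

    # Columns where both query and template have a residue.
    aligned = [(q, t) for q, t in zip(query_row, template_row)
               if q != "-" and t != "-"]
    identical = sum(1 for q, t in aligned if q == t)
    query_length = len(query_row.replace("-", ""))

    # Assumed definitions, in percent.
    identity = 100.0 * identical / len(aligned) if aligned else 0.0
    coverage = 100.0 * len(aligned) / query_length if query_length else 0.0
    return {
        "template_matches_pdb": matches_pdb,
        "identity": identity,
        "coverage": coverage,
    }


report = check_alignment("MKT-LV", "-KTALI", "KTALI")
print(report)
```

If `template_matches_pdb` is false, the most common causes are residues missing from the coordinates or a template row copied from a sequence database rather than from the PDB file itself.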
4. modelling a multimeric complex
It is possible to take advantage of TFmodeller to build multimeric models, in which two or more protein chains bind to the same DNA molecule. Of course it is necessary to use a multimeric template to do this, extracted from the PDB or generated with symmetry matrices, as explained here. Here I will illustrate how to model an FNR dimer, the protein introduced earlier, which is known to be functional as a dimer. We will first obtain the sequence of the FNR dimer by concatenating two copies of the sequence, which we will paste in the window (note that with heterodimers we will concatenate two different sequences):
>sp|P0A9E5|FNR_ECOLI monomer
MIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIERKKPIQKGQTLFK
AGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIG--SGHHPSFAQALET
SMVCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQDMILLLSKKNAEERLAAFIYNLS
RRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAVKGKYITIENNDA
LAQLAGHTRNVA
MIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIERKKPIQKGQTLFK
AGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIG--SGHHPSFAQALET
SMVCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQDMILLLSKKNAEERLAAFIYNLS
RRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAVKGKYITIENNDA
LAQLAGHTRNVA
Then we need to align the FNR dimer to the dimeric PDB template, with two concatenated protein chains, A and B, and put the alignment in a FASTA formatted text file (check the PDB file here):
>sp|P0A9E5|FNR_ECOLI dimer
MIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIERKKPIQKGQTLFK
AGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIG--SGHHPSFAQALET
SMVCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQDMILLLSKKNAEERLAAFIYNLS
RRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAVKGKYITIENNDA
LAQLAGHTRNVAMIPEKRIIRRIQSGGCAIHCQDCSISQLCIPFTLNEHELDQLDNIIER
KKPIQKGQTLFKAGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDLVGFDAIG--S
GHHPSFAQALETSMVCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQDMILLLSKKNA
EERLAAFIYNLSRRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAV
KGKYITIENNDALAQLAGHTRNVA
>template 1CGP chains A,B
---------------------------PTLEWFLSHCHIHKYP----------SKSTLIH
QGEKAETLYYIVKGSVAVLIKDEEGKEMILSYLNQGDFIGELGLFEEGQERSAWVRAKTA
CEVAEISYKKFRQLIQVNPDILMRLSAQMARRLQVTSEKVGNLAFLDVTGRIAQTLLNLA
K-QPDAMTHPDGMQIKITRQEIGQIVGCSRETVGRILKMLEDQNLISAHGKTIVV-----
--------PTLEWFLSHCHIHKYPSK----------------------------------
-------STLIHQGEKAETLYYIVKGSVAVLIKDEEGKEMILSYLNQGDFIGELGLFEEG
QERSAWVRAKTACEVAEISYKKFRQLIQVNPDILMRLSAQMARRLQVTSEKVGNLAFLDV
TGRIAQTLLNLAK-QPDAMTHPDGMQIKITRQEIGQIVGCSRETVGRILKMLEDQNLISA
HGKTIVV-----------------
Finally, as explained above, you need to: 1) paste the protein sequence of your query, 2) upload the PDB coordinates of your chosen template and 3) upload the alignment in FASTA format.
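One simple way to prepare a multimeric alignment file is to align each chain separately and then concatenate the rows in the same chain order as the PDB template. The sketch below assumes exactly that; `concatenate_alignment` and `to_fasta` are hypothetical helpers, and a hand-curated alignment may well place gaps differently from a plain concatenation.

```python
def concatenate_alignment(chain_alignments):
    """Join per-chain (query_row, template_row) pairs into one alignment.

    chain_alignments must follow the chain order of the PDB template,
    e.g. [(query_A, template_A), (query_B, template_B)].
    """
    for query_row, template_row in chain_alignments:
        if len(query_row) != len(template_row):
            raise ValueError("rows of each chain must have equal length")
    query = "".join(q for q, _ in chain_alignments)
    template = "".join(t for _, t in chain_alignments)
    return query, template


def to_fasta(header, sequence, width=60):
    """Format one sequence as FASTA text with fixed-width lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Toy homodimer: the same monomer alignment is used for both chains.
monomer = ("MK-T", "-KAT")
query, template = concatenate_alignment([monomer, monomer])
print(to_fasta("query dimer", query))
print(to_fasta("template chains A,B", template))
```

For a heterodimer you would pass two different per-chain alignments instead of repeating the same one.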
output
After successfully receiving the submission data, TFmodeller will start. Results are always emailed to the user, although you can also wait and watch them appear in your browser. Results include:
- a matrix of homologous interface contacts, which shows a multiple interface alignment of the user's input sequence to one or more structurally related protein-DNA complexes. Note that this is a vertical alignment, with residues from the query on the left followed by one column per related complex. Each column shows the aligned equivalent protein residue and the contacted nucleotide. Only N-ring (purine/pyrimidine) contacts are considered here. Residues marked with * are predicted to be interface residues in the query sequence. For instance
0208 E* ECRG------SCKG--HG----RG--EC 0.91
represents residue E(Glu) 208 from the query aligned to 7 equivalent residues, two of which (from templates 1zrf_A and 1rio_H) are E residues that contact C nucleotides. The entropy of this residue in the original profile built by PSI-BLAST when searching for templates is also reported (0.91). The stats line
_ stats: contacts=13 Nring=5 specif=0.38
shows the evolutionary proportion of sequence-specific contacts for this complex (0.38), a number that can be related to the number of different DNA sequences potentially bound by this protein.
- one or more comparative models of the input sequence in complex with DNA. Sequence alignments are printed in order to highlight interface contact residues and their degree of conservation with respect to the template complex. In addition, a schematic representation of the interface is printed to help identify key residues and the recognised DNA motifs:
_1.00 0.23 1.00
_R0212A T0206A E0208A
_G A t C G c a
_: : : : : : :
_ T a G C g t
_ E0208A V0207A V0207A
_ 1.00 0.00 0.00
In this example we learn that E(Glu) 208, chain A, from the query protein sequence is very likely to be contacting two C nucleotides, as this interaction is conserved from the template, with a contact probability of 1.00. However, V(Val) 207 most likely will not contact G or T nucleotides, since this residue actually mutated with respect to the original R in the template and no base contacts can be identified for it. Nucleotides in lower case show parts of the putative DNA motif that have probably changed, as a result of mutations in contacting residues. A special case would be T(Thr) 206, chain A, found to be contacting a T nucleotide with a probability of 0.23. However, the matrix of homologous interface contacts provides further support for this contact, as there is a similar contact (TT) in the PDB:
0206 T* ST--------------SG----HC--TT 0.89
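Rows of the contact matrix are easy to process with a script. The following sketch assumes the layout seen in the examples above: four whitespace-separated fields (residue number, residue letter with an optional `*`, a string of two-character columns, and the entropy), where `--` means no aligned contact. The function name `parse_contact_row` is hypothetical and the layout is inferred from the sample output, not from a format specification.

```python
def parse_contact_row(row):
    """Parse one row of the matrix of homologous interface contacts."""
    fields = row.split()
    position = int(fields[0])
    is_interface = fields[1].endswith("*")
    residue = fields[1].rstrip("*")
    columns = fields[2]
    entropy = float(fields[3])

    # Each related complex occupies two characters: residue + nucleotide.
    contacts = {}
    for index in range(0, len(columns), 2):
        pair = columns[index:index + 2]
        if pair != "--":
            contacts[index // 2 + 1] = (pair[0], pair[1])

    return {
        "position": position,
        "residue": residue,
        "is_interface": is_interface,
        "contacts": contacts,  # {column number: (amino acid, nucleotide)}
        "entropy": entropy,
    }


row = parse_contact_row("0208 E* ECRG------SCKG--HG----RG--EC 0.91")
conserved = [col for col, (aa, nt) in row["contacts"].items()
             if aa == row["residue"]]
print(row["position"], row["residue"], len(row["contacts"]), conserved)
```

The column numbers returned here correspond to the numbered entries of the `PDBs:` lines in the output, so column 1 is 1zrf_A and column 14 is 1rio_H in the example.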
Each comparative model is attached in PDB format to the results email in compressed form (.tgz format). Programs such as Rasmol, PyMOL or DeepView can be used to display them.
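If you prefer to unpack the attachment with a script rather than a desktop tool, Python's standard `tarfile` module can read .tgz archives. This is a minimal sketch; `read_pdb_models` is a hypothetical helper and the assumption that model files end in `.pdb` is based on the file names shown in the sample output.

```python
import tarfile


def read_pdb_models(archive_path):
    """Return {member name: PDB text} for every .pdb file in a .tgz archive."""
    models = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".pdb"):
                handle = archive.extractfile(member)
                models[member.name] = handle.read().decode("utf-8", "replace")
    return models


# Usage (the file name is whatever was attached to your results email):
# models = read_pdb_models("my_results_compressed_models.tgz")
# for name, text in models.items():
#     print(name, text.count("\nATOM"))
```

Reading members in memory avoids extracting files with unexpected paths onto disk.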
# sequence library: /home1/tfmodell/db/dna_complexes_PDB.fas (Sun Mar 4 10:44:13 2007)
> P0A9E5_FNR_ECOLI_980419 number of comparative complexes = 1
_Matrix of homologous interface contacts:
_ stats: contacts=13 Nring=5 specif=0.38 entropy=0.67
_ PDBs: 1:1zrf_A,2:1cf7_A,3:1qbj_A,4:1sfu_A,5:2heo_A,6:1je8_A,7:1zlk_A,
_ PDBs: 8:1b8i_B,9:1tc3_C,10:2h27_A,11:1k61_A,12:2glo_A,13:2hdd_A,14:1rio_H,
_ -lnE: 1:25.1,2:8.4,3:8.1,4:8.1,5:7.1,6:6.0,7:5.6,8:5.5,9:5.3,10:5.2,
_ -lnE: 11:5.2,12:5.0,13:5.0,14:4.7,
_        1 2 3 4 5 6 7 8 91011121314
0196 R  ------------------YG-------- 1.33
0197 G* QT--------------HG---------- 0.88
0206 T* ST--------------SG----HC--TT 0.89
0207 V* RG--------------RG----RG--RG 0.67
0208 E* ECRG------SCKG--HG----RG--EC 0.91
0209 T  --RG------TATG----TG--QA--RT 1.02
0211 S  --YC------KTKA--RTRG--QG--RC 0.71
0212 R* RG--RG----VCNGNT--SGNC--IAQT 1.65
0213 L  ----------HTYA-------------- 0.60
0215 G  ----YGYGYGKG------FTST--KG-- 0.66
0216 R  --------------NA--RGNA--NA-- 0.59
0219 K  --------------------RG------ 0.66
0220 S  --------------RG------------ 0.37
model 1zrf_A 203 DNACOMPLEX resol=2.10 %ID=21 e-value=3e-56
_query    LDQLDNIIERKKPIQKGQTLFKAGDELKSLYAIRSGTIKSYTITEQGDEQITGFHLAGDL
_template LEWFLSHCHIHKYPSKS-TLIHQGEKAETLYYIVKGSVAVLIKDEEGKEMILSYLNQGDF
_contacts ................. ..........................................
_
_query    VGFDAIGS--GHHPSFAQALETSMVCEIPFETLDDLSGKMPNLRQQMMRLMSGEIKGDQD
_template IGELGLFEEGQERSAWVRAKTACEVAEISYKKFRQLIQVNPDILMRLSAQMARRLQVTSE
_contacts ........  ..................................................
_
_query    MILLLSKKNAEERLAAFIYNLSRRFAQRGFSPREFRLTMTRGDIGNYLGLTVETISRLLG
_template KVGNLAFLDVTGRIAQTLLNLAKQ-PDAMTHPDGMQIKITRQEIGQIVGCSRETVGRILK
_contacts ........................ ................*........***...*...
_
_query    RFQKSGMLAVKGKYITI
_template MLEDQNLISAHGKTIVV
_contacts .................
_
_stats: 5/5 aligned contacting residues, 2/5 conserved
_modelled protein-DNA interface (N-ring contacts):
_
_1.00 0.23 1.00
_R0212A T0206A E0208A
_G A t C G c a
_: : : : : : :
_ T a G C g t
_ E0208A V0207A V0207A
_ 1.00 0.00 0.00
_
_template reference: A.A.NAPOLI et al. J.MOL.BIOL. V. 357 173 2006
_template_info: INDIRECT READOUT OF DNA SEQUENCE AT THE PRIMARY-KINK
_template_info: SITE IN THE CAP-DNA COMPLEX: RECOGNITION OF PYRIMIDINE-PURINE
_template_info: AND PURINE-PURINE STEPS.
_PDB model file P0A9E5_FNR_ECOLI_980419-1zrf_A.pdb
_compressed PDB models file P0A9E5_FNR_ECOLI_980419_compressed_models.tgz
# TFmodeller : emailing results for P0A9E5_FNR_ECOLI_980419
how does it work?
The figure shows a flow chart of TFmodeller, exposing all steps involved in a modelling job.
Performance analysis
It is important to acknowledge key observations that affect the value of results generated by TFmodeller:
- Generally, structurally related protein-DNA complexes have similar interfaces, as seen in figure 1, where two Drosophila homeodomain transcription factors have been superposed.
- Interface similarity decreases as the protein sequences being compared diverge, as seen in figure 2, where a non-redundant collection of protein-DNA complexes was used to perform all vs. all comparisons, yielding a pair of points per comparison: a median protein (P) deviation and a median nucleotide (N) deviation (ref).
- However, not all interfaces are equally conserved, as suggested by figure 3, in which we see how different DNA/RNA-binding protein folds conserve their interfaces (ref). TFmodeller has been mostly benchmarked using 3-helical bundle proteins, which include well known folds or motifs such as HTH or homeodomains. However, we haven't specifically benchmarked Zn-finger proteins.
- Proteins with similar interfaces recognise similar DNA motifs, but the predictive value of modelled interfaces decreases linearly as templates diverge, as suggested by figure 4 (ref).
- The accuracy of modelled interface side-chains also depends on the sequence divergence between the sequence being modelled and the template used. We benchmarked this effect by modelling 2193 interface hydrogen-bonding residues from the complexes compared in figure 2. By classifying complexes in terms of %sequence identity between template and target, we were able to compile the following tables, which summarize the reliability (REL, frequency of conserved contacts after modelling), the deviations (RMSD, in angstroms) and the number of observations (obs) for each residue type in four %sequence identity intervals: 20-39 (0.2), 40-59 (0.4), 60-79 (0.6) and 80-100 (0.8):
source: sc_nr95_25022007_16.matrices

REL   0.2  0.4  0.6  0.8     RMSD  0.2  0.4  0.6  0.8     obs   0.2 0.4 0.6 0.8
ASP   0.50 0.57 0.00 ----    ASP   1.78 1.35 1.60 ----    ASP   028 021 003 ---
ASN   0.63 0.75 0.86 0.86    ASN   1.11 0.70 1.24 0.28    ASN   252 061 014 022
LYS   0.52 0.65 0.90 0.70    LYS   1.47 1.07 0.66 0.68    LYS   204 175 010 010
TYR   0.40 0.38 0.50 ----    TYR   2.44 2.00 3.54 ----    TYR   015 008 002 ---
GLU   0.50 0.70 0.60 0.80    GLU   1.51 1.10 0.76 1.18    GLU   115 122 005 005
ARG   0.48 0.73 0.63 0.71    ARG   2.13 1.41 1.20 1.16    ARG   507 283 054 021
CYS   1.00 1.00 ---- ----    CYS   0.90 0.99 ---- ----    CYS   002 001 --- ---
THR   0.23 0.40 0.40 0.80    THR   1.28 0.90 0.70 0.39    THR   013 005 005 010
HIS   0.42 0.18 0.33 ----    HIS   1.88 2.10 2.00 ----    HIS   036 011 006 ---
GLN   0.31 0.67 0.90 0.71    GLN   1.93 1.20 1.00 0.39    GLN   051 039 010 007
SER   0.39 0.18 0.56 ----    SER   2.05 1.57 1.56 ----    SER   038 011 009 ---
TRP   ---- ---- 1.00 ----    TRP   ---- ---- 0.46 ----    TRP   --- --- 002 ---

Total observations = 2193
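The REL table can be used as a simple lookup when judging an individual modelled contact. The sketch below copies the REL values from the table above; the helper names `identity_bin` and `contact_reliability` are hypothetical, and returning `None` outside the benchmarked range (below 20% identity, or cells marked `----`) is a choice of this sketch rather than anything TFmodeller reports.

```python
# REL values per residue type for the identity bins 0.2, 0.4, 0.6, 0.8.
# None stands for '----' (no observations in the benchmark).
REL = {
    "ASP": (0.50, 0.57, 0.00, None),
    "ASN": (0.63, 0.75, 0.86, 0.86),
    "LYS": (0.52, 0.65, 0.90, 0.70),
    "TYR": (0.40, 0.38, 0.50, None),
    "GLU": (0.50, 0.70, 0.60, 0.80),
    "ARG": (0.48, 0.73, 0.63, 0.71),
    "CYS": (1.00, 1.00, None, None),
    "THR": (0.23, 0.40, 0.40, 0.80),
    "HIS": (0.42, 0.18, 0.33, None),
    "GLN": (0.31, 0.67, 0.90, 0.71),
    "SER": (0.39, 0.18, 0.56, None),
    "TRP": (None, None, 1.00, None),
}


def identity_bin(percent_identity):
    """Map %sequence identity to a column index (0..3), or None below 20%."""
    if percent_identity < 20:
        return None
    if percent_identity < 40:
        return 0
    if percent_identity < 60:
        return 1
    if percent_identity < 80:
        return 2
    return 3


def contact_reliability(residue, percent_identity):
    """Look up REL for a three-letter residue name at a given %identity."""
    index = identity_bin(percent_identity)
    if index is None or residue not in REL:
        return None
    return REL[residue][index]


# The sample model above was built at %ID=21, so the 20-39 bin applies.
print(contact_reliability("GLU", 21))
print(contact_reliability("ARG", 21))
```

Keep in mind that some cells rest on very few observations (see the obs columns), so values such as CYS or TRP should be read with caution.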