                     *************************
                     **                     **
                     **  COILS version 2.2  **
                     **                     **
                     *************************
                           by A. N. Lupas
                     programmed by J. M. Lupas
  1. Introduction
  2. Input file formats
  3. Scoring options
  4. Weighting options
  5. Output options
  6. Performance:
    A. Database statistics
    B. Highscoring sequences in globular proteins
    C. Performance on coiled coils
    D. Limits of the method
  7. Recommendations for using the program


1. INTRODUCTION

COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.

COILS is described in:

    Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils from Protein Sequences
    Science 252:1162-1164.

    Lupas, A. (1996) Prediction and Analysis of Coiled-Coil Structures
    Meth. Enzymology 266:513-525.

and is based on a prediction protocol proposed by David Parry:

    Parry, D. A. D. (1982) Coiled-coils in alpha-helix-containing proteins:
    analysis of the residue types within the heptad repeat and the use of these data in the prediction of coiled coils in other proteins
    Biosci. Rep. 2:1017-1024.


2. INPUT FILE FORMATS

COILS accepts files in the following formats:

(a) GCG:

P1;MULI_ERWAM - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR (MUREIN-LIPOPROTEIN)
ID MULI_ERWAM STANDARD; PRT; 78 AA.
AC P02939;
DT 21-JUL-1986 (REL. 01, CREATED)
DT 21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT 01-APR-1988 (REL. 07, LAST ANNOTATION UPDATE)
DE MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR (MUREIN-LIPOPROTEIN).
OS ERWINIA AMYLOVORA.
OC PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC ENTEROBACTERIACEAE.
RN [1]
RP SEQUENCE FROM N.A.
RM 81117327
RA YAMAGATA H., NAKAMURA K., INOUYE M.;
RL J. BIOL. CHEM. 256:2194-2198(1981).
DR EMBL; J01577; EALPP.
DR PIR; A03439; NPWCWY.
DR PROSITE; PS00013; PROKAR_LIPOPROTEIN.
KW SIGNAL; OUTER MEMBRANE; LIPOPROTEIN; DUPLICATION.
FT SIGNAL 1 20
FT CHAIN 21 78 MUREIN-LIPOPROTEIN.
FT LIPID 21 21 N-ACYL DIGLYCERIDE.
FT REPEAT 24 34
FT REPEAT 38 48
SQ SEQUENCE 78 AA; 8369 MW; 24285 CN;
Muli_Erwam Length: 78 January 21, 1994 16:04 Type: P Check: 4477 ..
1 MNRTKLVLGA VILGSTLLAG CSSNAKIDQL STDVQTLNAK VDQLSNDVTA
51 IRSDVQAAKD DAARANQRLD NQAHSYRK

(b) Pearson (FASTA):

>MULI_ECOLI - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MKATKLVLGAVILGSTLLAGCSSNAKIDQLSSDVQTLNAKVDQLSNDVNAMRSDVQAAKD
DAARANQRLDNMATKYRK

(c) user-defined:

The program recognizes the start of a sequence by a > at the beginning or a [space,space,dot,dot] at the end of the line preceding the sequence. The program recognizes the end of a sequence by a *, a //, or by the end-of-file character. The program accepts sequences in upper- and lower-case letters and ignores all spaces, numbers and other characters not representing an amino acid. If a file contains several proteins, the end of each sequence but the last must be marked by * or by //:

>M_ECOLI P1;MULI_ECOLI - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MKATKLVLGAVILGSTLLAGCSSNAKIDQLSSDVQTLNAKVDQLSNDVNAMRSDVQAAKD
DAARANQRLDNMATKYRK*
>M_ERWAM P1;MULI_ERWAM - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MNRTKLVLGAVILGSTLLAGCSSNAKIDQLSTDVQTLNAKVDQLSNDVTAIRSDVQAAKD
DAARANQRLDNQAHSYRK*
>M_MORMO P1;MULI_MORMO - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MGRSKIVLGAVVLASALLAGCSSNAKFDQLDNDVKTLNAKVDQLSNDVNAIRADVQQAKD
EAARANQRLDNQVRSYKK
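
The recognition rules for user-defined files can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not the parser used by COILS; the function name read_sequences and the restriction to the 20 standard one-letter codes are assumptions made for the example.

```python
# Sketch of the sequence-recognition rules described above (not the COILS source).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard one-letter codes


def read_sequences(text):
    """Return a list of (name, sequence) pairs found in `text`."""
    records = []
    name, residues, in_sequence = "", [], False

    def close_record():
        nonlocal name, residues, in_sequence
        if in_sequence and residues:
            records.append((name, "".join(residues)))
        name, residues, in_sequence = "", [], False

    for line in text.splitlines():
        # A sequence starts after a line beginning with ">" or ending in "  ..".
        if line.startswith(">") or line.endswith("  .."):
            close_record()
            tokens = line.lstrip(">").split()
            name = tokens[0] if tokens else ""
            in_sequence = True
            continue
        if not in_sequence:
            continue  # annotation lines before a header are skipped
        # A "*" or "//" ends the sequence; the rest of that line is dropped.
        cuts = [i for i in (line.find("*"), line.find("//")) if i >= 0]
        body = line[:min(cuts)] if cuts else line
        # Case is ignored; spaces, digits and other characters are discarded.
        residues.extend(c for c in body.upper() if c in AMINO_ACIDS)
        if cuts:
            close_record()

    close_record()  # the end of the file also terminates the last sequence
    return records
```

Note that this sketch is more lenient than the rule above: it also closes a sequence when a new header line appears without a preceding * or //.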



3. SCORING OPTIONS

After asking for input and output filenames, the program will offer the choice of two scoring matrices that it can compare a sequence to:

MTK - is a matrix derived from the sequences of myosins, tropomyosins and keratins (intermediate filaments type I and II). It is the one described in Science, 252:1162 (1991).

MTIDK - is a new matrix derived from myosins, paramyosins, tropomyosins, intermediate filaments type I - V, desmosomal proteins and kinesins. The matrix was compiled by weighting the residue frequencies of the different protein families according to the following scheme:

0.2 MYOSINS - 0.5 myosins
            - 0.5 paramyosins
0.2 TROPOMYOSINS
0.2 INTERMEDIATE FILAMENTS - 0.2 type I (keratin)
                           - 0.2 type II (keratin)
                           - 0.2 type III (desmin, vimentin, GFAP, peripherin)
                           - 0.2 type IV (NF light, medium and heavy chains)
                           - 0.2 type V (lamins A and B)
0.2 DESMOSOMAL PROTEINS - 0.33 desmoplakin
                        - 0.33 plectin
                        - 0.33 hemidesmosomal plaque prot. (bullous pemphigoid)
0.2 KINESINS
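
The two-level scheme above implies an effective weight for each individual protein group (family weight times the share within the family). The following Python sketch works this out. The dictionary layout and names are illustrative only, and 0.33 is assumed to stand for one third.

```python
# Two-level weighting scheme from the table above. The layout of this
# dictionary is an illustration, not the program's internal data structure.
ONE_THIRD = 1.0 / 3.0  # assumption: the table's 0.33 stands for one third

SCHEME = {
    "myosins": (0.2, {"myosins": 0.5, "paramyosins": 0.5}),
    "tropomyosins": (0.2, {"tropomyosins": 1.0}),
    "intermediate filaments": (0.2, {
        "type I": 0.2, "type II": 0.2, "type III": 0.2,
        "type IV": 0.2, "type V": 0.2,
    }),
    "desmosomal proteins": (0.2, {
        "desmoplakin": ONE_THIRD, "plectin": ONE_THIRD,
        "hemidesmosomal plaque protein": ONE_THIRD,
    }),
    "kinesins": (0.2, {"kinesins": 1.0}),
}


def effective_weights(scheme):
    """Multiply each family weight by the share of each member group."""
    weights = {}
    for family, (family_weight, members) in scheme.items():
        for member, share in members.items():
            weights[f"{family}/{member}"] = family_weight * share
    return weights


weights = effective_weights(SCHEME)
for name, weight in weights.items():
    print(f"{name:52s} {weight:.3f}")
print(f"{'total':52s} {sum(weights.values()):.3f}")
```

Under these assumptions each keratin type contributes 0.04 to the matrix, whereas the tropomyosins and the kinesins contribute 0.2 each.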

While the MTIDK matrix provides for a somewhat better resolution between the scores of globular and coiled-coil proteins as well as for a more consistent evaluation of the different families of coiled-coil proteins, the MTK matrix yields fewer highscoring segments in a database of globular sequences (see Section 6: PERFORMANCE). Current data are consistent with the assumption that the MTK matrix is more specific for two-stranded structures and that the MTIDK matrix gives a more realistic assessment for other types of coiled coils.


4. WEIGHTING OPTIONS

Because coiled coils are generally fibrous, solvent-exposed structures, all but the internal a and d positions have a high likelihood of being occupied by hydrophilic residues. A program that gives equal weight to all positions is therefore going to be biased towards hydrophilic, charge-rich sequences. While this does not pose a problem for the vast majority of natural sequences, some highly charged sequences obtain high coiled-coil probabilities in the obvious absence of heptad periodicity and coiled-coil-forming potential. An extreme case is that of polyglutamate which obtains a coiled-coil-forming probability > 99%.

To counter this problem, COILS2 contains a weighting option, which allows the user to assign the same weight to the two hydrophobic positions a and d as to the five hydrophilic positions b, c, e, f and g. This leads to an only slightly worse performance of the program (see Section 6: PERFORMANCE) and permits the identification of the class of false positives described above. It is recommended to run a weighted and unweighted scan in parallel and to compare the outputs. A drop of more than 20-30% in the probability is a clear indication of a highly-charged false positive.

Two examples (window=21, probabilities abbreviated to the first digit):

sequence EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
MTK      99999999999999999999999999999999999999999999999999
MTIDK    99999999999999999999999999999999999999999999999999
MTK_W    00000000000000000000000000000000000000000000000000
MTIDK_W  00000000000000000000000000000000000000000000000000
sequence DDEKRKEKKDKKEKEKERRREKEKKEKEKEKERREKKKRKREEDDEEKKE
MTK      47888999999999999999999999999999999999999998766666
MTIDK    99999999999999999999999999999999999999999999999999
MTK_W    00000000000000000000000000000000000000000000000000
MTIDK_W  00111111122222333333333333333333333222222220000000

In many cases a 21 residue scan yields clearer results than a 28 residue scan.
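
The comparison of weighted and unweighted scans can be automated along the following lines. This Python sketch is not part of COILS. The function names are hypothetical, the drop is interpreted as an absolute difference in probability, and the default threshold of 0.25 is simply the midpoint of the 20-30% range given above.

```python
def digits_to_probabilities(row):
    """Turn an abbreviated row such as '9987' into lower-bound probabilities."""
    return [int(digit) / 10.0 for digit in row]


def flag_charge_artifacts(unweighted, weighted, min_drop=0.25):
    """Return residue indices whose probability drops by more than `min_drop`
    when the weighted matrix is used (assumption: absolute drop)."""
    if len(unweighted) != len(weighted):
        raise ValueError("both scans must cover the same residues")
    return [
        i for i, (p, p_w) in enumerate(zip(unweighted, weighted))
        if p - p_w > min_drop
    ]


# First ten residues of the second example above (MTIDK vs. MTIDK_W).
mtidk = digits_to_probabilities("9999999999")
mtidk_w = digits_to_probabilities("0011111112")
print(flag_charge_artifacts(mtidk, mtidk_w))
```

For the charged sequence shown above every residue is flagged, since the weighted probabilities fall from at least 0.9 to at most 0.2.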

As an alternative, it is possible to use the auxiliary program ALLFRAME. This program lists the scores (not probabilities) of a sequence in all seven frames. The presence and strength of a heptad periodicity can be inferred directly from the difference between the highest-scoring frame and all others:

   1 E  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99
   2 E  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99
   3 E  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99
   4 E  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99  c 1.99
   5 E  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99  c 1.99  d 1.99
   6 E  f 1.99  g 1.99  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99
   7 E  g 1.99  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99
   8 E  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99
   9 E  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99
  10 E  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99
.....
   1 D  a 1.21  b 1.38  c 1.51  d 1.42  e 1.58  f 1.18  g 1.14
   2 D  b 1.38  c 1.39  d 1.51  e 1.42  f 1.63  g 1.19  a 1.16
   3 E  c 1.40  d 1.41  e 1.58  f 1.52  g 1.66  a 1.22  b 1.26
   4 K  d 1.40  e 1.41  f 1.58  g 1.52  a 1.66  b 1.29  c 1.26
   5 R  e 1.46  f 1.41  g 1.58  a 1.52  b 1.66  c 1.31  d 1.27
   6 K  f 1.46  g 1.41  a 1.58  b 1.52  c 1.66  d 1.31  e 1.27
   7 E  g 1.46  a 1.41  b 1.58  c 1.52  d 1.66  e 1.31  f 1.27
   8 K  a 1.46  b 1.41  c 1.58  d 1.52  e 1.66  f 1.31  g 1.27
   9 K  b 1.46  c 1.41  d 1.58  e 1.52  f 1.66  g 1.31  a 1.27
  10 D  c 1.46  d 1.41  e 1.58  f 1.52  g 1.66  a 1.31  b 1.27
.....

In both examples, the absence of a heptad periodicity is obvious. For comparison, here are scores for the GCN4 leucine zipper; the heptad frame with the leucines in position d is immediately apparent.

.....
  30 L  b 0.79  c 1.17  d 1.91  e 0.98  f 0.94  g 1.21  a 1.17
  31 E  c 0.79  d 1.17  e 1.91  f 0.98  g 0.96  a 1.21  b 1.17
  32 D  d 0.79  e 1.17  f 1.91  g 0.98  a 0.96  b 1.21  c 1.17
  33 K  e 0.79  f 1.17  g 1.91  a 0.98  b 1.02  c 1.21  d 1.17
  34 V  f 0.79  g 1.17  a 1.91  b 1.02  c 1.02  d 1.21  e 1.17
  35 E  g 0.79  a 1.17  b 1.91  c 1.02  d 1.02  e 1.21  f 1.17
  36 E  a 0.78  b 1.17  c 1.91  d 1.02  e 1.02  f 1.21  g 1.17
  37 L  b 1.02  c 1.17  d 1.91  e 1.02  f 1.02  g 1.19  a 1.17
  38 L  c 1.02  d 1.17  e 1.91  f 1.02  g 1.02  a 1.19  b 1.17
  39 S  d 1.02  e 1.17  f 1.91  g 1.02  a 1.02  b 1.17  c 1.17
  40 K  e 1.02  f 1.17  g 1.91  a 1.02  b 1.02  c 1.17  d 1.17
  41 N  f 1.02  g 1.17  a 1.91  b 1.02  c 1.02  d 1.06  e 1.17
  42 Y  g 1.02  a 1.17  b 1.91  c 1.02  d 1.02  e 1.02  f 1.17
  43 H  a 1.02  b 1.17  c 1.91  d 1.02  e 1.02  f 1.02  g 1.17
  44 L  b 1.02  c 1.17  d 1.91  e 1.02  f 1.02  g 1.02  a 1.17
  45 E  c 1.02  d 1.17  e 1.91  f 1.02  g 1.02  a 1.02  b 1.17
  46 N  d 1.02  e 1.17  f 1.91  g 1.02  a 1.02  b 1.02  c 1.17
  47 E  e 1.02  f 1.17  g 1.91  a 1.02  b 1.02  c 1.02  d 1.17
  48 V  f 1.02  g 1.10  a 1.91  b 1.02  c 1.02  d 1.02  e 1.17
  49 A  g 1.02  a 1.10  b 1.91  c 1.02  d 1.02  e 1.02  f 1.17
  50 R  a 1.02  b 1.10  c 1.91  d 1.02  e 1.02  f 1.02  g 1.17
  51 L  b 1.02  c 1.04  d 1.91  e 1.02  f 1.02  g 1.02  a 1.17
.....
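
The difference between the highest-scoring frame and the runner-up can be read off such listings automatically. The Python sketch below assumes the line layout shown in the examples above (residue number, residue, then seven frame/score pairs); the function names are hypothetical and not part of ALLFRAME.

```python
def parse_allframe_line(line):
    """Split one listing line into (position, residue, {frame: score})."""
    fields = line.split()
    position, residue = int(fields[0]), fields[1]
    frames = fields[2::2]                       # a, b, c, ...
    scores = [float(x) for x in fields[3::2]]   # the score after each frame
    return position, residue, dict(zip(frames, scores))


def frame_margin(scores):
    """Return the best frame and its lead over the second-best frame."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_frame, best), (_, runner_up) = ranked[0], ranked[1]
    return best_frame, best - runner_up


examples = [
    "   1 E  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99",
    "   8 K  a 1.46  b 1.41  c 1.58  d 1.52  e 1.66  f 1.31  g 1.27",
    "  37 L  b 1.02  c 1.17  d 1.91  e 1.02  f 1.02  g 1.19  a 1.17",
]
for line in examples:
    position, residue, scores = parse_allframe_line(line)
    frame, margin = frame_margin(scores)
    print(f"{position:4d} {residue}  best frame {frame}  margin {margin:.2f}")
```

With the lines taken from the listings above, the polyglutamate residue has no margin at all, the charged sequence has a margin of about 0.08, and Leu37 of GCN4 leads in frame d by about 0.72.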



5. OUTPUT OPTIONS

COILS2 offers four output options:

The default option gives residue number, residue type and the frame and coiled-coil-forming probability obtained in scanning windows of 14, 21 and 28 residues:

.....
   61 E        c  0.317       c  0.379       c  0.562
   62 L        d  0.317       d  0.379       d  0.562
   63 E        e  0.317       e  0.379       e  0.562
   64 L        f  0.167       f  0.379       f  0.562
   65 T        c  0.472       c  0.598       g  0.562
   66 H        d  0.472       d  0.740       a  0.562
   67 R        e  0.916       e  0.740       e  0.677
   68 K        f  0.943       f  0.740       f  0.677
   69 M        g  0.943       g  0.740       g  0.677
   70 K        a  0.943       a  0.740       a  0.677
   71 D        b  0.943       b  0.740       b  0.677
.....

Option a is similar to the default option, except that the results are displayed in rows. As a result, residue numbers are indicated by a scale above the sequence, probabilities are abbreviated to the first digit (but 100% is also 9) and the frames for the three scans are listed below the probabilities. This option gives a good overview of the location of peaks in a protein:

.....
61
    .    |    .    |    .    |    .    |    .    |    .    |
ELELTHRKMKDAYEEEIKHLKLGLEQRDHQIASLTVQQQRQQQQQQQVQQHLQQQQQQLA
111144999999999999999777770000000000000000000333333333333332
333357777777777777777777772222222200004444444444444444444443
111112666666666666666666666666666654422222222222222222222222
cdefcdefgabcdefgabcdefgabcdefgabcdefdefgabcdebcdefgabcdefgab
cdefcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgab
cdefcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgab
.....
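
The abbreviation used in the row display can be sketched as follows. This is an illustration in Python, not the program's own code; the function name abbreviate is hypothetical, and "first digit" is taken to mean the first digit after the decimal point, obtained by truncation.

```python
def abbreviate(probabilities):
    """Render probabilities (0.0 to 1.0) as one digit per residue.

    The first decimal digit is kept (truncation, not rounding) and a
    probability of 100% is capped at 9, as described above.
    """
    return "".join(str(min(9, int(p * 10))) for p in probabilities)


# Middle probability column of the default listing above, residues 61-71.
middle_column = [0.379, 0.379, 0.379, 0.379, 0.598,
                 0.740, 0.740, 0.740, 0.740, 0.740, 0.740]
print(abbreviate(middle_column))  # prints 33335777777
```

The result agrees with the start of the second probability row in the display above.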

Option b asks the user for the size of the scanning window and returns scores only. This option allows the user to inspect the scores behind the probabilities given in the previous options and to scan sequences with window sizes for which no statistics are currently available. For an application, see Seo, J. and Cohen, C. (1993) Pitch diversity in alpha-helical coiled coils, Proteins 15:223-234.

.....
        61 E      c     1.59
        62 L      d     1.59
        63 E      e     1.59
        64 L      f     1.50
        65 T      c     1.65
        66 H      d     1.65
        67 R      e     1.90
        68 K      f     1.94
        69 M      g     1.94
        70 K      a     1.94
        71 D      b     1.94
.....

Option c is useful for scanning very large proteins or files containing many proteins as it only displays (in default format) sequences with coiled-coil-forming probabilities above a cutoff value that is set by the user.


6. PERFORMANCE

A. Database statistics

The following is a synopsis of the score distributions for the PDB and coiled-coil databases. The score distributions are approximated by Gaussians and the means and standard deviations of the Gaussians are given. PDB is a database of globular sequences from The Protein Data Bank (32,592 res.) described in Science 252:1162. The combined coiled-coil database contains 26,965 residues from various coiled-coil proteins (see Section 3: SCORING OPTIONS) and will be described in detail in print. Obviously, every family of coiled-coil proteins was scored with a scoring matrix that excluded residue frequencies from that family.

                       28 residue scan   21 residue scan   14 residue scan
                         mean   std.dev.   mean   std.dev.   mean   std.dev.
PDB           MTK        0.77   0.20       0.83   0.24       0.94   0.29
              MTIDK      0.80   0.18       0.86   0.21       0.95   0.26
              MTK_W      0.79   0.23       0.86   0.26       1.00   0.33
              MTIDK_W    0.86   0.18       0.92   0.22       1.04   0.27
Coiled coils  MTK        1.63   0.22       1.70   0.25       1.79   0.30
              MTIDK      1.69   0.18       1.74   0.23       1.82   0.28
              MTK_W      1.70   0.24       1.76   0.28       1.88   0.34
              MTIDK_W    1.74   0.20       1.79   0.24       1.89   0.30

From these numbers, several conclusions can be drawn:

  • The difference between the mean scores in PDB and in coiled coils is slightly larger with the MTIDK matrix than with the MTK matrix. More importantly, the standard deviation of the score distribution is lower with the MTIDK matrix for both databases. This means that the MTIDK matrix yields a more consistent evaluation of globular and coiled-coil sequences and provides for a better resolution between the two score distributions. Not shown here is that the MTIDK matrix also improves the score of intermediate filament sequences relative to the scores of other coiled-coil sequences, thus providing for a more balanced scoring of the different families of coiled-coil proteins than the MTK matrix.
  • For both matrices, weighting slightly decreases the resolution between the globular and coiled-coil score distributions.
  • For all scoring methods, the resolution between the globular and coiled-coil score distributions decreases strongly with decreasing size of the scanning window.
  • The difference in performance between the MTK matrix and the MTIDK matrix is small although the MTIDK matrix is derived from over twice the number of residues and many more protein families. I conclude that little further progress can be expected from even larger coiled-coil databases.
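
The conclusions above can be checked numerically from the table. The Python sketch below makes two assumptions that are not spelled out in this document: "resolution" is measured as the distance between the two means divided by the root-mean-square of the two standard deviations, and a score is converted to a probability by comparing the two Gaussians with a prior ratio of 30 globular residues per coiled-coil residue. Both the separation measure and the ratio are placeholders for illustration; the cited papers should be consulted for the procedure actually used by COILS.

```python
import math

# (mean, std.dev.) for the 28-, 21- and 14-residue scans, from the table above.
PDB = {
    "MTK":     [(0.77, 0.20), (0.83, 0.24), (0.94, 0.29)],
    "MTIDK":   [(0.80, 0.18), (0.86, 0.21), (0.95, 0.26)],
    "MTK_W":   [(0.79, 0.23), (0.86, 0.26), (1.00, 0.33)],
    "MTIDK_W": [(0.86, 0.18), (0.92, 0.22), (1.04, 0.27)],
}
COILED = {
    "MTK":     [(1.63, 0.22), (1.70, 0.25), (1.79, 0.30)],
    "MTIDK":   [(1.69, 0.18), (1.74, 0.23), (1.82, 0.28)],
    "MTK_W":   [(1.70, 0.24), (1.76, 0.28), (1.88, 0.34)],
    "MTIDK_W": [(1.74, 0.20), (1.79, 0.24), (1.89, 0.30)],
}


def separation(globular, coiled):
    """Assumed resolution measure: mean difference over RMS standard deviation."""
    (mean_g, sd_g), (mean_c, sd_c) = globular, coiled
    return (mean_c - mean_g) / math.sqrt((sd_g ** 2 + sd_c ** 2) / 2.0)


def gaussian(x, mean, sd):
    """Normal probability density at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def coiled_coil_probability(score, globular, coiled, ratio=30.0):
    """Probability from the two Gaussians; `ratio` (globular : coiled-coil
    residues) is an assumed prior, not a value stated in this document."""
    g = gaussian(score, *globular)
    c = gaussian(score, *coiled)
    return c / (ratio * g + c)


print("method     28     21     14")
for method in PDB:
    row = "  ".join(f"{separation(g, c):5.2f}"
                    for g, c in zip(PDB[method], COILED[method]))
    print(f"{method:8s} {row}")
```

With this measure the separation is larger for MTIDK than for MTK, slightly smaller for the weighted matrices, and falls steadily from the 28 to the 14 residue scan, in line with the conclusions listed above.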

B. Highscoring sequences in globular proteins

I scored release 13.0 (8/93) of the NRL_3D database (containing the sequences of proteins of known structure from PDB) with all four scoring methods and counted the number of segments obtaining probabilities >10%. The database contained 539 nonredundant protein sequences and excluded the coiled-coil proteins tropomyosin, hemagglutinin, GCN4, Gal4 and apolipoprotein E. Apolipoprotein E was included with the coiled-coil subset because its helices are very long compared to those of other helical bundles and because it forms a partly three-stranded structure. All other helical bundles were included with the globular proteins because their helices are short and frequently packed at irregular angles. These features generally prevent their detection by this algorithm although several helices from four-helix bundles appear as high-scoring segments in the following table. Results are compared to the number of segments obtained in a database of sequences generated by means of a random number generator (see Science 252:1162).

(1 - MTK; 2 - MTIDK; 3 - MTK_W; 4 - MTIDK_W)

                      NRL_3D                               RANDOM SEQUENCES

             28 res.        21 res.        14 res.      28      21      14
           1  2  3  4     1  2  3  4     1  2  3  4    1  2    1  2    1  2
  10-19%   8  5 11 13    37 22 24 35    96 85 99 85    1  2   12 10   51 60
  20-29%   4  1  5  3    18 14 23 14    47 33 51 45    2  1   10  5   21 26
  30-39%   2  0  2  4    14  8  9  9    29 35 42 21    2  0    7  4   14 14
  40-49%   4  0  2  5     6  2 15 10    21 14 17 19    1  0    2  1    8  9
  50-59%   2  2  1  1     1  4  4  7    11  9 11 14    0  0    1  0   10  9
  60-69%   1  0  3  6     3  4  7  5     9 11 12 14    0  0    0  0    5  6
  70-79%   3  2  2  1     4  1  6  1    12  7 12 13    0  0    2  1    6  4
  80-89%   1  2  3  1     3  4  3  4    10 14  8 18    0  0    1  2    2  5
  >= 90%   1  3  1  1     4  9  6  7    11 20  8 15    2  2    2  2    5  7

In this table, the number of segments per 10% increment levels off above 50% rather than decreasing continuously. This is due to the sigmoid shape of the curve that relates scores to probabilities which masks a continuing decrease in number of segments per score interval. Above 50%, the number of segments per 10% increment doubles from around 2 in the 28 res. scan to around 4 in the 21 res. scan and then triples to around 12 in the 14 res. scan. A similar progression at a lower level is observed for the random sequence database. This progression is due to the significantly poorer resolution of smaller scanning windows. The difference in numbers between PDB and random sequences is attributable to amphipathic helices that are frequently present in native proteins but are not a preferred element of random sequences. Outside the tail end of the score distribution seen in this table, the score distributions of PDB and random sequences are superimposable (see Science 252:1162). This means that the real resolution between the globular and coiled-coil score distributions is slightly lower than the nominal resolution.
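
The averages quoted in this paragraph can be recomputed from the table. The Python sketch below assumes that the left-hand twelve columns of the table are the counts for the structure database (methods 1-4 for each window size); the names used are illustrative.

```python
# Rows 50-59% to >= 90% of the table above, left-hand (structure database)
# block only; each list holds the counts for methods 1-4.
ABOVE_50 = {
    "50-59%": {28: [2, 2, 1, 1], 21: [1, 4, 4, 7], 14: [11, 9, 11, 14]},
    "60-69%": {28: [1, 0, 3, 6], 21: [3, 4, 7, 5], 14: [9, 11, 12, 14]},
    "70-79%": {28: [3, 2, 2, 1], 21: [4, 1, 6, 1], 14: [12, 7, 12, 13]},
    "80-89%": {28: [1, 2, 3, 1], 21: [3, 4, 3, 4], 14: [10, 14, 8, 18]},
    ">= 90%": {28: [1, 3, 1, 1], 21: [4, 9, 6, 7], 14: [11, 20, 8, 15]},
}


def mean_segments_per_bin(table, window):
    """Average count per 10% increment and scoring method for one window size."""
    counts = [n for row in table.values() for n in row[window]]
    return sum(counts) / len(counts)


for window in (28, 21, 14):
    print(f"{window}-residue scan: {mean_segments_per_bin(ABOVE_50, window):.2f}")
```

Under this reading the averages are 1.85, 4.35 and 11.95 segments, matching the figures of around 2, 4 and 12 given above.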

The weighted matrices are less reliable than the unweighted matrices.

The MTK matrix yields fewer highscoring segments at probabilities >90% than the MTIDK matrix and thus appears more reliable even though its nominal resolution is poorer. This is probably an incorrect conclusion. As is detailed in the next paragraph, there are now several examples of sequences that do not assume a coiled-coil (or even alpha-helical!) structure under normal circumstances but that have the potential to do so if their context is changed. It therefore appears likely that the sequences which are assigned elevated coiled-coil probabilities by the COILS program actually do have the potential to form coiled coils even though they do not do so in the protein context or under the conditions in which the structure was determined. The larger number of high-scoring segments with the MTIDK matrix would then be the result of an increased sensitivity of this matrix.

Virtually all segments with probabilities above 50% in 21 and 28 residue scans are centered on a surface helix although several contain two discontinuous helices rather than one continuous helix. Several of the helices are from four-helix bundles and thus have coiled-coil characteristics. Following recent developments, it is increasingly likely that most (if not all) of these high-scoring sequences have an elevated coiled-coil-forming potential and could form coiled coils in a different context. This follows from three recent results:

  1. A loop segment of influenza hemagglutinin, pH7, which was predicted by COILS to have elevated coiled-coil potential, in fact forms a coiled coil in the pH4 structure (Bullough et al., Nature 371:37, 1994).
  2. The basic region of bZip transcription factors, which is not even alpha-helical in the absence of DNA, can be converted into a coiled coil by a designed peptide (Krylov et al., EMBO J. 14:5329, 1995).
  3. A peptide from topoisomerase II, which was identified using COILS, forms a coiled coil in solution but not in the structure of the full protein (Frere et al., J. Biol. Chem. 270:17502, 1995).

Nevertheless, the decreased coiled-coil-forming potential of these sequences relative to "constitutive" coiled coils can be seen from the fact that they score highly in one method but generally much lower in at least one of the other methods; example: 5LDH - lactate dehydrogenase:

seq     CAISILGKSLTDELALVDVLEDKLKGEMMDLQHGSLFLQTP
MTK     00112444444444444444444444444444411000000
MTK_W   35678999999999999999999999999999911000000
MTIDK   00000000000000000000000000000000000000000
MTIDK_W 00012333333333333333333333333333300000000

and several segments drop considerably in score from a 28 residue scan to a 21 residue scan; example: 2TS1 - tyrosyl-tRNA synthetase:

seq PEKRAAQKTLAEEVTKLVHGEEALRQAIRIS
14  0001111111111111100000000000000
21  0222222222222222222222222220000
28  0777777777777777777777777777721

The latter effect is observed particularly if a segment contains two discontinuous helices. These effects can be taken as indicators for a decreased likelihood of coiled-coil formation since neither effect is normally observed in coiled coils, as can be seen in part C of this section.
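
Both indicators amount to a large spread between rows of abbreviated probabilities, either across scoring methods or across window sizes. A minimal Python sketch follows. The function name digit_spread is hypothetical, and no cutoff is applied because none is given here.

```python
def digit_spread(rows):
    """Per-residue difference between the highest and lowest digit
    across several rows of abbreviated probabilities."""
    spreads = []
    for column in zip(*rows):
        digits = [int(d) for d in column]
        spreads.append(max(digits) - min(digits))
    return spreads


# 5LDH, four scoring methods (MTK, MTK_W, MTIDK, MTIDK_W), from the example above.
ldh = [
    "00112444444444444444444444444444411000000",
    "35678999999999999999999999999999911000000",
    "00000000000000000000000000000000000000000",
    "00012333333333333333333333333333300000000",
]
# 2TS1, windows of 14, 21 and 28 residues, from the example above.
tyrs = [
    "0001111111111111100000000000000",
    "0222222222222222222222222220000",
    "0777777777777777777777777777721",
]
print("5LDH max spread across methods:", max(digit_spread(ldh)))
print("2TS1 max spread across windows:", max(digit_spread(tyrs)))
```

For a constitutive coiled coil such as coil-Ser in part C, where all four rows consist of 9s, the spread is zero throughout.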

C. Performance on coiled coils

In the following, secondary structure (c = coiled-coil helix) and coiled-coil-forming probabilities are shown beneath the sequences as scored by MTK, MTIDK, MTK_W and MTIDK_W in that order. The values were obtained with a 21 residue scanning window which appears to spot the ends of coiled-coil segments somewhat more accurately than a 28 residue window. (For spotting the ends of coiled coil helices, see also the documentation for the auxiliary program CAPS). The coiled coils in Gal4, GreA and human mannose-binding protein were analyzed with a 14 residue window because of their short length. Tropomyosin is not shown; it obtains probabilities >99% over its entire length except for the C-terminal 20 residues.

(C1) parallel, two-stranded structures

>GCN4 bZip (Cell 71:1223)
MKDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
hhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccccccccccccccccccccccccc
0000000000222779999999999999999999999999999999999999988330
0000011111777999999999999999999999999999999999999999988110
0000000000000224555566699999999999999999999999999999999770
0000000000000889999999999999999999999999999999999999999770

Similar probabilities (>99%) are obtained for the bZip regions of Fos and Jun (see Meth. Enzymology 266:513). As seen here, the ends of coiled-coil segments may be overpredicted significantly in the absence of strong flanking helix-breaking residues. This is a particular problem in bZip proteins, where the coiled coil follows continuously out of the basic-region helix. Note, though, that the basic region also has some coiled-coil-forming potential, as demonstrated by Krylov et al. (EMBO J. 14:5329, 1995).

>Max b-HLH-Zip (Nature 363:38)
ADKRAHHNALERKRRDHIKDSFHSLRDSVPSLQGEKASRAQILDKATEYIQYMRRKNDTH
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhh         hhhhhhhhhhhhhhccccccc
000000000000000000000000000000000000000000111112288889999999
000000000000000000000000000000000000000000000001199999999999
000000000000000000000000000000000000000000000011155556888999
000000000000000000000000000000000000000000111113388889999999
QQDIDDLKRQNALLEQQVRALEKARSSAQLQT
ccccccccccccccccccccc
99999999999999999999999999999884
99999999999999999999999999999996
99999999999999999999999999988771
99999999999999999999999999999992
>Gal4 (Nature 356:408)
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEF
          hhhhhhhh         hhhhhhhh              ccccccccccccccc
000000000000000000000000000000000000000000000000014888888888888882
000000000000000000000000000000000000000000000000017999999999999992
000000000000000000000000000000000000000000000000006888888888888884
000000000000000000000000000000000000000000000000008999999999999995

COILS works well for parallel two-stranded structures (independently of the scoring method used) if they are solvent-exposed. The parallel two-stranded coiled coil buried in CAP is entirely invisible to this program because of the absence of a heptad repeat.

(C2) antiparallel, two-stranded structures

>Seryl-tRNA synthetase - Escherichia coli (Nature 347:249)
MLDPNLLRNEPDAVAEKLARRGFKLDVDKLGALEERRKVLQVKTENLQAERNSRSKSIGQ
                          cccccccccccccccccccccccccccccccchh
000000000000000000000000003888888888888888888888888882100000
000000000000000000000000003999999999999999999999999993000000
000000000000000000000000003777777777777777777777773330000000
000000000000000000000000004889999999999999999999998880000000
AKARGEDIEPLRLEVNKLGEELDAAKAELDALQAEIRDIALTIPNLPADEVPVG......
hhhh    cccccccccccccccccccccccccccccccccc
000000000099999999999999999999999999999999900000000000
000007788899999999999999999999999999999999988800000000
000000000099999999999999999999999999999999955500000000
000089999999999999999999999999999999999999999933100000
>Seryl-tRNA synthetase - Thermus thermophilus (JMB 234:222)
MVDRKRLRQEPEVFHRAIREKGVALDLEALLALDREVQELKKRLQEVQTERNQVAKRVPK
                          ccccccccccccccccccccccccccccccc
000000000000000000011124599999999999999999999999999999999910
000000000000000000000013499999999999999999999999999999999986
000000000000000000022236699999999999999999999999999998887700
000000000000000000000014599999999999999999999999999999999954
APPEEKEALIARGKALGEEAKRLEEALREKEARLEALLLQVPLPPWPGAPVG........
   ccccccccccccccccccccccccccccccccccccc
0008888888888999999999999999999999999999920000000000
4009999999999999999999999999999999999999997000000000
0002224444444999999999999999999999999999932000000000
1005556677777999999999999999999999999999999000000000
>GreA transcript cleavage factor (Nature 373:636)
MQAIPMTLRGAEKLREELDFLKSVRRPEIIAAIAEAREHGDLKENAEYHAAREQQGFCEGRIKDIEAKLSNAQVID
     sscccccccccccccccc-ccccccccccccc        cccccccccccccccccccccccccc  ss
0000011366666666666666664200000000000000000000000000000002999999999999998730
0000011388888888888888888500000001111111111111100000000004999999999999997710
0000022688888888888888885300000000000000000000000000000000777777777777776630
0000033899999999999999999800000000000000000000000000000000777777777777776620

GreA resembles seryl-tRNA synthetase in its structural organization. It is currently the only known coiled-coil structure with a true skip residue (Val34). The high scores in the two coiled-coil helices correspond to the segment of coiled coil that is located between the skip and the globular part of the protein.

>Replication terminator protein (Cell 80:651)
MKEEKRSSTGFLVKQRAFLKLYMITMTEQERLYGLKLLEVLRSEFKEIGFKPNHTEVYRSL
             hhhhhhhhhhhhhhhh ssss hhhhhhhhhhh       hhhhhhhh
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
HELLDDGILKQIKVKKEGAKLQEVVLYQFKDYEAAKLYKKQLKVELDRCKKLIEKALSDNF
hhhhh   sssssss       sssssss hhhhhhhhhhccccccccccccccccccccc
0000000000000000000000000001111133666666666666666666666655540
0000000000000000000000000000033333444488888888888888888888880
0000000000000000000000000002222233555555555555555555555533320
0000000000000000000000000001133344555588888888888888888888882

COILS is also generally reliable in the analysis of antiparallel two-stranded coiled coils, but does not detect the DNA-binding coiled coil in serum response factor (Nature 376:490), which, because of its special function, has a very distinct residue distribution.

(C3) parallel, three-stranded structures

>hemagglutinin (Nature 333:426 and 371:37)
GLFGAIAGFIENGWEGMIDGWYGFRHQNSEGTGQAADLKSTQAAIDQINGKLNRVIEKTN
                                     hhhhhhhhhhhhhhhhhh        pH7
                                       ccccccccccccccccccccc   pH4
000000000000000000000000000000001223466666666666666666666658
000000000000000000000000000000000222455555555555667888888889
000000000000000000000000000000000122344444444444444444444402
000000000000000000000000000000000111222222222222222222222211
EKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFE
               ccccccccccccccccccccccccccccccccccccccccccccc   pH7
ccccccccccccccccccccccccccccccccccccccccccccc       hhhhhhhh   pH4
999999999999999999999999999988800000000000000000000144444444
999999999999999999999999999766611111110000000000000288888888
333377777788888888888888888888800000000000000000000000000000
333355555555555555555555555555533333331000000000000033333333
KTRRQLRENAEEMGNGCFKIYHKCDNACIESIRNGTYDHDVYRDEALNNRFQIKG
cccccc                                                         pH7
hhhhhhhhh                                                      pH4
4444444444444220000000000000000000000000000000000000000
8888888888888440000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000
3333333333333110000000000000000000000000000000000000000

Influenza hemagglutinin is a complex structure which undergoes a large structural transition between pH7 and pH4. There are multiple lines of evidence that the structure at pH7 is only metastable.

>Mannose-binding protein A, rat (Structure 2:1227)
AIEVKLANMEAEINTLKSKLELTNKLHAFSMGKKSGKKFFVTNHERMPFSKVKALCSELRGTVAIPRNAEENKAI
cccccccccccccccccccccccccccccc        sssssssss hhhhhhhhhh   ss     hhhhhhh
999999999999999999999999997731000000000000000000000000000000000000000000000
999999999999999999999999998830000000000000000000000000000000000000000000000
999999999999999999999999993320000000000000000000000000000000000000000000000
999999999999999999999999995520000000000000000000000000000000000000000000000
QEVAKTSAFLGITDEVTEGQFMYVTGGRLTYSNWKKDEPNDHGSGEDCVTIVDNGLWNDISCQASHTAVCEFPA
hhhh   ssssss        ss                       sssss     ssss     sssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
>Mannose-binding protein C, human (Nature Struct. Biol. 1:789)
        AASERKALQTEMARIKKWLTFSLGKQVGNKFFLTNGEIMTFEKVKALCVKFQASVATPRNAAENGAI
          cccccccccccccccccccc  sss  ssssssssssshhhhhhhhhh   ss     hhhhhhh
        2246666666666666600000000000000000000000000000000000000000000000000
        5579999999999999900000000000000000000000000000000000000000000000000
        2222222222222222200000000000000000000000000000000000000000000000000
        5555555555555555500000000000000000000000000000000000000000000000000
QNLIKEEAFLGITDEKTEGQFVDLTGNRLTYTNWNEGEPNNAGSDEDCVLLLKNGQWNDVPCSTSHLAVCEFPI
hhh    ssssss        ss                        ssss     ssss    sssssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000

(C4) antiparallel, three-stranded structures

>coil-Ser (Science 259:1288)
EWEALEKKLAALESKLQALEKKLEALEHG
ccccccccccccccccccccccccccccc
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999

This is an unusual homotrimeric structure that was obtained as a by-product of the design of a two-stranded coiled coil.

>spectrin (Science 262:2027)
NLDLQLYMRDCELAESWMSAREAFLNADDDANAGGNVEALIKKHEDFDKAINGHEQKIAA
cccccccccccccccccccccccccccc        cccccccccccccccccccccccc
000000000000000000000000000000111114466667777777777777777777
000000000000000000000000000000000003355558888888888888888888
000000000000000000000000000000000000011111111117777777777777
000000000000000000000000000000000000011113333339999999999999
LQTVADQLIAQNHYASNLVDEKRKQVLERWRHLKEGLIEKRSRLGD
cccccccccc     ccccccccccccccccccccccccccccccc
7777777777742220000000000000000000000000000000
8888888888863110000000022222222222222222222200
7777777777755552211110000000000000000000000000
9999999999977773322220044444444444444444444400

As an antiparallel three-helix bundle, spectrin is already fairly far removed from the reference set of parallel two-stranded structures that is used for scoring. Accordingly, as with four-helix bundles, the program has problems identifying all the helices in the structure. While this does not make the prediction of helix B as a coiled coil incorrect, it makes it rather useless and indeed misleading for model-building. In the long run, scoring matrices that are specific for helical bundles should be the answer, but my experiments with a matrix derived from four-helix bundles (Paliakasis & Kokkinidis, Prot. Eng. 5:739) show that the ones currently available have little predictive power. Even in the absence of such matrices, the prediction can be improved significantly using the auxiliary programs ALIGNED20/80 if homologous sequences are available for a protein. Their application to spectrin is shown in the documentation file ALIGNED.DOC.

One of the specific problems of the program with helix A of spectrin is the presence of Trp and Phe residues in position g of the heptad repeat. These residues are very rare at that position in both two-stranded and three-stranded coiled coils. Such residues can occur, or even be important, in certain structures even though they are disfavored in most others. It is therefore recommended that a protein with a single peak also be analyzed with all rare residues (W, C, P) replaced by Ala. The emergence of more peaks indicates the presence of a helical bundle. Also, if a protein that one suspects may form a helical bundle has a peak that occurs only in a 14 residue scan, one should check whether replacing a single unfavorable residue (e.g. D in a) by Ala greatly lengthens the predicted helix or significantly raises its score. Such "wrong" residues may actually help to build a model, since their presence needs to be accounted for and limits the possibilities.
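
The substitution test described above is easy to prepare by script. The following is a minimal sketch in Python, assuming the sequence is held as a plain one-letter string. The function names are hypothetical and not part of COILS; the default residue set (W, C, P) is the one given in the text. The modified sequence still has to be run through COILS again and the two profiles compared.

```python
def mask_rare_residues(sequence, rare="WCP", replacement="A"):
    """Return a copy of `sequence` with every residue in `rare` replaced.

    Hypothetical helper, not part of COILS: it only prepares the
    modified sequence for a second scan.
    """
    rare_set = set(rare.upper())
    return "".join(replacement if aa.upper() in rare_set else aa
                   for aa in sequence)


def replace_at(sequence, position, replacement="A"):
    """Replace the single residue at a 1-based `position` (e.g. a D in a)."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position outside the sequence")
    return sequence[:position - 1] + replacement + sequence[position:]


# First 30 residues of the spectrin fragment shown above.
spectrin_start = "NLDLQLYMRDCELAESWMSAREAFLNADDD"
masked = mask_rare_residues(spectrin_start)
```

Note that Phe is not in the default set; to test the Phe residue mentioned above, pass rare="WCPF" or use replace_at on the position in question.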

(C5) other antiparallel helical bundles

>ApoE (Science 252:1817)
GQRWELALGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQL
 ccccccccccccccccccc hhhhhhhhhhcccccccccccccccccccccccccccc
000000000000000000000000000000013379999999999999999999999999
000000000000000000000000000000026699999999999999999999999999
000000000000000000000000000000001129999999999999999999999999
000000000000000000000000000001689999999999999999999999999999
TPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLR
    cccccccccccccccccccccccccccccccccccc       ccccccccccccc
818999999999999999999999533331111000000000000111111114478999
889999999999999999999999733330000000000000000444444445589999
959999999999999999999999444441111111111111000333333336689999
999999999999999999999999433331111111111110011888888888899999
KLRKRLLRDADDLQKRLAVYQAGA
cccccccccccccccccccccc
999999999999999999988877
999999999999999999999855
999999999999999999999999
999999999999999999999999

The prediction for ApoE is good for the three-stranded part but much poorer for the four-stranded part: the short N-terminal helix 1 is not seen by the program, partly because of its length but mostly because of the three Trp residues, and the C-terminus of helix 3 and the N-terminus of helix 4 which interact with helix 1 also obtain low scores. This brings me to:

D. Limits of the method

As can be seen from the examples given, the program works well for parallel two-stranded structures that are solvent-exposed but runs progressively into problems with the addition of more helices, their antiparallel orientation and their decreasing length. The program fails entirely on buried structures. Limits are also set by the statistical noise, which greatly decreases the usefulness of small scanning windows. Finally, the possibility that sequences with good coiled-coil potential do not form a coiled coil because of constraints from other parts of the sequence may add a further limit to the accuracy of the program.

Because many reasons can lead the program to miss a helix while the conditions for detection are quite stringent, the absence of a peak is not nearly as conclusive as the presence of a peak. The effect of this on the interpretation of scores from multiple alignments is discussed in ALIGNED.DOC. What I believe one can conclude safely from the absence of a peak is that no solvent-exposed two- or three-stranded coiled coil of length greater than approximately 20 residues is present in the protein.


7. RECOMMENDATIONS FOR USING THE PROGRAM

COILS is specific for solvent-exposed, left-handed coiled coils. Other types of coiled-coil structure, such as buried coiled coils (e.g. the central coiled coil in catabolite repressor protein, or some transmembrane domains) and right-handed coiled coils, are not detected by the program.

COILS does not reach yes-or-no decisions based on a threshold value. Rather, it yields a set of probabilities that presumably reflect the coiled-coil forming potential of a sequence. This means that even at high probabilities (e.g. >90%), there will be (and should be) sequences that in fact do not form a coiled coil, though they may have the potential to do so in a different context.

COILS is biased towards hydrophilic, highly charged sequences. For this reason, all scans should be performed with a weighted and an unweighted matrix, and the results compared. Differences of more than 20-30 percentage points in the probabilities should be taken to indicate that a coiled-coil structure is unlikely, the elevated scores being mainly due to the high incidence of charged residues (note, though, that this would have marked human mannose-binding protein as a false positive).

The MTK and MTIDK matrices both assign high probabilities to known coiled-coil segments, but identify different helices at high probability in a database of globular proteins. This is a surprising feature whose reason is as yet unclear, but which can be exploited for predictive purposes. It is therefore useful to compare the results of scans made with the two matrices. Again, differences of more than 20-30 percentage points in the probabilities should be taken to indicate that a coiled-coil structure is unlikely (note, though, that this threshold would make the replication terminator protein a borderline case).
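
Both comparisons (weighted against unweighted, and MTK against MTIDK) amount to the same check on two per-residue probability profiles. The sketch below is in Python and assumes the two profiles have already been read from the COILS output into lists of fractions between 0 and 1. The function name and the default threshold of 0.25 (a midpoint of the 20-30 point range given in the text) are assumptions made for illustration, not part of COILS.

```python
def flag_discrepancies(probs_a, probs_b, threshold=0.25):
    """Return 1-based positions where two profiles differ by more than `threshold`.

    `probs_a` and `probs_b` are per-residue probabilities (0.0 to 1.0)
    from two scans of the same sequence, e.g. weighted vs. unweighted.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("profiles must cover the same residues")
    return [i for i, (a, b) in enumerate(zip(probs_a, probs_b), start=1)
            if abs(a - b) > threshold]


# Invented numbers for illustration only, not real COILS output.
weighted = [0.95, 0.90, 0.40, 0.10]
unweighted = [0.97, 0.92, 0.85, 0.12]
suspect = flag_discrepancies(weighted, unweighted)
```

Flagged positions are only a prompt for closer inspection; as the mannose-binding protein and replication terminator protein examples show, a fixed threshold can mislabel genuine coiled coils.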

The resolution between globular and coiled-coil score distributions decreases strongly with a decreasing size of the scanning window. The prediction of new coiled-coil segments should therefore be made using a 28 residue window, or in special cases a 21 residue window. 14 residue windows should normally be reserved for the analysis of local parameters (such as the frame) in known or predicted coiled coils.

The ends of coiled-coil segments appear to be most accurately identified in a 21 residue window. In general, I assume that residues with probabilities >50% are part of a coiled-coil segment.
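
The >50% rule above can be applied mechanically to a per-residue profile. A minimal sketch in Python, assuming the probabilities from a 21 residue window scan are available as a list of fractions between 0 and 1; the function name is hypothetical and not part of COILS.

```python
def call_segments(probs, cutoff=0.5):
    """Return (start, end) pairs, 1-based and inclusive, for every run of
    consecutive residues whose probability exceeds `cutoff`."""
    segments = []
    start = None
    for i, p in enumerate(probs, start=1):
        if p > cutoff:
            if start is None:
                start = i          # a new segment opens here
        elif start is not None:
            segments.append((start, i - 1))  # segment ended at previous residue
            start = None
    if start is not None:          # profile ends inside a segment
        segments.append((start, len(probs)))
    return segments


# Invented numbers for illustration only, not real COILS output.
profile = [0.1, 0.6, 0.9, 0.9, 0.4, 0.7, 0.8]
segments = call_segments(profile)
```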

Sequences with high coiled-coil probability from globular proteins rarely exceed a length of 30 residues. None is longer than 35 residues. Sequences with probabilities >80-90% that extend for more than 35 residues are therefore more likely to assume a coiled-coil structure than is indicated by the obtained probabilities.
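
This length criterion can also be checked by script. The sketch below, in Python, lists the stretches of a profile that stay above a chosen probability for more than 35 residues. It assumes the profile is a list of fractions between 0 and 1; the function name and the default of 0.8 (the lower end of the 80-90% range in the text) are illustrative assumptions, not part of COILS.

```python
from itertools import groupby


def long_high_runs(probs, min_prob=0.8, max_globular_length=35):
    """Return (start, end, length) for runs of residues with probability
    above `min_prob` that are longer than `max_globular_length`.
    Positions are 1-based and inclusive."""
    runs = []
    position = 1
    # groupby splits the profile into alternating high/low stretches.
    for is_high, group in groupby(probs, key=lambda p: p > min_prob):
        length = len(list(group))
        if is_high and length > max_globular_length:
            runs.append((position, position + length - 1, length))
        position += length
    return runs


# Invented numbers for illustration only, not real COILS output.
profile = [0.1] * 5 + [0.95] * 40 + [0.2] * 5
long_runs = long_high_runs(profile)
```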

Where possible, sequences related to the protein of interest should also be analyzed for predicted coiled-coil segments. It should be kept in mind, though, that the sequences must be related in the region of high scores in order for the comparison to be significant.

Comparison of the coiled-coil prediction with predictions of the secondary structure is generally useful, particularly if multiple related sequences are available.