Introduction to Patterns, Profiles, and HMMs

1. Prosite database

Question 1: PS50235 corresponds to a matrix (see 'Entry type: MATRIX') and is therefore a profile. Patterns are not matrices but regular expressions.

Question 2: no, the profile has no false positives or false negatives (see the Precision and Recall lines).

Question 3: just read the documentation (link "Prosite documentation").

Question 4: yes, there are 2 patterns associated with the profile.

Question 5: the protein contains the PS50235 profile and the 2 associated patterns. There is also a match to the profile GLU_RICH, but with a low score, and a match to the pattern ATPASE_ALPHA_BETA, which is unlikely to be a true positive.

2. Build a pattern

Given the following MSA:


Seq1  WFFKGIADKDAERHLLA
Seq2  WFFKNLEQKDAEARLLA
Seq3  WFFKR---KDAERQLLA
Seq4  WFFGTI---DAERQLLA
Seq5  WFFKDIPTKDAERQLLA
Seq6  WYFG----RESERLLLA
Seq7  WYFGKIPLKDAERQLLA
Seq8  WYFGKLRAKDTERLLLL



One possible pattern is the following, but it is not the only solution!

W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]
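
The pattern above can be checked against the alignment by translating it into a regular expression and matching it against the ungapped sequences. The following is a minimal sketch: the helper `prosite_to_regex` is a hypothetical name introduced here, it only handles the pattern elements used in this exercise, and it translates the repetition on `[KR]` literally without assuming that every pattern-search tool accepts that notation.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, optionally
# followed by a repetition such as (2) or (0,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} = any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

PATTERN = "W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]"

# The MSA from this exercise; gaps are removed before matching.
aligned = {
    "Seq1": "WFFKGIADKDAERHLLA",
    "Seq2": "WFFKNLEQKDAEARLLA",
    "Seq3": "WFFKR---KDAERQLLA",
    "Seq4": "WFFGTI---DAERQLLA",
    "Seq5": "WFFKDIPTKDAERQLLA",
    "Seq6": "WYFG----RESERLLLA",
    "Seq7": "WYFGKIPLKDAERQLLA",
    "Seq8": "WYFGKLRAKDTERLLLL",
}

regex = re.compile(prosite_to_regex(PATTERN))
matches = {name: bool(regex.search(seq.replace("-", "")))
           for name, seq in aligned.items()}

for name, found in matches.items():
    print(name, "match" if found else "NO MATCH")
```

A pattern that fails on one of the input sequences is too strict; one that matches everything in a random database is too loose.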

A search of SWISS-PROT with this pattern returned 14 matches, all annotated as tyrosine-protein kinases.

This pattern finds no false positives in a reversed SWISS-PROT.

Here is a possible pattern for the second set of sequences:

[ED]-R-x(2)-R

This pattern returns false positives (in a random database). Patterns of this kind, although useful, require other evidence to be validated.
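
A rough calculation shows why such a short pattern matches by chance. The sketch below assumes that all 20 amino acids are equally frequent and independent, which real proteins are not, and uses a hypothetical database size of 1,000,000 residues purely for illustration.

```python
from fractions import Fraction

def per_position_probability(allowed_counts, alphabet_size=20):
    """Chance that a pattern matches at one given start position.

    allowed_counts lists, for each pattern position, how many residues
    are accepted there (20 for x). Assumes uniform, independent residues.
    """
    probability = Fraction(1)
    for allowed in allowed_counts:
        probability *= Fraction(allowed, alphabet_size)
    return probability

# [ED]-R-x(2)-R accepts 2, 1, 20, 20 and 1 residues at its five positions.
p = per_position_probability([2, 1, 20, 20, 1])

# Hypothetical database size, chosen only to make the numbers concrete.
database_residues = 1_000_000
expected_hits = p * database_residues

print(f"probability per position: {p} (about {float(p):.5f})")
print(f"expected chance hits in {database_residues} residues: {float(expected_hits):.0f}")
```

Under these assumptions the pattern is expected to match about once every 4000 residues, so hundreds of chance hits appear even in a small database.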

3. Search the Prosite pattern database

The number of hits decreases if patterns with a high probability of occurrence are excluded.

The masked patterns are typically short and/or degenerate, which results in a large number of hits. Matches to these patterns should be treated as very preliminary information, and other evidence must be used to confirm the observations (biological information, bench experiments, sequence context, ...).

A search of a random database with a pattern with a high probability of occurrence returns a series of matches. This indicates that the pattern match information alone is not reliable and other information is required to validate the result in a real sequence.
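
The effect can be reproduced on a small synthetic database. This is a sketch under stated assumptions: the random proteins use uniform amino acid frequencies, the database size (1000 sequences of 300 residues) and the seed are arbitrary, and the two regular expressions are hand translations of the patterns built in section 2.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_protein(length, rng):
    """Random sequence with uniform residue frequencies (a simplification)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def count_matching_sequences(regex, sequences):
    """Number of sequences containing at least one match to the regex."""
    compiled = re.compile(regex)
    return sum(1 for sequence in sequences if compiled.search(sequence))

rng = random.Random(0)  # fixed seed so the run is repeatable
database = [random_protein(300, rng) for _ in range(1000)]

# Short, degenerate pattern: [ED]-R-x(2)-R
short_hits = count_matching_sequences(r"[ED]R.{2}R", database)

# Long, specific pattern: W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]
long_hits = count_matching_sequences(
    r"W[FY]F[KG].{0,4}[KR]{0,1}[DE].E[RA].L{2}[AL]", database)

print("sequences hit by the short pattern:", short_hits)
print("sequences hit by the long pattern:", long_hits)
```

Every hit in such a database is a chance match by construction, so the count for the short pattern gives a feeling for how many hits in a real database could be noise.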

4. Build PSSMs with MEME

MEME finds 3 possible motifs (see section DATABASE AND MOTIFS of the third mail):

MOTIFS  (peptide)
MOTIF WIDTH BEST POSSIBLE MATCH
----- ----- -------------------
  1    20   LWNHPWFHGKIPREEAEAIL
  2     9   DGTFLVRES
  3    50   AKAKYDFCARDDDELSFKRGDIIKILNKKCDQGWWKGEINGKGGWFPKNY

Motifs 1 and 2 occur together in all 6 sequences, while motif 3 is found in only 2 sequences (in one of them it is repeated).

The motifs described by MEME correspond to 2 protein domains: motif1 + motif2 = SH2, motif3 = SH3.
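
To illustrate what a PSSM encodes, the sketch below builds a log-odds matrix from an ungapped block and scores candidate segments with it. It is not MEME's algorithm: MEME discovers motifs in unaligned sequences, whereas this code starts from a block that is already aligned. The block is the last 8 columns of the MSA from section 2, and the uniform background frequencies and the pseudocount of 1 are assumptions made for simplicity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, pseudocount=1.0):
    """Return one dict of log2-odds scores per column of an ungapped block."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    denominator = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / denominator / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum of the column scores for a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Last 8 columns of the section 2 alignment (no gaps in this region).
block = [
    "DAERHLLA", "DAEARLLA", "DAERQLLA", "DAERQLLA",
    "DAERQLLA", "ESERLLLA", "DAERQLLA", "DTERLLLL",
]

pssm = build_pssm(block)
print("score of DAERQLLA:", round(score(pssm, "DAERQLLA"), 2))
print("score of GGGGGGGG:", round(score(pssm, "GGGGGGGG"), 2))
```

Unlike a pattern, the matrix does not simply accept or reject a residue: conserved residues get high scores, rare ones get low scores, and the total score can be compared with a cut-off.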

4. Protein function discovery

Pfam and Prosite return similar results. Uncertain matches with a low score are marked with status:? by the pfscan server. InterPro returns a much larger result, because a full search is done against a number of databases (Pfam, Prosite, Smart, ProDom, PRINTS, TIGRfam).

By reading the documentation of each domain it is possible to infer that the protein is involved in post-transcriptional gene silencing (RNAi), probably in the degradation of double-stranded RNA.

5. Protein domain hunting

Question 1: the N-terminal region shows homology to sequences found with BLAST. Moreover, the central and C-terminal regions of the protein have a strong coiled-coil signal.

Question 2: PSI-BLAST converges in both cases, provided the right sequences are selected at each iteration.

Question 3: the profile is much more sensitive than PSI-BLAST.

Question 4: not much is found in Prosite and Pfam, ... yet.

6. Protein function discovery

There are two distinct regions, at the N-terminus and the C-terminus of the protein, which could be protein domains.

After a few rounds we found good homology with the ABC transporter family involved in a multicomponent binding-protein-dependent transport system for glycine betaine/L-proline (for an example, see PROV_SALTY).

Looking at the alignment between PROV_SALTY and myseq (in the alignment region of the PSI-BLAST), it looks like the domains are swapped.

Without the low-complexity filter we obtain a completely different result with NCBI PSI-BLAST: the homology we find is with collagen. This is because collagen proteins have a biased amino acid composition, which matches the low-complexity region at the center of our protein.
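
The idea behind a low-complexity filter can be sketched with a sliding-window Shannon entropy. This is not the filter used by NCBI BLAST; it is a simplified stand-in, and the window length, the threshold, and the two test sequences are hypothetical values chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Entropy in bits of the residue composition of a segment."""
    total = len(segment)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(segment).values())

def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Start positions of windows whose entropy falls below the threshold."""
    return [start
            for start in range(len(sequence) - window + 1)
            if shannon_entropy(sequence[start:start + window]) < threshold]

# Illustrative sequences, not taken from the exercise data.
repeat_like = "GPPGPPGPPGPP"   # biased composition, as in collagen-like repeats
diverse = "ACDEFGHIKLMN"       # twelve different residues

print(round(shannon_entropy(repeat_like), 3), low_complexity_windows(repeat_like))
print(round(shannon_entropy(diverse), 3), low_complexity_windows(diverse))
```

Regions flagged this way are masked before the search, so that hits caused only by a shared compositional bias do not dominate the result.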

Any questions? Mail Lorenzo Cerutti.