Question 1: PS50235 corresponds to a matrix (see 'Entry type: MATRIX') and is therefore a profile. Patterns are not matrices but regular expressions.
Question 2: no, the profile has no false positives or false negatives (see the Precision and Recall lines).
Question 3: just read the documentation (link "Prosite documentation").
Question 4: yes, there are 2 patterns associated with the profile.
Question 5: the protein contains the PS50235 profile and the 2 associated patterns. There is also a match to the GLU_RICH profile, but with a low score, and a match to the ATPASE_ALPHA_BETA pattern, which is unlikely to be a true positive.
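As a reminder for Question 2, the Precision and Recall lines follow the usual definitions based on true positives, false positives and false negatives. A minimal Python sketch, where the counts are made up for illustration only and are not the actual numbers reported for PS50235:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported hits that are true members of the family."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of true family members that are detected."""
    return tp / (tp + fn)


# Hypothetical counts: no false positives and no false negatives.
tp, fp, fn = 100, 0, 0
print(precision(tp, fp), recall(tp, fn))
```

With no false positives and no false negatives, both values are 1.0, which is what the entry reports for this profile.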
Given the following MSA:
Seq1 WFFKGIADKDAERHLLA
Seq2 WFFKNLEQKDAEARLLA
Seq3 WFFKR---KDAERQLLA
Seq4 WFFGTI---DAERQLLA
Seq5 WFFKDIPTKDAERQLLA
Seq6 WYFG----RESERLLLA
Seq7 WYFGKIPLKDAERQLLA
Seq8 WYFGKLRAKDTERLLLL
A possible pattern is the following one, but this is not the only solution!
W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]
A search of SWISS-PROT with this pattern returns 14 matches, all annotated as tyrosine-protein kinases.
The pattern finds no false positives in a reversed SWISS-PROT.
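To check a candidate pattern by hand, it can be translated into a regular expression and run against the ungapped sequences of the alignment. The following Python sketch does this for the pattern above. The helper name prosite_to_regex is made up for this example, and it only covers the syntax used here (residues, [..], {..}, x, and repetitions), not anchors. Reversing the sequences at the end is a small-scale imitation of the reversed-database test, not a replacement for it.

```python
import re

# One PROSITE element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repetition (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Assumption: the anchors '<' and '>' are not handled in this sketch.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            residue = "."  # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # any amino acid except these
        if low is not None:
            residue += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(residue)
    return "".join(parts)


ALIGNED = [
    "WFFKGIADKDAERHLLA", "WFFKNLEQKDAEARLLA", "WFFKR---KDAERQLLA",
    "WFFGTI---DAERQLLA", "WFFKDIPTKDAERQLLA", "WYFG----RESERLLLA",
    "WYFGKIPLKDAERQLLA", "WYFGKLRAKDTERLLLL",
]

regex = re.compile(
    prosite_to_regex("W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]")
)

# Patterns are matched against ungapped sequences, so drop the alignment gaps.
sequences = [s.replace("-", "") for s in ALIGNED]
print("all sequences match:", all(regex.search(s) for s in sequences))

# Reversed sequences act as decoys: a hit there cannot be biologically meaningful.
print("any reversed sequence matches:", any(regex.search(s[::-1]) for s in sequences))
```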
Here is a possible pattern for the second set of sequences:
[ED]-R-x(2)-R
This pattern returns false positives (in a random database). Patterns of this kind, although useful, require additional evidence to be validated.
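The reason is easy to see with a rough estimate of how often [ED]-R-x(2)-R occurs by chance. The Python sketch below assumes that all 20 amino acids are equally frequent and that positions are independent, which is a simplification (real databases have uneven composition). The database size of 200,000 residues is an arbitrary value chosen for the example.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# [ED]-R-x(2)-R: two allowed residues, then R, two free positions, then R.
# Assumption: uniform amino acid frequencies and independent positions.
p_match = (2 / 20) * (1 / 20) * 1 * 1 * (1 / 20)


def count_hits(sequence: str) -> int:
    """Count occurrences of [ED]-R-x(2)-R, including overlapping ones."""
    return sum(1 for _ in re.finditer(r"(?=[ED]R.{2}R)", sequence))


random.seed(0)  # fixed seed so the run is reproducible
n_residues = 200_000  # arbitrary size for this illustration
random_db = "".join(random.choices(AMINO_ACIDS, k=n_residues))

# A pattern of length 5 has n_residues - 4 possible start positions.
expected = p_match * (n_residues - 4)
print(f"probability per position: {p_match:.6f}")
print(f"expected about {expected:.0f} hits, observed {count_hits(random_db)}")
```

Under these assumptions the pattern matches about once every 4,000 positions, so chance hits are expected in any database of realistic size.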
The number of hits decreases if patterns with a high probability of occurrence are excluded.
The masked patterns are characteristically short and/or degenerate, which results in a large number of hits. Matches to these patterns must be considered very preliminary information, and other information must be used to confirm the observations (biological information, bench experiments, sequence environment, ...).
A search of a random database with a pattern that has a high probability of occurrence returns a number of matches. This indicates that a pattern match alone is not reliable and that other information is required to validate the result in a real sequence.
MEME finds 3 possible motifs (see section DATABASE AND MOTIFS of the third mail):
MOTIFS (peptide)
MOTIF  WIDTH  BEST POSSIBLE MATCH
-----  -----  -------------------
    1     20  LWNHPWFHGKIPREEAEAIL
    2      9  DGTFLVRES
    3     50  AKAKYDFCARDDDELSFKRGDIIKILNKKCDQGWWKGEINGKGGWFPKNY
Motifs 1 and 2 occur together in all 6 sequences, while motif 3 is restricted to only 2 sequences (in one of them it is repeated).
The motifs described by MEME correspond to 2 protein domains: motif1 + motif2 = SH2, motif3 = SH3.
Pfam and Prosite return similar results. Unsure matches with a low score are marked with status:? by the pfscan server. InterPro returns a much larger set of results, because a full search is done against a number of databases (Pfam, Prosite, Smart, ProDom, PRINTS, TIGRfam).
By reading the documentation of each domain, it is possible to infer that the protein is implicated in post-transcriptional gene silencing (RNAi), probably in the degradation of double-stranded RNA.
Question 1: the N-term region shows homology to sequences found with BLAST. Moreover, the central and C-term regions of the protein have a strong coiled-coil signal.
Question 2: PSI-BLAST converges in both cases, provided the right sequences are selected at each cycle.
Question 3: the Profile is much more sensitive than PSI-BLAST.
Question 4: not much is found in Prosite and Pfam, ... yet.
There are two distinct regions at the N-term and C-term of the protein, which could be protein domains.
After a few rounds we found good homology with the ABC transporter family involved in a multicomponent binding-protein-dependent transport system for glycine betaine/L-proline (for an example see PROV_SALTY).
Looking at the alignment between PROV_SALTY and myseq (in the alignment region of the PSI-BLAST), it looks like the domains are swapped.
Without the low complexity filter we obtain a completely different result with NCBI PSI-BLAST. The homology we find is with collagen. This is because there is a bias in the amino acid composition of collagen proteins, which matches our low complexity region at the center of the protein.
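To get a feeling for what a compositionally biased region looks like, one can compute the Shannon entropy of the residue composition in a sliding window. The Python sketch below is only an illustration of the idea and is not the filter used by NCBI PSI-BLAST. The window width, the threshold and both example sequences are made-up values.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(sequence: str, width: int = 12, threshold: float = 2.2):
    """Return the start positions of windows whose entropy is below the threshold.

    width and threshold are arbitrary illustrative values, not tuned settings.
    """
    return [
        i
        for i in range(len(sequence) - width + 1)
        if shannon_entropy(sequence[i:i + width]) < threshold
    ]


# Hypothetical sequences, made up for this example.
collagen_like = "GPPGAPGPPGPPGAPGPP"  # Gly-X-Y style repeat, few distinct residues
mixed = "MKTAYIAKQRQISFVKSHFSRQ"      # varied composition

print("collagen-like:", low_complexity_windows(collagen_like))
print("mixed:", low_complexity_windows(mixed))
```

Every window of the repeat-like sequence falls below the threshold, while none of the windows of the varied sequence does. Regions of the first kind can match each other on composition alone, which is how spurious hits such as the collagen ones arise.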
Any questions? Mail Lorenzo Cerutti.