match.py: 15 points
One of the many important collaborations between computer science and biology is the study of protein sequences (in an area of research called bioinformatics). As you may know, proteins are chains of molecules called amino acids that make up important building blocks for organic life. Human cells can make 20 different amino acids, each of which is typically represented by a single letter code. We can therefore define any protein as a specific sequence of amino acids. This sequence determines the properties of the protein, including its 3D structure.
Identifying Proteins
When a new protein is found, we can guess its functionality by comparing it to proteins we have already studied. Specifically, we can check if the protein contains certain markers common to a known class of proteins. For example (and an entirely bogus example at that), suppose we discover a new protein, that we’ve named Duane, with the following amino acid sequence:
STTECQLKDNRAWTSLFIHTGHTECA
We may also suspect that Duane might belong to one of two possible classes of proteins: Spiffs and Blorts. Most Spiffs contain the pattern TECQRKMN
or at least something close to it. That is, most of the amino acid sequences in the class of Spiff proteins have the subsequence TECQRKMN
with only a few of the letters changed. Blorts, meanwhile, contain the pattern ALFHHTTGT
, or something very similar.
Now let’s try to compare the marker for Spiffs and the marker for Blorts against the amino acid sequence for Duane. We can see that Duane contains the pattern TECQLKDN
which only has 2 mismatches from the Spiffs marker TECQRKMN
(the errors are marked with a ^ below).
STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^ ^
On the other hand, the closest pattern to the Blort marker has 4 mismatches:
STTECQLKDNRAWTSLFIHTGHTECA
ALFHHTTGT
^ ^ ^^
In this case, we can deduce that Duane is most likely a Spiff! Pretty spiffy, eh?!
In this part of the lab, we will be writing a program called match.py
that takes a given protein’s amino acid sequence and determines how closely the protein matches the markers for different classes of proteins.
In particular, the match.py
program has two global variables already defined:
PROTEIN = “STTECQLKDNRAWTSLFIHTGHTECA”
which is the protein sequence from the example above, and
MARKERS = [“TECQRKMN”, “ALFHHTTGT”, “TTECQ”, “HT”, “ZZZ”, “TTZZZRAWT”]
which is a list of strings representing the markers of six classes of proteins (including the two from the example above).
Pay close attention to the global variable names above. Notice that they are both capitalized. It will be our convention to use capitalized variable names for global variables. This will help us keep track of them in our code and avoid issues with scope.
To determine whether part of PROTEIN
’s amino acid sequence closely matches each of the strings in MARKERS
, we need to think about all of the ways we can align a marker with the protein sequence. For instance, using the marker TECQRKMN
as an example, we could align it with the start of the PROTEIN
(so the starting position is 0):
STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^^^^^^^^
which has errors in all 8 characters of the alignment. We could also shift the marker by one character (i.e., starting position = 1) to get a different alignment:
STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^^^^^^^
which has errors in 7 characters of the alignment (only the first character was correct). We could again shift the marker by two characters (i.e., starting position = 2) to get another alignment:
STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^ ^
which has errors in only two characters. And the final alignment is when the marker is shifted all the way to the right (i.e., starting position = 18):
STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^^^^^^^^
which also has errors in all 8 characters of the alignment.
Your program should compare each marker in the MARKERS
list to all of the possible alignments in PROTEIN
where the marker could start. For each marker, the program should print both (1) the starting position in PROTEIN
of the alignment that provided the closest match, and (2) how many errors were found in that particular alignment. If there are is a tie in the number of matches at two positions, it is acceptable to either return one of the positions, or provide us with all of the positions with the closest matches as a combination. The output of your program should look something like this:
Marker TECQRKMN has 2 errors at starting position 2.
Marker ALFHHTTGT has 4 errors at starting position 14.
Marker TTECQ has 0 errors at starting position 1.
Marker HT has 0 errors at starting position 18.
Marker ZZZ has 3 errors at starting position 0.
Marker TTZZZRAWT has 5 errors at starting position 5.
ReadMe
If you are stuck, take a look at your worksheet for guidance!
Reminder
As you make progress on your program, don’t forget to commit and push your changes regularly!