Protein Matching

match.py: 15 points


Background

One of the many important collaborations between computer science and biology is the study of protein sequences (in an area of research called bioinformatics). As you may know, proteins are chains of molecules called amino acids that make up important building blocks for organic life. Human cells can make 20 different amino acids, each of which is typically represented by a single letter code. We can therefore define any protein as a specific sequence of amino acids. This sequence determines the properties of the protein, including its 3D structure.

3D model of the insulin protein.

Identifying Proteins

When a new protein is found, we can guess its functionality by comparing it to proteins we have already studied. Specifically, we can check if the protein contains certain markers common to a known class of proteins. For example (and an entirely bogus example at that), suppose we discover a new protein, that we’ve named Duane, with the following amino acid sequence:

STTECQLKDNRAWTSLFIHTGHTECA

We may also suspect that Duane might belong to one of two possible classes of proteins: Spiffs and Blorts. Most Spiffs contain the pattern TECQRKMN or at least something close to it. That is, most of the amino acid sequences in the class of Spiff proteins have the subsequence TECQRKMN with only a few of the letters changed. Blorts, meanwhile, contain the pattern ALFHHTTGT, or something very similar.

Now let’s try to compare the marker for Spiffs and the marker for Blorts against the amino acid sequence for Duane. We can see that Duane contains the pattern TECQLKDN which only has 2 mismatches from the Spiffs marker TECQRKMN (the errors are marked with a ^ below).

STTECQLKDNRAWTSLFIHTGHTECA
  TECQRKMN
      ^ ^ 

On the other hand, the closest pattern to the Blort marker has 4 mismatches:

STTECQLKDNRAWTSLFIHTGHTECA
              ALFHHTTGT
              ^  ^  ^^

In this case, we can deduce that Duane is most likely a Spiff! Pretty spiffy, eh?!

Your Program

In this part of the lab, we will be writing a program called match.py that takes a given protein’s amino acid sequence and determines how closely the protein matches the markers for different classes of proteins.

In particular, the match.py program has two global variables already defined:

PROTEIN = “STTECQLKDNRAWTSLFIHTGHTECA”

which is the protein sequence from the example above, and

MARKERS = [“TECQRKMN”, “ALFHHTTGT”, “TTECQ”, “HT”, “ZZZ”, “TTZZZRAWT”]

which is a list of strings representing the markers of six classes of proteins (including the two from the example above).

Pay close attention to the global variable names above. Notice that they are both capitalized. It will be our convention to use capitalized variable names for global variables. This will help us keep track of them in our code and avoid issues with scope.

Alignments

To determine whether part of PROTEIN’s amino acid sequence closely matches each of the strings in MARKERS, we need to think about all of the ways we can align a marker with the protein sequence. For instance, using the marker TECQRKMN as an example, we could align it with the start of the PROTEIN (so the starting position is 0):

STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
^^^^^^^^

which has errors in all 8 characters of the alignment. We could also shift the marker by one character (i.e., starting position = 1) to get a different alignment:

STTECQLKDNRAWTSLFIHTGHTECA
 TECQRKMN
  ^^^^^^^

which has errors in 7 characters of the alignment (only the first character was correct). We could again shift the marker by two characters (i.e., starting position = 2) to get another alignment:

STTECQLKDNRAWTSLFIHTGHTECA
  TECQRKMN
      ^ ^ 

which has errors in only two characters. And the final alignment is when the marker is shifted all the way to the right (i.e., starting position = 18):

STTECQLKDNRAWTSLFIHTGHTECA
                  TECQRKMN
                  ^^^^^^^^

which also has errors in all 8 characters of the alignment.

Finding Closest Matches

Your program should compare each marker in the MARKERS list to all of the possible alignments in PROTEIN where the marker could start. For each marker, the program should print both (1) the starting position in PROTEIN of the alignment that provided the closest match, and (2) how many errors were found in that particular alignment. If there are is a tie in the number of matches at two positions, it is acceptable to either return the first of the two positions, or provide us with all of the positions with the closest matches as a combination. The output of your program should look something like this:

Marker TECQRKMN has 2 errors at starting position 2.
Marker ALFHHTTGT has 4 errors at starting position 14.
Marker TTECQ has 0 errors at starting position 1.
Marker HT has 0 errors at starting position 18.
Marker ZZZ has 3 errors at starting position 0.
Marker TTZZZRAWT has 5 errors at starting position 5.

ReadMe

If you are stuck, take a look at your worksheet for guideance!

Reminder

As you make progress on your program, don’t forget to commit and push your changes regularly!