CSCI 150: PreLab 5

Strings
Due: 9AM on Wednesday, October 1st

In this prelab you will formulate some of the ideas necessary to complete Lab 05. Please turn in your solution on Gradescope. You can submit a PDF (e.g., by saving a Word document as a PDF), a photo (e.g., taken with a smartphone), or a scan (e.g., made at the library). Please remember, no late prelabs allowed!

Reading

Read this article about Growth Mindset.

1. Did you do the reading? (Yes or no will suffice.)

Mind Mastery

Mastermind is a neat (although oftentimes frustrating) puzzle game. It works something like this: There are two players. One player is the codemaker (your program), the other is the codebreaker (the user). The codemaker chooses a sequence of four colored pegs, out of a possible six colors (red, blue, green, yellow, orange, and purple). He may repeat colors and place them in any order he wishes. This sequence is hidden from the codebreaker. The codebreaker has 10 chances to guess the sequence. The codebreaker places colored pegs down to indicate each of her guesses. After each guess, the codemaker is required to reveal certain information about how close the guess was to the actual hidden sequence.

Describe the Problem:
The problem you will solve on your lab is as follows.
input: repeatedly get guesses from the user, until they either guess the code, or run out of guesses.
goal: generate a random code, and correctly provide the user with feedback on their guesses.
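
As a rough illustration of the "generate a random code" part of the goal, here is a minimal Python sketch. It assumes each color is represented by its first letter (as in the examples later in this prelab); the names COLORS, CODE_LENGTH, and make_code are hypothetical and not part of the lab.

```python
import random

# Assumption: one letter per color (red, blue, green, yellow, orange, purple).
COLORS = "RBGYOP"
CODE_LENGTH = 4


def make_code():
    """Return a random code such as 'RYGY'; colors may repeat."""
    return "".join(random.choice(COLORS) for _ in range(CODE_LENGTH))


if __name__ == "__main__":
    print(make_code())
```

Because each position is chosen independently, repeated colors are possible, which matches the rules described above.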


Understand the Problem:
The trickiest part of this game is determining how to provide feedback on the codebreaker's guesses. In particular, next to each guess that the codebreaker makes, the codemaker places up to four clue pegs. Each clue peg is either black or white. Each black peg indicates a correct color in a correct spot. Each white peg indicates a correct color in an incorrect spot. No indication is given as to which clue peg corresponds to which position in the guess.

For example, suppose that the code is RYGY (red yellow green yellow). Then the guess GRGY (green red green yellow) would cause the codemaker to put down 2 black pegs (since the pegs in positions 3 and 4 are correct) and 1 white peg (since red is in the code, but the red peg is out of place). Note that no clue peg is given for the green in position 1 even though there is a green in the code; this is because that green has already been "counted" (a black peg was given for it).

As another example, again using RYGY as our code, the guess YBBB would generate 1 white peg and 0 black; yellow appears twice in the code, but the guess only contains one yellow peg. Likewise, for the guess BRRR, only 1 white peg is given; there is an R in the code, but only one.
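
The black-peg rule (correct color in the correct spot) can be sketched in Python as below. This is only an illustration, assuming the code and the guess are equal-length strings of color letters; count_black is a hypothetical name. White pegs are deliberately left out, since working those out is the point of the questions that follow.

```python
def count_black(code, guess):
    """Count positions where the guess has the right color in the right spot."""
    return sum(1 for c, g in zip(code, guess) if c == g)


if __name__ == "__main__":
    # The worked example above: code RYGY, guess GRGY.
    print(count_black("RYGY", "GRGY"))  # 2
```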

Check here for an online graphical version of the game.

2. Assuming the code is RYGY, fill in the appropriate number of black and white pegs for each guess.

guess   black pegs   white pegs
YYYY    ____         ____
YRYR    ____         ____
BBPO    ____         ____
PGYR    ____         ____
YYYG    ____         ____
RYGY    ____         ____

3. Consider the following algorithmic approach for calculating the number of white pegs to be awarded for a given guess.
   set a white counter to 0
   loop through the four positions of the guess
      loop through the four positions of the code
         if the current code character matches the current guess character
         and the guess and code positions are not the same, 
         increment the white counter and exit the inner loop
What goes wrong with this algorithm? Give an example of a code, a guess, the value that should be generated and the value that this pseudocode would generate.
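
If you would like to experiment, the pseudocode in question 3 can be transcribed fairly literally into Python as follows. This sketch assumes the code and the guess are four-letter strings; the function name is hypothetical. It reproduces the pseudocode as written, flaws included, so you can try your own codes and guesses against it.

```python
def white_count_as_written(code, guess):
    """Literal transcription of the question 3 pseudocode (not a correct solution)."""
    white = 0
    for i in range(len(guess)):        # loop through the positions of the guess
        for j in range(len(code)):     # loop through the positions of the code
            if code[j] == guess[i] and i != j:
                white += 1
                break                  # exit the inner loop
    return white


if __name__ == "__main__":
    print(white_count_as_written("RYGY", "BRRR"))
```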

4. Describe precisely how you would go about determining the appropriate number of white pegs to award. As before, your description should be in clear, unambiguous pseudocode.

Looking for a Match

As you may know, proteins are chains of molecules called amino acids. There are 20 amino acids, each of which is typically represented by a single letter, and any protein can be specified by its sequence of amino acids. This sequence determines the properties of the protein, including its 3D structure.

Left: A general amino acid (structure of R determines the particular amino acid).
Right: 3D structure of a protein. Image source: wikipedia.org.

When a new protein is found, one way in which we might attempt to guess the functionality of that protein would be to see if it contains certain markers common to a known class of proteins. For example (and an entirely bogus example at that), suppose we discover a new protein that we've named Duane, with the following amino acid sequence:

STTECQLKDNRAWTSLFIHTGHTECA

We may also suspect that Duane might belong to one of two possible classes of proteins: Splunkers and Munkatoos. As you well know, most Splunkers contain the pattern TECQRKMN or at least something close to it. That is, most of the sequences in the class of Splunker proteins have the subsequence TECQRKMN with only a few of the letters changed. Munkatoos, meanwhile, have the pattern ALFHHTTGT, or something very similar.

In this case, we can deduce that Duane is most likely a Splunker: Duane contains the pattern TECQLKDN, which has only 2 mismatches from TECQRKMN (the mismatches are marked with a ^ below).

TECQLKDN
TECQRKMN
    ^ ^ 

The closest pattern to the Munkatoo sequence is

SLFIHTGHT
ALFHHTTGT
^  ^  ^^ 

which has 4 mismatches.
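
The two comparisons above can be reproduced with a short Python sketch. It assumes the two strings being compared already have the same length; mismatch_marks is a hypothetical helper name. It does not search the protein for the best alignment, which is left for question 5.

```python
def mismatch_marks(window, pattern):
    """Return (number of mismatches, line with ^ under each mismatch)."""
    if len(window) != len(pattern):
        raise ValueError("both strings must have the same length")
    marks = "".join("^" if a != b else " " for a, b in zip(window, pattern))
    return marks.count("^"), marks


if __name__ == "__main__":
    count, marks = mismatch_marks("TECQLKDN", "TECQRKMN")
    print("TECQLKDN")
    print("TECQRKMN")
    print(marks)
    print(count, "mismatches")
```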



Describe the Problem:
The problem you will be solving on your lab is as follows.
input: a file that contains a string s representing a protein sequence, along with some number of strings, each representing a marker sequence.
goal: for each marker sequence, find its best match in the protein sequence and report its location and the number of errors in the match.

Understand the Problem:
The file test.txt is in the format you should expect for your input. In particular, the first line will always contain the protein sequence. Following the protein sequence will be some number of marker sequences. For each of these sequences, you should report the location of the best match, and the number of errors at that location.

For example, the contents of test.txt are as follows:

STTECQLKDNRAWTSLFIHTGHTECA
TECQRKMN
ALFHHTTGT
TTECQ
HT
ZZZ
TTZZZRAWT
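
One possible way to read a file in this format is sketched below in Python. It assumes one sequence per line and skips blank lines (an assumption, not something the format guarantees); parse_input is a hypothetical name.

```python
def parse_input(lines):
    """Split input lines into (protein sequence, list of marker sequences)."""
    # Assumption: surrounding whitespace and blank lines carry no meaning.
    cleaned = [line.strip() for line in lines if line.strip()]
    if not cleaned:
        raise ValueError("expected the protein sequence on the first line")
    return cleaned[0], cleaned[1:]


if __name__ == "__main__":
    sample = ["STTECQLKDNRAWTSLFIHTGHTECA\n", "TECQRKMN\n", "HT\n"]
    protein, markers = parse_input(sample)
    print(protein)
    print(markers)
```

Since an open file can be iterated line by line, the same function could be called as parse_input(f) inside a "with open('test.txt') as f:" block.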

For this file your program should have something like the following output:

Example Output

Sequence 1 has 2 errors at position 2.
Sequence 2 has 4 errors at position 14.
Sequence 3 has 0 errors at position 1.
Sequence 4 has 0 errors at position 18.
Sequence 5 has 3 errors at position 0.
Sequence 6 has 5 errors at position 5.


Design an Algorithm:
In order to solve this problem, you will need to figure out how to find the best match between a single marker sequence and the original protein sequence.

5. Write pseudocode for an algorithm that finds the best fit for a given subsequence to a given protein. Assume you have a String or (if you'd prefer) a list of characters P (representing the protein) and a shorter String or list of characters S (representing the marker subsequence). Your pseudocode algorithm should find the location at which to align the subsequence with the protein so as to minimize the number of mismatches. You should print both the starting index of the best alignment and the number of mismatches at that location. If there are multiple best alignments, use the earliest appearing one.

For example, if P contains the characters

STTECQLKDNRAWTSLFIHTGHTECA

and S contains the characters

ALFHHTTGT

then your algorithm should print

Best match site: 14
Mismatches: 4

since the best match starts at index 14 of P (counting from 0) and has 4 mismatches, as discussed above.

Honor Code

If you followed the Honor Code in this assignment, write the following sentence attesting to the fact at the top of your homework.

I affirm that I have adhered to the Honor Code in this assignment.