In the application part of this lab, we will be using a Markov chain model for the fun purpose of learning how to generate new text that mimics the style of existing text, which is an example of both machine learning and natural language processing. In this part of the warmup, we will introduce the idea of Markov chains and start playing with a simpler application that will eventually help you implement the last part of the lab.
A Markov chain is a stochastic process that models changes of the world from one state to another. For example, an intelligent user interface might want to be able to predict what task a user is working on based on their interactions so far with the software itself (e.g., which menu items and icons they have clicked on). Here, user behavior can be modeled as a sequence of interactions, and the likelihood of the next interaction depends on the most recent interaction(s) performed. Each task the user wants to perform (e.g., writing an email, downloading slides from Blackboard, playing a game online) have different common transitions from one interaction to the next.
A Markov chain models (1) the situations the world might be in as states (e.g., the user’s most recent interaction with the interface) and (2) the probability of the next state following the recent state(s) as transitions. These transitions can be represented as a map where the keys are the previous state(s) and next state and the values are integer counts (or double probabilities).
The model is learned by counting state transitions present in existing data (e.g., counting how often users performed one interaction after the others). Then the probabilities in the transitions can be used to estimate a likely next state, which we will be using in our application in this assignment.
As an initial real-world example, consider sequences of DNA from biology. DNA sequences are composed of only four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). For illustration, one DNA sequence of length 100 (generated here) is:
TCGGCGGACAGTTGTCCCCTGTCCTAGGTAACGGATAGCGAATTATTGAGAGACCACCACCAAGAGAGATGTCCGCACAGCTTATCCCATTATCACCACG
The order of these nucleotides is not uniform (i.e., equally random). Instead, the probability of each next character/nucleotide changes with the previous character/nucleotide. For instance, the next character A
only occurred 11% of the time when the previous character was A
but occurred 31% of the time when the previous character was C
. A Markov chain can model the transitions of how likely each nucleotide (state) is to follow the previous.
To learn these transitions, we simply iterate over each current -> next
pair of characters (e.g., T -> C
, C -> G
, G -> G
) in the sequence and increase a count of how often the next character followed the previous.
We have provided you with a class called Transitions
that represents the transition counts from one particular state into each of the possible next states. This class provides a addTransition(State nextState)
method that takes in a next state and updates the appropriate transition count.
We have also provided you with a DNASequences.java
file that stores all of the transition probabilities in a HashMap
called this.transitions
where the key
is the current character in a DNA sequence and the value
is the Transitions
object, and the Transitions
object stores the counts of all transitions into possible next characters for the current character.
Your first objective is to implement the trainCounts
method of the DNASequences
class to do the necessary counting for learning the transition function. Your method should:
Loop over every current/next
character pair in the given String sequence
Get the appropriate Transitions<Character>
object for the current
character from the this.transitions
HashMap
Call the addTransition
method of the Transitions
object from Step 2 with the next
character as the argument
Then, if you uncomment out the first 3 lines of the main
method, you should be able to see the learned probabilities if you run the program. If your code is working, then you should see output like the following (note that these probabilities are different than before since we have a different DNA sequence):
A to A: 0.248
A to C: 0.196
A to T: 0.328
A to G: 0.228
C to A: 0.3025210084033613
C to C: 0.23949579831932774
C to T: 0.25210084033613445
C to G: 0.20588235294117646
T to A: 0.24642857142857144
T to C: 0.25357142857142856
T to T: 0.26071428571428573
T to G: 0.2392857142857143
G to A: 0.20346320346320346
G to C: 0.26406926406926406
G to T: 0.2813852813852814
G to G: 0.2510822510822511
Your second objective is to implement the generate
method of the DNASequences
class. Your method should:
Create a new String that contains only the start
character.
Set a char current
to be equal to start
.
Loop until the String you created in Step 1 has length
characters in it. For each iteration of the loop, you should:
Transitions<Character>
object for the current
charactergenerateNextState
method of the Transitions
object to randomly generate a next
characternext
to the end of the String you created in Step 1current = next
to get ready for the next loopThis approach uses your learned transition counts to randomly pick next characters for a given current character and combines all the randomly chosen characters to create a new String with a given length
.
If you uncomment out the last three lines of code in the main
method, you should see a randomly generated DNA sequence of length 30 printed at the end of your output.
Most commonly, we consider first-order Markov chains, where the next state (e.g., character/nucleotide) only depends on the single most recent state (e.g., character/nucleotide). This is what we just did in the warmup implemented above.
But in a more general n
-order Markov chain, the next state (e.g., character/nucleotide) depends on a history of the most recent n
states (e.g., a String of n
previous characters). We will implement an n
-order Markov chain in Part 6 of the lab
Don’t forget to add
/commit
/push
your changes to the DNASequence.java
.