Markov Chains

In the application part of this lab, we will be using a Markov chain model for the fun purpose of learning how to generate new text that mimics the style of existing text, which is an example of both machine learning and natural language processing. In this part of the warmup, we will introduce the idea of Markov chains and start playing with a simpler application that will eventually help you implement the last part of the lab.

Markov Chain Model

A Markov chain is a stochastic process that models changes of the world from one state to another. For example, an intelligent user interface might want to be able to predict what task a user is working on based on their interactions so far with the software itself (e.g., which menu items and icons they have clicked on). Here, user behavior can be modeled as a sequence of interactions, and the likelihood of the next interaction depends on the most recent interaction(s) performed. Each task the user wants to perform (e.g., writing an email, downloading slides from Blackboard, playing a game online) have different common transitions from one interaction to the next.

A Markov chain models (1) the situations the world might be in as states (e.g., the user’s most recent interaction with the interface) and (2) the probability of the next state following the recent state(s) as transitions. These transitions can be represented as a map where the keys are the previous state(s) and next state and the values are integer counts (or double probabilities).

The model is learned by counting state transitions present in existing data (e.g., counting how often users performed one interaction after the others). Then the probabilities in the transitions can be used to estimate a likely next state, which we will be using in our application in this assignment.

Learning Transition Probabilities

As an initial real-world example, consider sequences of DNA from biology. DNA sequences are composed of only four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). For illustration, one DNA sequence of length 100 (generated here) is:

TCGGCGGACAGTTGTCCCCTGTCCTAGGTAACGGATAGCGAATTATTGAGAGACCACCACCAAGAGAGATGTCCGCACAGCTTATCCCATTATCACCACG

The order of these nucleotides is not uniform (i.e., equally random). Instead, the probability of each next character/nucleotide changes with the previous character/nucleotide. For instance, the next character A only occurred 11% of the time when the previous character was A but occurred 31% of the time when the previous character was C. A Markov chain can model the transitions of how likely each nucleotide (state) is to follow the previous.

To learn these transitions, we simply iterate over each current -> next pair of characters (e.g., T -> C, C -> G, G -> G) in the sequence and increase a count of how often the next character followed the previous.

Implementing a First Order Markov Model

We have provided you with a class called Transitions that represents the transition counts from one particular state into each of the possible next states. This class provides a addTransition(State nextState) method that takes in a next state and updates the appropriate transition count.

We have also provided you with a DNASequences.java file that stores all of the transition probabilities in a HashMap called this.transitions where the key is the current character in a DNA sequence and the value is the Transitions object, and the Transitions object stores the counts of all transitions into possible next characters for the current character.

Your first objective is to implement the trainCounts method of the DNASequences class to do the necessary counting for learning the transition function. Your method should:

  1. Loop over every current/next character pair in the given String sequence

  2. Get the appropriate Transitions<Character> object for the current character from the this.transitions HashMap

  3. Call the addTransition method of the Transitions object from Step 2 with the next character as the argument

Then, if you uncomment out the first 3 lines of the main method, you should be able to see the learned probabilities if you run the program. If your code is working, then you should see output like the following (note that these probabilities are different than before since we have a different DNA sequence):

A to A: 0.248
A to C: 0.196
A to T: 0.328
A to G: 0.228
C to A: 0.3025210084033613
C to C: 0.23949579831932774
C to T: 0.25210084033613445
C to G: 0.20588235294117646
T to A: 0.24642857142857144
T to C: 0.25357142857142856
T to T: 0.26071428571428573
T to G: 0.2392857142857143
G to A: 0.20346320346320346
G to C: 0.26406926406926406
G to T: 0.2813852813852814
G to G: 0.2510822510822511

Generating Random Sequences

Your second objective is to implement the generate method of the DNASequences class. Your method should:

  1. Create a new String that contains only the start character.

  2. Set a char current to be equal to start.

  3. Loop until the String you created in Step 1 has length characters in it. For each iteration of the loop, you should:

  • Get the Transitions<Character> object for the current character
  • Call the generateNextState method of the Transitions object to randomly generate a next character
  • Append next to the end of the String you created in Step 1
  • Set current = next to get ready for the next loop
  1. Return the String you created

This approach uses your learned transition counts to randomly pick next characters for a given current character and combines all the randomly chosen characters to create a new String with a given length.

If you uncomment out the last three lines of code in the main method, you should see a randomly generated DNA sequence of length 30 printed at the end of your output.

N-Order Markov Chains

Most commonly, we consider first-order Markov chains, where the next state (e.g., character/nucleotide) only depends on the single most recent state (e.g., character/nucleotide). This is what we just did in the warmup implemented above.

But in a more general n-order Markov chain, the next state (e.g., character/nucleotide) depends on a history of the most recent n states (e.g., a String of n previous characters). We will implement an n-order Markov chain in Part 6 of the lab

Committing Your Progress

Don’t forget to add/commit/push your changes to the DNASequence.java.