10:00pm, Sunday, April 28
You may work with a partner on this assignment.
In this lab, you will use hash tables to generate pseudo-random text.
The purpose of this lab is to give you practice implementing and using hash tables.
Note that there is no starter code for this lab, but here are some sample text files you can use for testing.
In this lab you will design and implement an order k Markov model from a piece of input text. Sound scary? It isn't. Basically, we'll use these Markov model things to read in a sample text, then generate new random text that resembles the original. For example, consider this sample text:
Would you like them here or there? I would not like them here or there. I would not like them anywhere. I do not like green eggs and ham. I do not like them Sam-I-am. Would you like them on a house? Would you like them with a mouse?
Our Markov model will read in all of Dr. Seuss' fine work, then will generate random text in Seuss' style, such as
That Sam-I-am! That makes a story that needs a comb? No time for more, I'm almost home. I swung 'round the smoke-smuggered stars. Now all that cart! And THEN! Who was back in the dark. Not one little house Leaving a thing he took every presents! The whole aweful lot, nothing at all, built a radio-phone. I put in a house. I do not like them with a goat. I will not eat them off. 'Where willing to meet, rumbling like them in a box. I do not like them all! No more tone, have any fun? Tell me. What will show you. You do not like them, Sam-I-am! And you taking our Christmas a lot.
As you can see, our random text certainly resembles the original in spirit, although it may not make a whole lot of sense.
For this lab, you will be using a Markov model for the somewhat silly purpose of generating stylized pseudo-random text; however, Markov models have plenty of "real" applications in speech recognition, handwriting recognition, information retrieval, and data compression. (In fact, there is a whole course on such models in the math department, called Graphical Models: Math 348.)
Our Markov model is going to generate one character of our output at a time. In order to determine what this next character is, we look at the sample text to see what character is most likely to occur at this point in the text. To do that, we take the last few characters we generated and find the occurrences of that string in our sample text. Hopefully we'll find it a bunch of times, and from these occurrences we can figure out what character should occur next.
For example, suppose we have already generated the text "I do not like them,", and we want to determine the next character. Then, we may look in the sample text for all occurrences of the substring "ke them,", and we may find that the substring occurs 10 times: 7 times it is followed by "Sam-I-am", 2 times it is followed by "on a boat", and once it is followed by "on a house". Then, with 7/10 probability, the next character is an S, and with 3/10 probability it is an o.
Now if you think about it, the further back we look in the text, the more our generated text will resemble the original. However, looking farther back requires a lot more work and space, and produces less interesting text. So there are trade-offs to consider. The Markov model formalizes this notion as follows.
An order 0 Markov model looks in the sample text for the previous 0 characters of our generated text. That is, given an input text, you compute the Markov model of order 0 by counting up the number of occurrences of each letter in the input and using these as the frequencies. For example, if the input text is "agggcagcgggcg", then the order 0 Markov model predicts that each character is a with probability 2/13, c with probability 3/13, and g with probability 8/13. This has the effect of predicting that each character in the alphabet occurs with fixed probability, independent of previous characters.
Characters in English text are not independent, however. An order k Markov model looks back at the previous k characters in the generated text, and bases its prediction on that substring of length k. That is, given an input text, you compute a Markov model of order k by counting up the number of occurrences of each letter that follows each sequence of k letters. For example, if the text has 100 occurrences of th, with 50 occurrences of the, 25 occurrences of thi, 20 occurrences of tha, and 5 occurrences of tho, the order 2 Markov model predicts that the next character following th is e with probability 1/2, i with probability 1/4, a with probability 1/5, and o with probability 1/20.
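The counting described above can be sketched in a few lines of Java. This is not part of the required classes, just an illustration of the idea: for each k-character substring of the sample text, tally how often each next character follows it. (The class and method names here are my own, not part of the lab.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class OrderKCounts {
    // For every k-character substring of text, count how often each
    // character follows it. The last substring has no successor, so
    // the loop stops while a next character still exists.
    public static Map<String, TreeMap<Character, Integer>> count(String text, int k) {
        Map<String, TreeMap<Character, Integer>> model = new HashMap<>();
        for (int i = 0; i + k < text.length(); i++) {
            String key = text.substring(i, i + k);
            char next = text.charAt(i + k);
            model.computeIfAbsent(key, s -> new TreeMap<>())
                 .merge(next, 1, Integer::sum);
        }
        return model;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<Character, Integer>> m = count("agggcagcgggcg", 2);
        System.out.println(m.get("gc")); // gc is followed once by a, twice by g
    }
}
```

With the example string "agggcagcgggcg" and k = 2, the substring gc is followed by a once and g twice, matching the 1/3 and 2/3 probabilities used later in this lab.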
So this is how we generate text. The details will become clear later. Right now, let's get on with the show.
First you'll implement your own hash map with separate chaining in a class called MyHashMap&lt;K,V&gt;. You will build your hash table on top of an array of LinkedLists, with one linked list per "bucket".
The methods you will implement are a subset of the java.util.Map interface, but you won't actually implement the interface. You may not assume that keys implement Comparable, but, like all objects, they have an equals method.
You're going to need an array to store the buckets of your hash map. Because you're using separate chaining, each of these buckets will be a linked list of elements, in fact, they'll be a linked list of (key,value) pairs (since each element is really one such pair).
In order to store both the key and value of an element in a single linked list, you will need to create a MyEntry<K,V> class that represents a key-value pair. In this way, each bucket can be represented by a linked list of MyEntries.
So, step one is to create a MyEntry class (which can be declared inside the MyHashMap class; if nested in MyHashMap, you should drop the generics from the MyEntry declaration -- you know, the angle brackets -- and just use the K and V from MyHashMap's generics). The class should have a class variable key of type K, a class variable value of type V, and overridden hashCode() and equals() methods (so that they apply only to the key). You may use the key's hashCode and equals methods directly.
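The entry class described above might look like the following sketch. It is shown as a standalone generic class for readability; nested inside MyHashMap&lt;K,V&gt; it would drop its own angle brackets, as noted above.

```java
// A key-value pair for one element of a bucket's linked list.
// hashCode and equals delegate to the key only, as the lab requires.
class MyEntry<K, V> {
    K key;
    V value;

    MyEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public int hashCode() {
        return key.hashCode();  // uses the key's hashCode directly
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof MyEntry)) return false;
        return key.equals(((MyEntry<?, ?>) other).key);  // key only
    }
}
```

Because equality ignores the value, two entries with the same key compare equal even if their values differ, which is exactly what bucket lookups need.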
Now that you have the MyEntry class, you can create the following class members in your MyHashMap class:
You may also want constants for the default hashtable capacity (say, 11) and a default load factor (say, 0.75).
Java does not allow generic array creation directly, so you can allocate the bucket array with a cast:

    table = (LinkedList<MyEntry>[]) new LinkedList[capacity];

For emptying the table, you may find it handy to use the LinkedList.clear() method.
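Putting the pieces so far together, the fields and constructor might be organized as in the following sketch. This is one possible layout under the lab's suggestions, not the required one; the entry class is nested without its own generics, as described above.

```java
import java.util.LinkedList;

public class MyHashMap<K, V> {
    private static final int DEFAULT_CAPACITY = 11;
    private static final double DEFAULT_LOAD_FACTOR = 0.75;

    // Nested entry class: no angle brackets of its own, uses K and V
    // from MyHashMap's generics.
    class MyEntry {
        K key;
        V value;
        MyEntry(K key, V value) { this.key = key; this.value = value; }
    }

    private LinkedList<MyEntry>[] table;  // one linked list per bucket
    private int size;                     // number of (key, value) pairs

    @SuppressWarnings("unchecked")
    public MyHashMap(int capacity) {
        table = (LinkedList<MyEntry>[]) new LinkedList[capacity];
        for (int i = 0; i < capacity; i++)
            table[i] = new LinkedList<MyEntry>();
    }

    public MyHashMap() { this(DEFAULT_CAPACITY); }

    public int size() { return size; }

    public void clear() {
        for (LinkedList<MyEntry> bucket : table)
            bucket.clear();  // the LinkedList.clear() method
        size = 0;
    }
}
```

Initializing every bucket to an empty LinkedList in the constructor avoids null checks in every other method.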
Before continuing, you should test your MyHashMap class if you haven't already. That is, you should create a JUnit test class called MyHashMapTest.java and check the methods you have so far before continuing.
When the load factor is exceeded, grow the table in a resize() method. Attempts to store or look up null keys or values should generate a NullPointerException.
For searching within a bucket, you may find it useful to use the LinkedList.contains method. (You will probably need to make a MyEntry object to pass into this call.)
public Iterator<K> keys() {
    return new Iterator<K>() {
        int bucket = 0;
        Iterator<MyEntry> itr = table[bucket].iterator();
        int nextCount = 0;

        public boolean hasNext() {
            // can just check nextCount and size
        }

        public K next() {
            // if my hasNext() is false, I should throw a NoSuchElementException
            // while itr.hasNext() is false, increment bucket and get the next iterator
            // now increment nextCount and return the key from the item itr.next() returns
        }

        public void remove() {
            // just ask itr to remove, but I need to update my size and nextCount
        }
    };
}
11 23 47 97 197 397 797 1597 3203 6421 12853 25717 51437 102877 205759 411527 823117 1646237 3292489 6584983 13169977 26339969 52679969 105359939 210719881 421439783 842879579 1685759167

(You may find it helpful to write a method that takes an integer i and returns the first prime number in this list that is greater than i.)
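A minimal sketch of such a helper, assuming the prime list above is stored in an array (the class and method names here are my own):

```java
public class Primes {
    // Roughly doubling primes, suitable as hash table capacities.
    private static final int[] PRIMES = {
        11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717,
        51437, 102877, 205759, 411527, 823117, 1646237, 3292489, 6584983,
        13169977, 26339969, 52679969, 105359939, 210719881, 421439783,
        842879579, 1685759167
    };

    // Return the first prime in the list strictly greater than i.
    public static int nextPrime(int i) {
        for (int p : PRIMES)
            if (p > i) return p;
        throw new IllegalStateException("no larger prime in table");
    }
}
```

Calling nextPrime with the current capacity when resizing gives the next (roughly doubled) prime capacity.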
Be sure to test your hash table methods with JUnit tests before continuing. One good test would create a hash table of (String,Integer) pairs, and add the first 100 elements (""+i,i), printing out your hash table as you go along. Remove the elements afterwards, one-by-one.
Create a class Markov to represent a k-character substring. Ultimately, it will have a random method that returns a random character according to the Markov model. For now, just make it store the substring and an integer that counts the number of times the substring appears. You will need a constructor, a method to increment the frequency count, and the usual toString method for output.
Implement a program FrequencyCounter that reads the order parameter k of the Markov model from the command line and a text string from System.in, and uses a hash table to insert each k-character substring (key) from the text. For example, if k is 2 and the input string is "agggcagcgggcg", then your program should create Markov objects for each of the 5 distinct keys, and call the add method 12 times in total: ag gg gg gc ca ag gc cg gg gg gc cg. Maintain an integer count of the number of occurrences of each key.
Use your hash table's methods to print out the number of distinct keys and the
number of times each key appears in the text. For the example above, your
program should output (in some order):
5 distinct keys
2 ag
1 ca
2 cg
3 gc
4 gg
To generate random text, given a k-character key, your Markov objects must know all of the letters that follow the k-character key. This operation is at the crux of the matter, as you will need it to generate random characters in accordance with the Markov model.
Modify your Markov class so that, in addition to the frequency count, it records the breakdown by next letter. Create a class variable of type TreeMap&lt;Character,Integer&gt; to keep track of the suffix characters along with their frequencies (remember, you made your own MyTreeMap in lab 7). Add a method add(char c) to insert a suffix character. Then modify the toString() method so that it prints out the substring, the overall frequency count, and the list of suffixes, along the lines of the following example. You may also want to add other constructors or methods, as you see fit.
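One possible shape for the Markov class at this stage is sketched below. The field names (count, suffixes) are my own choices, and here add(char c) both increments the overall count and records the suffix; your design may split those differently.

```java
import java.util.Map;
import java.util.TreeMap;

public class Markov {
    private String substring;
    private int count;  // overall frequency of the substring
    private TreeMap<Character, Integer> suffixes = new TreeMap<>();

    public Markov(String substring) {
        this.substring = substring;
    }

    // Record one occurrence of the substring followed by suffix c.
    public void add(char c) {
        count++;
        suffixes.merge(c, 1, Integer::sum);
    }

    // e.g. "3 gc: 1 a 2 g"
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(count + " " + substring + ":");
        for (Map.Entry<Character, Integer> e : suffixes.entrySet())
            sb.append(" ").append(e.getValue()).append(" ").append(e.getKey());
        return sb.toString();
    }
}
```

A TreeMap keeps the suffixes in sorted order, so the toString output lists them alphabetically, matching the sample output below.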
Implement a program SuffixCounter, based on FrequencyCounter, that inserts keys into the hash table (if necessary) and calls add(char c) to add the appropriate suffix characters to the Markov model. It should produce the following output on the example input (you do not have to format your output exactly the same, but it should contain the same information in a reasonable layout).
5 distinct keys
2 ag: 1 c 1 g
1 ca: 1 g
1 cg: 1 g
3 gc: 1 a 2 g
4 gg: 2 c 2 g
You'll probably need to read up on the TreeMap operations, and if you choose to use its entrySet method, you will need to look at the Set and Map.Entry classes.
Note that since the last cg substring doesn't have a "next" character, we don't include it in the model.
Now add a method random to Markov that returns a pseudo-random character according to the language model. Be sure to get the probabilities right, as we will be checking this. (And it may take some thought to figure out how to translate the probabilities into characters.)
Now, create a class TextGenerator that takes as command line input an integer k, an integer M, and a filename file, and prints out M characters according to the order k Markov model based on file. You should start by printing the first k characters of the original text. Then, repeatedly generate successive pseudo-random characters.
The Java Scanner class is great for reading words and lines, but less useful for reading input one character at a time, as we are doing here. However, other Java classes predate Scanner, such as FileReader, which has a single-character read() method that might be useful for this lab. Note that read() returns an int because it uses -1 to indicate that there is no more input (instead of an exception or a hasNext()-like method). Non-negative return values can be treated just like a char.
FileReader input = null;
int nextChar;
try {
    input = new FileReader(inputFileName);
} catch (FileNotFoundException e) {
    System.err.println("Could not open file " + inputFileName + ": " + e.getMessage());
    System.exit(2);
}
try {
    while (-1 != (nextChar = input.read())) {
        char c = (char) nextChar;
        System.out.println("Hey, I read a character: " + c);
    }
} catch (IOException e) {
    System.err.println("Error reading from file " + inputFileName + ": " + e.getMessage());
    System.exit(4);
}
If you want to use Scanner, I'm told that some students have had success by changing the delimiter to be an empty string.
Scanner input = new Scanner(new File(filename));
input.useDelimiter(""); // now input.next() gives a single character
Using the example above, if the Markov object m represents the substring "gg", then m.random() should return c or g, each with probability 1/2. After you generate a character, move over one character position, always using the last k characters generated to determine the probabilities for the next. For example, if your program chooses c in the example above, then the next Markov object would represent the substring "gc", and according to the Markov model, the next character should be a with probability 1/3 and g with probability 2/3. Continue the process until you have output M characters. If the language model contains fewer than 100 k-tuples (prefixes), then print the language model (the keys, their suffixes, and counts) before you output the M randomly generated characters.
NOTE: If you are using line-at-a-time input (as opposed to using a FileReader which has a single character read() method), you should append a newline character at the end of the input line -- otherwise you won't have any in your output. Also, you should carry the last k characters from the previous line to the start of the next line. Finally, print a newline at the end of your text generation to clean up the appearance when the command prompt returns.
NOTE: If your final sequence of k characters does not appear anywhere else in your text, you may encounter a situation where a lookup in the table returns no Markov object. For example "ies" only appears at the end of "flippyfloppies". In this circumstance, you should just reset back to the original start string.
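The generation loop, including the dead-end reset just described, can be sketched as follows. Here "model" stands in for your hash table of Markov objects; a stub Markov with a fixed random() is included only so the sketch is self-contained, and the class and method names are my own.

```java
import java.util.HashMap;
import java.util.Map;

public class GenerateSketch {
    // Minimal stand-in for the real Markov class: random() returns a
    // fixed character so the loop itself can be demonstrated.
    static class Markov {
        char c;
        Markov(char c) { this.c = c; }
        char random() { return c; }
    }

    static String generate(Map<String, Markov> model, String start, int k, int M) {
        StringBuilder out = new StringBuilder(start);  // first k characters
        String key = start;
        while (out.length() < M) {
            Markov m = model.get(key);
            if (m == null) {       // dead end: reset to the original start string
                key = start;
                continue;
            }
            out.append(m.random());
            key = out.substring(out.length() - k);  // last k characters
        }
        return out.toString();
    }
}
```

With a three-key model over "abc"-style text, the loop cycles through the keys and emits characters until M are produced; when a key is missing, it falls back to the start string instead of crashing.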
You should test out your text generation with very simple inputs first, such as a file containing "flippyfloppies", and small values of k and M.
Once you get that working, you should try it on some of the files provided below. You will find that the random text starts to sound more and more like the original text as you increase the order, as illustrated in the examples below. As you can see, there are limitless opportunities for amusement here. Try your model on some of your own text, or find something interesting on the net.
Here are a few sample texts of interest: Dr. Seuss, Shakespeare, 1 million digits of Pi, Buffy the Vampire Slayer (Season 1), State of the Union addresses by Bush and Obama, 2011 GOP Primary Debates, My Little Pony: Friendship is Magic (Season One), Big Bang Theory (Series 1), Friends (Season 1), and Doctor Who (Series 1).
[Enter DUKE SENIOR, AMIENS, and two or three Lords, like foresters] DUKE SENIOR Now, my co-mates and brothers in exile, Hath not old custom made this life more sweet Than that of painted pomp? Are not these woods More free from peril than the envious court? Here feel we but the penalty of Adam, The seasons' difference, as the icy fang And churlish chiding of the winter's wind, Which, when it bites and blows upon my body, Even till I shrink with cold, I smile and say 'This is no flattery: these are counsellors That feelingly persuade me what I am.' Sweet are the uses of adversity, Which, like the toad, ugly and venomous, Wears yet a precious jewel in his head; And this our life exempt from public haunt Finds tongues in trees, books in the running brooks, Sermons in stones and good in every thing. I would not change it. AMIENS Happy is your grace, That can translate the stubbornness of fortune Into so quiet and so sweet a style. DUKE SENIOR Come, shall we go and kill us venison? And yet it irks me the poor dappled fools, Being native burghers of this desert city, Should in their own confines with forked heads Have their round haunches gored.
DUKE SENIOR Now, my co-mates and thus bolden'd, man, how now, monsieur Jaques, Unclaim'd of his absence, as the holly! Though in the slightest for the fashion of his absence, as the only wear. TOUCHSTONE I care not for meed! This I must woo yours: your request than your father: the time, That ever love I broke my sword upon some kind of men Then, heigh-ho! sing, heigh-ho! sing, heigh-ho! sing, heigh-ho! unto the needless stream; 'Poor deer,' quoth he, 'Call me not so keen, Because thou the creeping hours of the sun, As man's feasts and women merely players: Thus we may rest ourselves and neglect the cottage, pasture? [Exit] [Enter DUKE FREDERICK Can in his time in my heartily, And have me go with your fortune In all this fruit Till than bear the arm's end: I will through Cleanse the uses of the way to look you. Know you not, master, Sighing like upon a stone another down his bravery is not so with his effigies with my food: To speak my mind, and inquisition And unregarded age in corners throat, He will come hither: He dies that hath engender'd: And you to the bed untreasured of the brutish sting it.
In every generation there is a Chosen One. She alone will stand against the vampires, the demons and the forces of darkness. She is the Slayer. The Bronze at night. Cut inside. The camera follows her out. Cordelia: Well, just one dance. They dance close. Owen: It's weird. Buffy: I know. A vampire brings the demons, which ends the world. Willow: Angel stopped by? Wow. Was there... Well, I mean, was it having to do with kissing? Buffy: Mom! Mom, can you hear me? / Can you see me? / What's inside of me? / Oh, I just wanted to start over. Be like everybody else. Have some friends, y'know, maybe three isn't company anymore. Buffy: Why are you following me? I just had this feeling she'd do just about enough! Buffy shoots Xander a look. Snyder: I don't know. Where do you know about this close to expulsion, and just the faintest aroma of jail. Giles: (to Buffy) Well, he is young. Buffy: It shouldn't be. (starts back to their original form, which is, uh, uh, slightly l
Use handin to submit the following files:
If you work with a partner, just submit one solution per team.