CSCI 151 - Web Page Index
Indexing the web using AVL trees

Due before 10:00pm, Sunday, April 14


This assignment is the first part of a series of related assignments about the World Wide Web. Our ultimate goal is to build a search engine for a limited portion of the web, with some of the features of common search engines such as Yahoo!, Bing, or Google.

In this first step, you will use an AVL tree to create an index of all the words contained on a webpage. You will then be able to query the index to find out how frequently a word appears on a page and in what locations. Presumably, there is some correlation between the words on a page and that page's content, so a page on which a word appears frequently is probably a better match for that word than pages on which it rarely or never appears.

The goals of this lab are for you to:

As usual, you may work with a partner on this assignment.

Motivation

You probably use a search engine at least 10 times a day to dig through the vast amount of information on the web. Some folks don't even know how to get to their favorite web sites without using a search engine! Back in the days before search engines, folks had to rely upon browser bookmarks to keep track of their favorite sites, or even buy a book listing web sites. (I guess some folks still do!)

So what motivates today's search companies to spend so much time, effort, and money to give you good results to your search queries? Sure, some of the first search engines were the work of computer science folks doing nifty things and sharing them out of the goodness of their hearts (and some still do), but mostly it is done for profit these days. Along with your search results, the search engine also displays a bunch of advertisements related to your query. For every ad-click that you make, they get some money. Sometimes, they get paid just to display a URL at the top of the search results! The more relevant and high-quality the results a search engine gives you, the more likely you are to keep using it, which increases its chances to profit from you.

So the million dollar question is: How do the search engines produce their list of relevant URLs? Well, they don't share all the specifics, but we know a number of basic ideas that most of them use.

In this lab, we'll be trying out 2 techniques that are used by most search engines by building an index of the words on a page. For each word we encounter on a web page, we'll keep track of what order we encountered it in (0th, 1st, 2nd, 3rd, etc.) and keep a list of all the locations for each word.

The first query technique is related to the frequency of a query word on a given page. If you've got a page that has the word "monkey" occurring repeatedly, then it is quite likely to be about monkeys. And a page on which 10% of the words are "monkey" is probably more relevant to a query on "monkey" than a page on which only 0.5% of the words are.

The second technique will be to use the set of indexes to find phrases on a page. If someone is looking for "robot ninjas", then you would need to go through all of the locations where "robot" appears and check to see if "ninjas" is located in a spot one greater. Some search engines also support the ability to search for words NEAR each other, and you could do this too using the index, but it isn't required.
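To make the location arithmetic concrete, here is a minimal sketch of both checks on two lists of word positions. The class and method names (AdjacencySketch, followsDirectly, near) and the maxDistance parameter are made up for illustration and are not part of the starter code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the lab's starter code.
class AdjacencySketch {
    // True if some position in 'second' is exactly one past a position in 'first',
    // i.e. the two words occur back to back, in that order.
    static boolean followsDirectly(List<Integer> first, List<Integer> second) {
        for (int pos : first) {
            if (second.contains(pos + 1)) {
                return true;
            }
        }
        return false;
    }

    // One possible reading of NEAR: the words occur within maxDistance
    // positions of each other, in either order.
    static boolean near(List<Integer> a, List<Integer> b, int maxDistance) {
        for (int p : a) {
            for (int q : b) {
                if (p != q && Math.abs(p - q) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> robot = Arrays.asList(3, 17);   // made-up positions
        List<Integer> ninjas = Arrays.asList(9, 18);  // made-up positions
        System.out.println(followsDirectly(robot, ninjas)); // 17 is followed by 18
        System.out.println(near(robot, ninjas, 2));
    }
}
```

Both methods scan the lists linearly, which is fine for a single page; since the location lists are built in increasing order, a faster merge-style walk is also possible.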

Getting Started

Starting point code is in lab7.zip. We are using an external library, Jsoup, to parse HTML for us (it simplifies things greatly). You will need to add jsoup-1.8.3.jar to the build path in Eclipse. If you are working from the command line, you can compile and run things using the -classpath parameter.

% javac -classpath jsoup-1.8.3.jar:. HTMLScanner.java TestScanner.java

% java -classpath jsoup-1.8.3.jar:. TestScanner http://www.cs.oberlin.edu/

Part 1 - Getting input

Begin by experimenting with the HTMLScanner class and the associated test class TestScanner. HTMLScanner reads tokens one by one from a file, a web page (given its URL) or a string. TestScanner contains a main method designed to test the HTMLScanner.

HTMLScanner is designed to work similarly to the normal Scanner. You give the constructor a String representing the URL or file you want to read in; it then reads the file and lets you use hasNext() and next() to access the words on a page. I've also included features to iterate through the links on a page (hasNextLink() and nextLink()). The Jsoup HTML parser supports other features (keywords, title, etc.) which you can read about in Jsoup's API docs if you want.

HTMLScanner currently only returns contiguous blocks of alpha-numeric characters -- so "sing-song95" on a page will return "sing" and then "song95".
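The splitting rule can be imitated with a regular expression. This is only a sketch of the behavior described above, not how HTMLScanner is actually implemented; the class name TokenSketch and the method tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "contiguous blocks of alpha-numeric characters".
class TokenSketch {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        // Split on every run of characters that are not letters or digits.
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {   // a leading separator yields an empty first piece
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("sing-song95")); // the example from the text
    }
}
```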

TestScanner has one command-line argument, a string representing a URL. Try it out on a few URLs you are familiar with, such as "http://www.cs.oberlin.edu/" and "http://www.google.com/", or a filename such as "testscannerfile".

Part 2 - AVL Trees

First you'll be completing an implementation of TreeMap called MyTreeMap. Most of the implementation is already provided for you, but there are a few things you still need to finish. MyTreeMap is just an AVL tree in disguise, where the nodes in the tree contain "key-value" pairs. You place items in the tree ordered by their key, and this key has an associated value tagging along with it. Now, the key can be any reference type, and so can the value. Therefore, our MyTreeMap class will be parameterized by not one but two generic types: K for the key, and V for the value.

The methods you have to implement are listed below. You should peruse the class to see how it is implemented; it is not the same as the binary tree lab (Lab 6). In particular, a TreeMap contains a (key,value) pair, and a reference to its left and right subtrees (which are also TreeMaps). An empty TreeMap is one for which its left and right subtrees are null; a leaf TreeMap is one for which its left and right subtrees are non-null, but are themselves empty TreeMaps. You can explore this further by looking at the provided constructors.

Note that the generic type K of the key implements the Comparable interface, and therefore, you can (and should!) use the compareTo method to determine whether two keys are equal, or to determine their order.
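As a small illustration of steering a search with compareTo, the sketch below reports which way a search would go from a node. The class CompareSketch and the method whereToGo are made-up names, and the returned strings are just labels for the three outcomes.

```java
// Hypothetical helper showing how compareTo decides equality and order.
class CompareSketch {
    static <K extends Comparable<K>> String whereToGo(K searchKey, K nodeKey) {
        int cmp = searchKey.compareTo(nodeKey);
        if (cmp == 0) {
            return "here";                    // keys are equal: this node holds the mapping
        }
        return cmp < 0 ? "left" : "right";    // smaller keys live left, larger keys right
    }

    public static void main(String[] args) {
        System.out.println(whereToGo("apple", "monkey"));
        System.out.println(whereToGo(new String("monkey"), "monkey"));
    }
}
```

A compareTo result of 0 is the equality test to rely on; comparing two keys with == only checks whether they are the same object.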

private V get(K searchKey)
Return the current mapping of the given key, that is, the value associated with the provided searchKey.
If no mapping exists, return null.
We have already included the actual public method get(Object key) which takes care of the casting.
(The get(Object key) method is required for any TreeMap implementation.)
public V put(K key, V value)
Insert a (key, value) mapping into the map, ordered by its key.
If a mapping for this key already exists, the new value should replace the old value in the map.
The return value of put is the previous value for the key if there was one, or null if there was not.
Here is a sequence of operations assuming recursive implementation:
  1. If the key does not yet exist in the tree, add a node to the correct location in the tree as a leaf. That is, add the (key,value) pair to the TreeMap that reports itself as isEmpty() that you ended in during your search, and then make it into a leaf. In more detail, you should:
    1. Set the value of key
    2. Set the value of value
    3. Set the size of the tree appropriately...to 1
    4. Set this.height appropriately...to 0
    5. Set both left and right to be new empty MyTreeMaps (i.e., new MyTreeMap<K,V>())
    This is a base case, so you are done and can return null (since there is no previous value).
  2. If the key already exists, update its value. If this happens in the current location, you are done. If you have to ask a child to do this, save that value and then...
  3. Call restructure(this) if the tree is unbalanced.
  4. Call this.setHeight() to update the tree's height.
  5. Recalculate your size by adding 1 to the sum of the size of your children.
  6. Return the previous value of value
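As a rough illustration of the recursive steps above, here is a simplified sketch. It is not the provided MyTreeMap: the class name SketchTreeMap and its fields are hypothetical, restructure() is left out (so this is an ordinary unbalanced search tree), and it assumes an empty tree has height -1 so that a leaf comes out at height 0.

```java
// Simplified, hypothetical stand-in for MyTreeMap: no rebalancing.
class SketchTreeMap<K extends Comparable<K>, V> {
    K key;
    V value;
    int size = 0;
    int height = -1;                  // assumption: an empty tree has height -1
    SketchTreeMap<K, V> left, right;  // both null means this tree is empty

    boolean isEmpty() {
        return left == null && right == null;
    }

    V get(K searchKey) {
        if (isEmpty()) {
            return null;              // fell off the tree: no mapping
        }
        int cmp = searchKey.compareTo(key);
        if (cmp == 0) {
            return value;
        }
        return (cmp < 0 ? left : right).get(searchKey);
    }

    V put(K newKey, V newValue) {
        if (isEmpty()) {              // base case: turn this empty tree into a leaf
            key = newKey;
            value = newValue;
            size = 1;
            height = 0;
            left = new SketchTreeMap<>();
            right = new SketchTreeMap<>();
            return null;              // there was no previous value
        }
        int cmp = newKey.compareTo(key);
        if (cmp == 0) {               // key found here: replace the value
            V old = value;
            value = newValue;
            return old;
        }
        V old = (cmp < 0 ? left : right).put(newKey, newValue);
        // A real AVL tree would check its balance and restructure here.
        height = 1 + Math.max(left.height, right.height);
        size = 1 + left.size + right.size;
        return old;
    }
}
```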
private void restructure(MyTreeMap<K, V> node)
Rebalances the MyTreeMap rooted at node, if it is unbalanced.
The actual rotation is already implemented; that is, once it knows which subtrees need to be rotated, it will do it.
What you need to do is tell it which subtrees need to be rotated, giving values to variables a, b, c, t1, t2, t3, and t4 as described in the comments surrounding the code. Your code should match the pictures on p. 474 of the text.
You will do this by setting certain variables appropriately, as described in the comments.
The first case is done for you (when the left child is the tallest, and its left child is the tallest); you need to implement the other three cases.
The rest of the restructure method is already implemented for you.
Please have scratch paper with you on which to draw what is happening. Trying to figure it out all in your head is just asking for trouble.
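Before assigning any variables, it can help to decide which of the four cases you are in. The sketch below classifies an unbalanced node by following the taller child and then the taller grandchild. It uses a hypothetical minimal Node class rather than MyTreeMap, assumes a missing subtree has height -1, and deliberately does not assign a, b, c, or t1 through t4; that mapping is described in the starter code's comments.

```java
// Hypothetical illustration of picking the rotation case from subtree heights.
class RotationCase {
    static final class Node {
        final Node left, right;
        final int height;

        Node(Node left, Node right) {
            this.left = left;
            this.right = right;
            this.height = 1 + Math.max(h(left), h(right));
        }
    }

    // Assumption: a missing (null) subtree has height -1, so a leaf has height 0.
    static int h(Node n) {
        return n == null ? -1 : n.height;
    }

    static boolean isUnbalanced(Node n) {
        return Math.abs(h(n.left) - h(n.right)) > 1;
    }

    // Call only on an unbalanced node, so the taller child is never null.
    static String classify(Node z) {
        if (h(z.left) > h(z.right)) {
            Node y = z.left;
            return h(y.left) >= h(y.right) ? "left-left" : "left-right";
        } else {
            Node y = z.right;
            return h(y.right) >= h(y.left) ? "right-right" : "right-left";
        }
    }
}
```

The "left-left" result corresponds to the case the text says is already done for you (left child tallest, and its left child tallest); the other three results are the cases left to implement.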

JUnit testing

For testing, you should create a class called MyTreeMapTest.java that thoroughly tests the new methods that you implemented. Be sure to try examples that will require calls to each of the various configurations of restructure(). I strongly suggest drawing out your examples by hand rather than just making them up in your head. Don't forget to test the case where you overwrite a value for an already-existing key.

Be thorough with your tests because you want this tree to be working before you proceed!

Part 3 - MyTreeSet

Next you'll be implementing your own version of TreeSet<T> using a MyTreeMap<T, Boolean> as the backing storage. We won't be implementing a remove( ) method, but if we did you could "remove" an item from the set by setting its Boolean value to false.

Recall that in a Set, you only keep one copy of any item added. With a working MyTreeMap, implementing a MyTreeSet is pretty straightforward. Here are the methods you need to implement:

public MyTreeSet()
Create an empty Set by creating an empty MyTreeMap.
public boolean add(T item)
Add item to the set if it isn't already in there.
You should return true if the set changed, false otherwise.
Iterator<T> iterator()
Just return the inorder iterator keys("in") from MyTreeMap.
int size()
Return the size of the MyTreeMap.
void clear()
Clear the MyTreeMap.
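The wrapper idea can be sketched as follows. To stay self-contained, this uses java.util.TreeMap as a stand-in for MyTreeMap, so keySet().iterator() takes the place of keys("in"); the class name SketchTreeSet is hypothetical.

```java
import java.util.Iterator;
import java.util.TreeMap;

// Hypothetical set backed by a map from item to Boolean.
// java.util.TreeMap stands in for MyTreeMap here.
class SketchTreeSet<T extends Comparable<T>> implements Iterable<T> {
    private final TreeMap<T, Boolean> map = new TreeMap<>();

    public boolean add(T item) {
        // put returns the previous value, or null if the key was new,
        // so a null result means the set changed.
        return map.put(item, Boolean.TRUE) == null;
    }

    @Override
    public Iterator<T> iterator() {
        return map.keySet().iterator();   // keys in sorted (inorder) order
    }

    public int size() {
        return map.size();
    }

    public void clear() {
        map.clear();
    }
}
```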

JUnit testing

You should now create a file called MyTreeSetTest.java that contains JUnit tests for this class. As most of the methods you created are likely just a small wrapper around existing MyTreeMap methods, you will hopefully not run into too many issues while testing.

Part 4 - WebPageIndex

Now that you have a working MyTreeMap and MyTreeSet, you will use them to implement a data structure that will contain the index representation of a web page. You will use a MyTreeMap<String, LinkedList<Integer>> to keep track of the indexes of each word on a page, and a MyTreeSet to keep track of the links contained in the page. You should also keep track of the URL used to build the index and the total number of words on the page.

You will need the following public methods:

public WebPageIndex(String baseUrl)
Create an HTMLScanner from baseUrl. Keep a running counter of the number of words you run into when stepping through the page using next( ) and hasNext( ).
When you first encounter a word (i.e., it isn't in your MyTreeMap already), you should create a new LinkedList<Integer> containing the current index. If you've already seen the word, you should just add the current index onto the end of the existing List of locations.
Then you should step through the links using nextLink() and add each of them to your MyTreeSet. Note that you don't need to re-initialize the HTMLScanner; next( ) and nextLink( ) make use of different iterators.
Hint: converting all words to lower case using String.toLowerCase() is highly recommended.
public String getUrl()
Return the URL used to create this index.
public int getWordCount()
Return the count of the total number of words on the page (not just unique words).
public boolean contains(String s)
Return true if the word s appeared as text anywhere on the page.
public int getCount(String s)
Return the number of times the word s appeared on the page.
public double getFrequency(String s)
Return the frequency with which the word s appeared on the page (i.e., the count for that word divided by the total number of words on the page).
Be careful of integer division!
public List<Integer> getLocations(String s)
Return the List representing the locations where the word s appeared on the page (i.e., the value from MyTreeMap).
If s does not appear on the page, return an empty list, not null.
public Iterator<String> words()
Return an iterator over all the words on the page in alphabetical order.
Hint: your MyTreeMap already has something that will create this.
public String toString()
Just return the MyTreeMap's toString() value.
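The indexing loop and the word queries described above could look roughly like this. It is a sketch under stated assumptions: java.util.TreeMap and java.util.Scanner stand in for MyTreeMap and HTMLScanner, the input is a plain string rather than a URL, links are left out, and the class name SketchIndex is hypothetical. Lower-casing the query word as well as the indexed words is also an assumption.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;
import java.util.TreeMap;

// Hypothetical stand-in for WebPageIndex that indexes a plain string.
class SketchIndex {
    final TreeMap<String, LinkedList<Integer>> index = new TreeMap<>();
    int wordCount = 0;

    SketchIndex(String text) {
        Scanner in = new Scanner(text);   // splits on whitespace only, unlike HTMLScanner
        while (in.hasNext()) {
            String word = in.next().toLowerCase();
            LinkedList<Integer> locations = index.get(word);
            if (locations == null) {      // first time we see this word
                locations = new LinkedList<>();
                index.put(word, locations);
            }
            locations.add(wordCount);     // positions start at 0
            wordCount++;
        }
    }

    List<Integer> getLocations(String s) {
        List<Integer> locations = index.get(s.toLowerCase());
        if (locations == null) {
            return new LinkedList<Integer>();   // empty list, never null
        }
        return locations;
    }

    int getCount(String s) {
        return getLocations(s).size();
    }

    double getFrequency(String s) {
        if (wordCount == 0) {
            return 0.0;
        }
        // Cast first: int / int would truncate to 0 for almost every word.
        return (double) getCount(s) / wordCount;
    }
}
```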

Once you have those methods working, you should go on and implement the ability to look for phrases. To do this, you'll want to take a string and break it up along whitespace boundaries into individual words. Then check whether those words appear on the page consecutively, in the order given.

My suggestion is to use either s.split("\\s+") to turn the input into an array of Strings, or a Scanner to step through s. The String method split( ) takes a regular expression and uses it to split the string into an array of substrings. The regular expression "\\s+" matches any sequence of one or more whitespace characters. (You might be tempted to use s.split(" ") but this runs into trouble if there are several whitespace characters in a row, which is not uncommon.) Either way, you need to find the individual words of s. For each word there, create a parallel structure of Lists using getLocations (I had an array). Loop through the values for the first word and see if the next has a value 1 greater, the next 2 greater, etc. You only have a phrase match if every one has an appropriate value.

public boolean containsPhrase(String s)
Return true if the phrase s is in the web page.
public int getPhraseCount(String s)
Return the number of times the phrase s appears on the page.
public double getPhraseFrequency(String s)
Return the number of times the phrase s appears on the page divided by the total number of words on the page.
(Note: I'm open to suggestions on how to improve this metric.)
public List<Integer> getPhraseLocations(String s)
Return a List marking the starting point of each instance of phrase s on the page.
If the phrase does not occur, you should return an empty List, not null.
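The phrase search suggested above can be sketched as a single helper that returns the starting locations; the other three phrase methods can be derived from its result. The sketch takes a plain Map of word locations rather than a WebPageIndex, and the names PhraseSketch and phraseLocations are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds where a phrase starts, given each word's locations.
class PhraseSketch {
    static List<Integer> phraseLocations(Map<String, List<Integer>> index, String phrase) {
        List<Integer> starts = new ArrayList<>();
        String trimmed = phrase.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return starts;                       // no words, so no matches
        }
        // trim() first: a leading space would make split produce an empty first word.
        String[] words = trimmed.split("\\s+");

        // Parallel structure: locations.get(i) holds the positions of words[i].
        List<List<Integer>> locations = new ArrayList<>();
        for (String w : words) {
            locations.add(index.getOrDefault(w, Collections.emptyList()));
        }

        for (int start : locations.get(0)) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                // The i-th word must sit exactly i positions after the first word.
                match = locations.get(i).contains(start + i);
            }
            if (match) {
                starts.add(start);
            }
        }
        return starts;
    }
}
```

With this helper, containsPhrase amounts to checking that the returned list is non-empty, and getPhraseCount is its size. The contains calls are linear scans, which keeps the sketch short at the cost of speed on long pages.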

JUnit testing

As you might expect by now, you will need to create a WebPageIndexTest.java file that thoroughly tests your WebPageIndex objects. (A good habit to get into is to create this file early on and add tests as you add in individual features.)

WebPageIndex's main method

The main method should take an argument from the command line and build a WebPageIndex from it. Your main method should handle all exceptions and not display stack traces to the user. You should then display a list of all the words on the page, their frequencies, and their locations. Follow this up with a list of all the links that were on the page.
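One way to produce aligned lines in a word, frequency, locations layout is String.format. In the sketch below the column width of 16 and the four-space gap are guesses and may need adjusting to match the expected output; the names ReportSketch and formatLine are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical formatting helper for one line of the report.
class ReportSketch {
    static String formatLine(String word, double frequency, List<Integer> locations) {
        // %-16s left-justifies the word in a 16-character column (a guessed width);
        // %f prints six digits after the decimal point; the list prints as [a, b, c].
        return String.format(Locale.ROOT, "%-16s%f    %s", word, frequency, locations);
    }

    public static void main(String[] args) {
        System.out.println(formatLine("is", 0.25, Arrays.asList(3, 5, 10)));
    }
}
```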

Here are some sample outputs from my program:

% java -classpath jsoup-1.8.3.jar:. WebPageIndex testscannerfile

Frequency and index of words in testscannerfile
happening       0.083333    [4]
hi              0.083333    [0]
if              0.083333    [8]
important       0.083333    [7]
is              0.250000    [3, 5, 10]
it              0.166667    [6, 9]
tagged          0.083333    [11]
there           0.083333    [1]
what            0.083333    [2]

Links:
http://www.google.com/

% java -classpath jsoup-1.8.3.jar:. WebPageIndex http://www.cs.oberlin.edu/~rhoyle/19s-cs151/lab07/sample.html

Frequency and index of words in http://www.cs.oberlin.edu/~rhoyle/19s-cs151/lab07/sample.html
6               0.076923    [4, 11, 15]
a               0.051282    [6, 23]
be              0.025641    [17]
book            0.025641    [26]
children        0.025641    [24]
cow             0.102564    [27, 30, 33, 36]
for             0.051282    [2, 9]
from            0.025641    [22]
i               0.025641    [12]
if              0.025641    [14]
just            0.025641    [5]
lab             0.051282    [3, 10]
me              0.025641    [29]
milk            0.025641    [35]
moo             0.128205    [28, 31, 34, 37, 38]
on              0.025641    [19]
page            0.051282    [8, 21]
popular         0.025641    [18]
s               0.025641    [25]
sample          0.025641    [0]
short           0.025641    [7]
text            0.025641    [1]
that            0.025641    [20]
will            0.025641    [16]
wonder          0.025641    [13]
you             0.025641    [32]

Links:
http://www.cs.oberlin.edu/~rhoyle/19s-cs151/lab07/index.html
http://www.goodreads.com/book/show/926239.Cow_Moo_Me

Handin

Use the handin program to submit a directory containing the following:

  1. All .java files necessary for compiling your code (including any of the classes that I gave you that you use in your solution).
  2. The three requested JUnit test files
  3. A README file with:

If you work with a partner, please have only one of you submit your joint solution using handin.